Unstructured data vs Personal Information


With the start of GDPR enforcement getting so close that it is on the same calendar page as today, we're all being reminded how much personal information is scattered through our organizations and databases.

Auditing a relational database for personal information, PI, is typically a process of pulling out the database schemas and examining the structures to make sure there's no unnoticed column which may be harboring data of a personal nature. Just make sure you have good tools to dump and search those definitions, or better still, have the original source that created them.
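As a rough illustration of searching schema definitions, here is a minimal Python sketch. It uses SQLite's built-in catalog purely because it runs anywhere; the table names, the column names and the list of suspect name fragments are all hypothetical, and on another database you would query its own catalog (for example `information_schema.columns`) instead.

```python
import sqlite3

# Hypothetical name fragments that hint at personal information.
SUSPECT_FRAGMENTS = ("name", "email", "phone", "address", "dob", "ssn")

def suspect_columns(conn):
    """Return (table, column) pairs whose column name contains a suspect fragment."""
    hits = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        for row in conn.execute(f'PRAGMA table_info("{table}")'):
            column = row[1]
            if any(fragment in column.lower() for fragment in SUSPECT_FRAGMENTS):
                hits.append((table, column))
    return hits

# Illustrative in-memory schema, not taken from any real application.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, full_name TEXT, contact_email TEXT, notes TEXT)"
)
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
found = suspect_columns(conn)
```

A name-based scan like this only catches columns that are honestly labeled. A free-text column such as `notes` passes straight through, which is the same problem unstructured data poses on a larger scale.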

But, it's 2018 and the one thing we've noticed that makes the process of tracking down personal information in databases harder is the use of unstructured data like JSON documents.

The unstructured dimension

Consider, for example, a customer management application that logs all of the customer's interactions with the business. The first pass of checking the database there will likely turn up the obvious PI: names, addresses, phone numbers.

But if there's a hint of unstructured data, you've still got a long way to go. This doesn't just apply to NoSQL databases which store JSON documents either. Unstructured data can also be held in relational databases in arrays, hash stores, and even JSON fields; they are all part of a modern relational database.

The problem with unstructured data is that its strength in flexibility also means that personal information could be anywhere in a document or within a record. And it's not just about how you currently use the data.

The documents in a JSON document store or held in JSON datatypes in a relational store reflect all the changes that have happened over the entire lifetime of the system. For example, our customer management system might have once recorded incoming call numbers and times in a section of a JSON document attached to the customer's account. Since that feature was abandoned, it's no longer in current use, but inside the JSON documents of customers who had their numbers recorded, that information persists. (As an aside, this is a useful reminder to clean up your dataset if you do remove features from your application).

Near-lost data like this is unlikely to be clearly labeled either, so just scanning the keys of a JSON document wouldn't be enough to smoke it out. You'd likely have to search for values which appear to contain phone numbers, email addresses, social security numbers and other recognizable sub-structures of data to locate it.
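A value search of that kind might look something like the following Python sketch. The regular expressions are deliberately loose and purely illustrative, the sample document and its field names are invented, and only string values are checked, so numbers stored as integers would be missed.

```python
import json
import re

# Loose, illustrative patterns; a real audit needs locale-aware rules.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def find_pi(node, path="$"):
    """Yield (path, kind, value) for every string value matching a pattern."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_pi(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_pi(value, f"{path}[{index}]")
    elif isinstance(node, str):
        for kind, pattern in PATTERNS.items():
            if pattern.search(node):
                yield (path, kind, node)

# Invented document: the "legacy" section stands in for an abandoned feature.
doc = json.loads("""
{
  "account": 1017,
  "prefs": {"theme": "dark"},
  "legacy": {"log": [{"src": "+1 555 010 9999", "at": 1456827720}]},
  "memo": "invoice copy to pat@example.com"
}
""")
hits = list(find_pi(doc))
```

Because the walk ignores key names entirely, it finds the phone number under the unhelpfully named `src` key. Expect false positives from patterns this loose; a date string such as `2016-03-01`, for instance, would satisfy the phone pattern, so treat the hits as candidates for review rather than confirmed PI.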

The other, other PI

Other personal information may be found in attachments as well. Look out for image attachments which record correspondence for example. Those attachments may be hard to process but they can be regarded as PI if they include protected details.

And if JSON documents and attachments aren't enough to be going on with, consider, say, a Redis database being used as a persistent fast cache for user details. In that scenario, you would have to search all the values in the store for PI if you think it might have got in there as part of a caching operation.

You may even have to search unexpected places for PI. For example, keys in a key/value store can end up holding PI. What was once a username used to key data became an email address as the developers refined how people interacted with the system. Now though, there are email addresses in the keys of the database. You don't want embedded PI, but that's how you get embedded PI.
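Checking keys can reuse the same pattern-matching idea. The sketch below is a minimal Python example that assumes you can iterate over the store's keys by some means; the plain list of sample keys is a hypothetical stand-in for that iteration, and the key naming scheme is invented.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def keys_with_email(keys):
    """Return the keys that contain something shaped like an email address."""
    flagged = []
    for key in keys:
        # Key/value stores often hand keys back as bytes; decode defensively.
        if isinstance(key, bytes):
            text = key.decode("utf-8", errors="replace")
        else:
            text = str(key)
        if EMAIL.search(text):
            flagged.append(text)
    return flagged

# Hypothetical keys: an older username-based key next to a newer email-based one.
sample_keys = [b"user:jsmith:profile", b"user:pat@example.com:profile", "session:8f3a"]
flagged = keys_with_email(sample_keys)
```

On a large store, walking every key is itself an expensive operation, so a scan like this has the same impact on a live system as any other full search.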

What this all comes down to is some very complex searches through text fields and JSON documents. Complex and, most likely, unindexed searches. And those are the kinds of queries that can really interfere with your production databases.

Enter the restored backup

The ideal way to solve this would be to have a duplicate database server to run your PI auditing on, where you can record all the incidents of PI that your complex queries can dig out, without putting a load on your production database. Of course, most people don't set up a perfectly parallel database clone so the other way to do this is to take a recent backup and restore that into a new instance of the database.

Compose users can make use of our restore from backup to create a new deployment, ready loaded with their data from a selected scheduled backup or on-demand backup. It'll be up and running automatically in no time, and you'll be able to do that PI audit, find the problematic data and prepare a plan to update your production systems. And when you are done, you can just delete the new deployment and have only paid for the hours or days it existed.

Lessons to be learned

Although unstructured data can be a blessing for a number of use cases and during the rapid development of a new application, it also comes with responsibilities. Looking back, you need to know what PI is already in your documents and unstructured fields so you can report on it.

Looking forward, you'll need to be able to say that you are accounting for any use or presence of PI in your databases and throughout your processes. Having an easily duplicated database is just one tactic you can employ to make sure accounting for PI in your systems doesn't impact on production.


attribution Carson Arias

Dj Walker-Morgan
Dj Walker-Morgan was Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets.
