#GDPR is coming Security is a feature #2 Data Masking

#GDPR is coming Security is a feature #2 Data Masking

In the context of GDPR, data masking should be done so that the all the data attributes on a person should not be able to define that person. Most of us know this, but we rarely articulate it; as you add more attributes to data it becomes much easier to define a person.

Use case: my full name is Robert Price Lockard, but let’s say there are a whole bunch of Roberts’ in the universe, therefore Robert does not uniquely identify me. Now let’s start adding attributes.

Let’s go further. I live in the United States of America; that narrows it down even more. Here is the frequency distribution of the top 10 names in the USA. So, I’m one of 5,497,484 Roberts’ in the US.

Rank Name Frequency
1 John 7,556,152
2 Marry 7,474,295
3 James 5,714,116
4 Robert 5,497,484
5 Michael 4,942,065
6 Christopher 4,747,669
7 William 4,665,950
8 Joseph 4,619,701
9 Elizabeth 4,270,062
10 Richard 4,109,367

I am a Pilot, okay, we’ve narrowed down the universe of Roberts’ to Roberts who are a pilot. There are still quite a few Roberts’ out there who are pilots.

Okay, so we have this big universe of Roberts’, how about we add my birth date of May 26, 1960. (Yea’ I’m pretty old). Now we are starting to narrow it down a bit, but in 1960, Robert was a very popular name and still is, so there are still quite a few Roberts’ in the population of Pilots and born May 26, 1960.

I live in the state of Maryland; we are narrowing down down the universe of Roberts’ even more. Say I purchase a lot of stuff from Aircraft Spruce and Specialty Company that would only be used for a 1948 Ryan Navion. With this information, you can likely make an educated guess to uniquely identify me.

Now let’s add in my zip code (postal code) of 21060. You just identified me. This is where things start to get tricky (or interesting depending on your point of view.)

Robert — there are a bunch

Pilot — there still are a bunch

USA — fewer, but still a bunch

Date of Birth — narrowed it down, but still not enough info to identify me.

State of Maryland — narrowing it down quite a bit, and most likely take an educated guess.

Zip code 21060 — nailed it, now you know who I am..

Now when we start thinking about masking data in the context of GDPR, we are also going to need to look at the type of business we are dealing with. If you are a bank, you may be able to mask down to zip code, this is because the vast majority of people in a zip code (postal code) have bank accounts.

But if you are a company like Aircraft Spruce and Specialty company (that gets a lot of my money) you should mask zip code and even go as far as masking the state. Why, because if you look at my purchase history from Aircraft Spruce, you would be able to determine the type of aircraft I own. And because my aircraft is pretty rare (I don’t know of any other 1948 Ryan Navions’ in the state of Maryland) it would be easy to determine who I am based on my purchase history.

Why would we mask data anyway? If you are using production to refresh lower environments; then you really need to start masking your data. This has as much to do with GDPR and your overall security profile; because guess what; when attacking a system only an amateur would go after your production system first. A hacker is coming for your DEV and TEST systems first, these environments are real noisy and it’s quite easy to hide in them for months on end. So if you are not masking your data in lower environments, then start doing it now. The other reason is under the “right to be forgotten.” It would be much easier to clean your production data of a person then to have to go through all of your lower environments to find all instances of a person to clean.


Name Frequency Distribution

This entry was posted in Database Stuff. Bookmark the permalink.

Leave a Reply