5.1 Taking a look at data sets – Let’s find Maria
Let’s start by looking at a database. It is an imaginary database, but let’s say it holds the data of a social welfare program.
| Name | Age | Zip Code | Gender | Number of Children | School Degree | Average Income / Month | Social Benefit Received / Month |
|---|---|---|---|---|---|---|---|
| Ana Luisa Gil | 18 | 74923 | Female | 2 | - | 130 | 800 |
You can see that anyone who knows that a friend, a neighbour, an ex-wife, or an enemy receives social benefits from the state can easily find that person in this data set and learn more about them, such as the exact amount the person receives or how much the person earns. Presented in this form, the data set therefore poses a high risk once it gets into the wrong hands.
De-identification is a way of changing a data set so that it is no longer easy to identify the individuals in it. A first step of de-identification is pseudonymization, which means removing direct identifiers and replacing them with random data. As you learned in an earlier session, names and addresses are personally identifiable information and direct identifiers, because they can be linked back to the individual without any effort.
So let’s modify these direct identifiers in order to pseudonymize this data set:
Download the Excel sheet and perform the following tasks:
- Delete all the content in the address column.
- Now use this empty column to write down a random six-digit ID number for each row (these will later replace the names). It is important not to use sequential numbers like 1, 2, 3, 4, but numbers with more digits that you select randomly. Otherwise it will be very easy to re-link someone in the data set. (Usually, statistical tools can generate these random numbers for you.)
- You can now store a copy of the two columns (name, ID number) in a secure place, in case you ever need to link the information back.
- Now delete the “Name” column.
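If you prefer a script to a spreadsheet, the steps above can be sketched roughly as follows. This is only an illustration: the function name `pseudonymize`, the field names, and the second sample row are assumptions made for this sketch, and it presumes the address column has already been removed.

```python
import secrets


def pseudonymize(rows, name_field="Name", id_field="ID Number"):
    """Swap each name for a random six-digit ID.

    Returns (data_rows, key_rows). key_rows holds the ID-to-name link and
    belongs in a separate, secure location.
    """
    rng = secrets.SystemRandom()
    # sample() draws without replacement, so every row gets a distinct ID.
    ids = rng.sample(range(100000, 1000000), k=len(rows))

    data_rows, key_rows = [], []
    for row, new_id in zip(rows, ids):
        key_rows.append({id_field: new_id, name_field: row[name_field]})
        rest = {k: v for k, v in row.items() if k != name_field}
        data_rows.append({id_field: new_id, **rest})
    return data_rows, key_rows


# Illustrative rows only: the first uses values from the example row above,
# the second is made up for this sketch.
rows = [
    {"Name": "Ana Luisa Gil", "Age": 18, "Gender": "Female", "Number of Children": 2},
    {"Name": "Example Person", "Age": 52, "Gender": "Male", "Number of Children": 0},
]
data_rows, key_rows = pseudonymize(rows)
```

Here `data_rows` is the pseudonymized data set and `key_rows` is the name-to-ID key. Keeping them in two separate structures mirrors the advice to store the key somewhere other than the data.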
This is what the first row of the table should look like after step 2:
| Name | ID Number | Age | Zip Code | Gender | Number of Children | School Degree | Average Income / Month | Social Benefit Received / Month |
|---|---|---|---|---|---|---|---|---|
This is the information you can store as a key somewhere else in case you need to link back the name to the data set:
This is what your new data set should look like:
| ID Number | Age | Zip Code | Gender | Number of Children | School Degree | Average Income / Month | Social Benefit Received / Month |
|---|---|---|---|---|---|---|---|
Perfect, you have successfully pseudonymized your data set. Now, it is a lot harder to identify someone in the list. Pseudonymization is an easy first step to increase protection.
However, what happens if someone knows that Maria Sanchez is in the data set, and also knows that she is 41 and has three children? Sort the table by age and number of children, and see if you can find out how much Maria received from the government.
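The same lookup can be sketched in code, as a rough illustration of how little an attacker needs. The helper `find_matches` is hypothetical, and the rows below are invented stand-ins: the ID numbers and benefit amounts are placeholders, not values from the spreadsheet.

```python
def find_matches(rows, known):
    """Return every row whose fields equal all the known attribute values."""
    return [
        row for row in rows
        if all(row.get(field) == value for field, value in known.items())
    ]


# Invented stand-in rows; IDs and benefit amounts are placeholders.
rows = [
    {"ID Number": 481205, "Age": 41, "Gender": "Female", "Number of Children": 3, "Benefit": 650},
    {"ID Number": 902417, "Age": 41, "Gender": "Female", "Number of Children": 1, "Benefit": 400},
    {"ID Number": 315768, "Age": 29, "Gender": "Female", "Number of Children": 3, "Benefit": 700},
]

# What the attacker already knows about Maria.
known = {"Age": 41, "Gender": "Female", "Number of Children": 3}
matches = find_matches(rows, known)
print(len(matches), "matching row(s)")
```

If only one row matches, every other column in that row is revealed, even though the name is gone.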
It turns out that Maria is the only woman who matches these characteristics. So, although the data was de-identified, it is still possible to identify Maria in the data set. This is why pseudonymization is a first step, but far from enough to protect the individuals in a data set. Let’s try to add another layer of protection, using a technique that is part of anonymization:
Change the age into an age range (16-25, 26-35, 36-45, 46-55, 56-65).
Are you still able to find Maria?
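Replacing the exact age with a range can also be done in a few lines of code. The sketch below uses the ranges listed above; the function names and the `"other"` label for ages outside those ranges are assumptions, since the exercise does not say how to handle such ages.

```python
AGE_RANGES = [(16, 25), (26, 35), (36, 45), (46, 55), (56, 65)]


def to_age_range(age):
    """Map an exact age to its range label, e.g. 41 -> "36-45"."""
    for low, high in AGE_RANGES:
        if low <= age <= high:
            return f"{low}-{high}"
    # Assumption: the exercise does not cover ages outside 16-65.
    return "other"


def generalize_age(rows, age_field="Age"):
    """Return copies of the rows with the exact age replaced by its range."""
    return [{**row, age_field: to_age_range(row[age_field])} for row in rows]
```

With this mapping, Maria's age of 41 becomes "36-45", so she now shares her age value with everyone else in that range.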
As you can see, even after modifying the data to this extent, it is still possible to identify someone. This is because even indirect personal identifiers that are not unique to one person can allow you to identify someone in a data set. The more indirect personal identifiers a data set contains, the more likely it is that their combination is unique to a single individual.
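One way to check this for your own data is to count how many rows share each combination of indirect identifiers. The following is a minimal sketch with invented rows; the helper name `unique_combinations` and the field names are assumptions.

```python
from collections import Counter


def unique_combinations(rows, fields):
    """Return the rows whose combination of `fields` occurs exactly once."""
    def combo(row):
        return tuple(row[f] for f in fields)

    counts = Counter(combo(row) for row in rows)
    return [row for row in rows if counts[combo(row)] == 1]


# Invented stand-in rows with ages already generalized to ranges.
rows = [
    {"Age": "36-45", "Gender": "Female", "Number of Children": 3},
    {"Age": "36-45", "Gender": "Female", "Number of Children": 1},
    {"Age": "36-45", "Gender": "Female", "Number of Children": 1},
    {"Age": "26-35", "Gender": "Male", "Number of Children": 3},
]

by_age = unique_combinations(rows, ["Age"])
by_all = unique_combinations(rows, ["Age", "Gender", "Number of Children"])
print(len(by_age), len(by_all))
```

In this made-up sample, only one row is unique by age range alone, but two rows become unique once gender and number of children are added to the combination.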