5.1 Taking a look at data sets – Let’s find Maria

Let’s start by looking at a database. It is imaginary, but let’s say it holds the data of a social welfare program.

Name Age Zip Code Gender Number of Children School Degree Average Income / month Social Benefit received / Month
André Müller 41 67593 Male 3 Secondary 730 400
Eva Nuñez 22 93648 Female 1 Primary 320 300
Rick Geary 49 54930 Male 6 Secondary 1300 639
Philippe Lourence 25 3682 Male 2 Primary 683 280
Maria Sanchez 41 53782 Female 3 Primary 490 500
Mike Naumann 33 74823 Male 5 Primary 1254
Karl Kutter 23 84944 Male 3 University 982 320
Nina King 28 84934 Female 4 Secondary 730 520
Brain Right 24 23989 Male 2 1200
Nigel Winter 37 74849 Male 3 Secondary 821 426
Ana Luisa Gil 18 74923 Female 2 130 800
Catalina Florez 36 16383 Female 1 Primary 647 101

 

You can see that anyone who knows that a friend, a neighbour, an ex-wife, or an enemy receives social benefits from the state can very easily find that person in this data set and learn more, such as the actual amount the person receives or how much the person earns. Hence, this data set, as it is presented here, poses a high risk if it gets into the wrong hands.

De-identification is a way of changing a data set so that it is no longer easy to identify individuals in it. A first step of de-identification is pseudonymization. Pseudonymization means removing direct identifiers and exchanging them for random data. As you already learned in an earlier session, names and addresses are personally identifiable information and direct identifiers, because they can be linked back to the individual without any effort.

So let’s modify these direct identifiers in order to pseudonymize this data set:

Download the Excel sheet and perform the following tasks:

  1. Delete all the content in the address column.
  2. Use this now-empty column to write down a random six-digit ID number for each row (these will later replace the names). It is important not to use a simple sequence like 1, 2, 3, 4, but longer numbers that you select randomly. Otherwise it will be very easy to re-link someone in the data set. (Statistical tools can usually generate these random numbers for you.)
  3. Store the two columns (name, ID number) in a secure place, in case you ever need to link the information back.
  4. Now delete the column “name”.
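
If you prefer a script to a spreadsheet, the four steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names (`name`, `age`, `gender`, `children`) and the function `pseudonymize` are made up for this example, and only three rows of the table are included. The standard-library `secrets` module is used to draw the random numbers.

```python
import secrets

# A few toy records taken from the table above (only some columns, for brevity).
records = [
    {"name": "André Müller", "age": 41, "gender": "Male", "children": 3},
    {"name": "Eva Nuñez", "age": 22, "gender": "Female", "children": 1},
    {"name": "Maria Sanchez", "age": 41, "gender": "Female", "children": 3},
]


def pseudonymize(rows):
    """Return (key_table, pseudonymized_rows) for a list of record dicts."""
    used_ids = set()
    key_table = []
    output = []
    for row in rows:
        # Draw a random six-digit ID (100000-999999); retry if already taken.
        while True:
            pid = 100000 + secrets.randbelow(900000)
            if pid not in used_ids:
                used_ids.add(pid)
                break
        # The key table (name, ID) is what you store separately and securely.
        key_table.append({"name": row["name"], "id": pid})
        # Copy every field except the direct identifier, then attach the ID.
        clean = {k: v for k, v in row.items() if k != "name"}
        clean["id"] = pid
        output.append(clean)
    return key_table, output


key_table, pseudonymized = pseudonymize(records)
```

Here `key_table` corresponds to the two columns stored in step 3, and `pseudonymized` is the data set without names from step 4.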

This is what the first row of the table should look like after step 2:

 

This is the information you can store as a key somewhere else in case you need to link back the name to the data set:

 

This is what your new data set should look like:

 

Perfect, you have successfully pseudonymized your data set. Now, it is a lot harder to identify someone in the list. Pseudonymization is an easy first step to increase protection.

However, what happens if someone knows that Maria Sanchez is in the data set, and also knows that she is 41 and has three children? Sort the table by age and number of children, and see if you can find out how much Maria received from the government.
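
The same lookup can be written as a small filter. The sketch below uses the four records from the table whose age is between 36 and 41; the ID values and the field names are invented for illustration, and `find_matches` is a hypothetical helper, not part of any tool.

```python
# Pseudonymized rows; the ID values are made up for this example.
rows = [
    {"id": 482913, "age": 41, "gender": "Male", "children": 3, "benefit": 400},
    {"id": 730165, "age": 41, "gender": "Female", "children": 3, "benefit": 500},
    {"id": 159804, "age": 37, "gender": "Male", "children": 3, "benefit": 426},
    {"id": 904377, "age": 36, "gender": "Female", "children": 1, "benefit": 101},
]


def find_matches(rows, **known):
    """Return the rows whose fields equal every attribute the attacker knows."""
    return [r for r in rows if all(r[k] == v for k, v in known.items())]


# What the attacker knows about Maria: a 41-year-old woman with three children.
matches = find_matches(rows, age=41, gender="Female", children=3)
print(len(matches), [m["benefit"] for m in matches])  # prints: 1 [500]
```

A single match means the record can be attributed to one person even though the name is gone.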

It turns out that Maria is the only woman who meets these characteristics. So, although the data was de-identified, it is still possible to identify Maria in the data set. This is why pseudonymization is a first step, but far from enough to protect the individuals in a data set. Let’s try to add an additional layer of protection by using a technique that is part of anonymization:

Change the age into an age range (16-25, 26-35, 36-45, 46-55, 56-65).
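
In a script, this generalization could look like the following minimal sketch. The ranges are the ones listed above; the function name `age_range` and the choice to return `None` for ages outside those ranges are assumptions made for this example.

```python
# The age ranges used in this exercise, as (lowest age, highest age) pairs.
AGE_RANGES = [(16, 25), (26, 35), (36, 45), (46, 55), (56, 65)]


def age_range(age):
    """Map an exact age to its range label, e.g. 41 -> '36-45'."""
    for low, high in AGE_RANGES:
        if low <= age <= high:
            return f"{low}-{high}"
    return None  # assumption: ages outside the listed ranges are left unlabelled


ages = [41, 22, 49, 25, 18]
print([age_range(a) for a in ages])
# prints: ['36-45', '16-25', '46-55', '16-25', '16-25']
```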

Are you still able to find Maria?

As you can see, even after modifying the data to such an extent, it is still possible to identify someone. This is because even indirect personal identifiers that are not unique to one person can allow you to identify someone in a data set. And the more indirect personal identifiers a data set contains, the more likely it is that their combination is unique to only one individual.
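
One way to see this for the whole table is to count how many records share each combination of indirect identifiers. The sketch below does this for the twelve records above, using age range, gender, and number of children; the tuple layout is an assumption chosen for this example.

```python
from collections import Counter

# (age range, gender, number of children) for each of the twelve records,
# in the same order as the table above.
rows = [
    ("36-45", "Male", 3),    # André Müller
    ("16-25", "Female", 1),  # Eva Nuñez
    ("46-55", "Male", 6),    # Rick Geary
    ("16-25", "Male", 2),    # Philippe Lourence
    ("36-45", "Female", 3),  # Maria Sanchez
    ("26-35", "Male", 5),    # Mike Naumann
    ("16-25", "Male", 3),    # Karl Kutter
    ("26-35", "Female", 4),  # Nina King
    ("16-25", "Male", 2),    # Brain Right
    ("36-45", "Male", 3),    # Nigel Winter
    ("16-25", "Female", 2),  # Ana Luisa Gil
    ("36-45", "Female", 1),  # Catalina Florez
]

# Count how many records share each combination of indirect identifiers.
counts = Counter(rows)

# A combination that occurs only once points to exactly one person.
unique = [combo for combo, n in counts.items() if n == 1]
print(len(unique), "of", len(rows), "records are unique")
# prints: 8 of 12 records are unique
```

Even with age ranges instead of exact ages, Maria's combination ("36-45", "Female", 3) occurs only once, which is why she can still be found.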