5.2 Challenges and good practices of de-identification
Read through the following slides to learn why it is so difficult to properly de-identify and anonymise data, and which good practices you can apply to lower the risk. Afterwards, continue to a short quiz.
In 1996, the Massachusetts Group Insurance Commission released the health records of about 135,000 state employees and their families in an anonymised form. The only fields that still related to the individuals were ZIP code, birth date and sex. The researcher Latanya Sweeney analysed the data and crossed it with voter registration data. That way, Sweeney was, for example, able to re-identify the governor of Massachusetts in the dataset and could look up his health records. Thanks to this famous case of anonymisation gone wrong, we now know that 87% of the US population is uniquely identifiable by the combination of the three indirect identifiers ZIP code, birth date and sex.
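A minimal sketch of how such a linkage attack works, using hypothetical column names and made-up example rows (not the real GIC or voter data): an "anonymised" health table is joined with a public voter roll on the three shared quasi-identifiers.

```python
import pandas as pd

# Hypothetical "anonymised" health data: names removed, but the
# quasi-identifiers ZIP code, birth date and sex remain.
health = pd.DataFrame({
    "zip_code":   ["02138", "02139"],
    "birth_date": ["1945-07-31", "1962-01-15"],
    "sex":        ["M", "F"],
    "diagnosis":  ["hypertension", "asthma"],
})

# Hypothetical public voter registration data, which includes names.
voters = pd.DataFrame({
    "name":       ["John Doe", "Jane Roe"],
    "zip_code":   ["02138", "02139"],
    "birth_date": ["1945-07-31", "1962-01-15"],
    "sex":        ["M", "F"],
})

# Joining on the shared quasi-identifiers re-attaches names to diagnoses.
reidentified = health.merge(voters, on=["zip_code", "birth_date", "sex"])
print(reidentified[["name", "diagnosis"]])
```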
There are different risks that can arise:
- Identification: You already learned about identification above in the case of the health care database. Basically, it means that you are able to find and identify someone in a dataset.
- Attribution: Even if you are not able to identify an individual in a dataset, you might be able to make attributions about someone that become a risk for that person. Look, for example, at the table below: if you know a woman in this table, you will know that she receives at least a medium level of government benefits, even though you are not able to identify her (see the sketch after the table).
[Table: government benefits received (High / Medium / Low), broken down by sex]
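A minimal sketch of how you could test for this kind of attribution risk in code, assuming a pandas DataFrame with hypothetical columns `sex` and `benefit_level` standing in for the table above:

```python
import pandas as pd

# Hypothetical micro-data matching the table: no names, only group
# membership and the sensitive attribute.
df = pd.DataFrame({
    "sex":           ["F", "F", "F", "M", "M", "M"],
    "benefit_level": ["High", "Medium", "High", "Low", "Medium", "High"],
})

# For each group, list the distinct values of the sensitive attribute.
# If a group maps to only one value (or a narrow range of values),
# anyone who knows a person belongs to that group can attribute the
# value to them, without ever identifying the person in the data.
print(df.groupby("sex")["benefit_level"].unique())
# Every woman here receives a High or Medium benefit, so knowing a
# woman is in the data reveals she receives at least a medium benefit.
```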
If you want to anonymise a dataset, it is crucial that you understand your data. For example, you have to know
- whether there are special or unique cases,
- which combinations of variables are risky,
- what information is sensitive.
In small datasets you might be able to "eyeball" potential challenges: by looking at the data and using, for example, the sorting functions in Excel, you might see that only a few individuals belong to a certain age group. For bigger datasets, descriptive statistics and visualisations can help you identify risky data, as in the sketch below.
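A minimal sketch of such a descriptive check, assuming a pandas DataFrame with hypothetical quasi-identifier columns `zip_code`, `birth_year` and `sex` (the smallest group size is often called the dataset's k-anonymity level):

```python
import pandas as pd

# Hypothetical dataset with three quasi-identifiers.
df = pd.DataFrame({
    "zip_code":   ["02138", "02138", "02139", "02139", "02139"],
    "birth_year": [1950, 1950, 1962, 1962, 1931],
    "sex":        ["M", "M", "F", "F", "F"],
})

# Count how many records share each combination of quasi-identifiers.
group_sizes = df.groupby(["zip_code", "birth_year", "sex"]).size()

# Records in a group of size 1 are unique and therefore easy to re-identify.
print("Smallest group (k):", group_sizes.min())
print("Unique records:")
print(group_sizes[group_sizes == 1])
```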
Generally, some kinds of information are more likely to cause risks, and you should consider carefully how you will store, share or open them. Those red flags are:
- Sensitive data
- Very small datasets
- Datasets that contain information about small communities
Always ask yourself whether you really need to publish these datasets or a specific data column. The dataset might actually have the same value if you deleted some of the information.
The challenge with de-identification is that it is never enough to assess the risk of one dataset in isolation. The information contained in one dataset can turn into a risk because of other available data that can be crossed with it and hence lead to re-identification. This can be:
- Background knowledge someone might have about a person in a dataset
- Information that is available to certain actors (e.g. an insurance company)
- Information that is accessible in other open datasets (as in the Massachusetts health data example above)
- Data that is openly available, for example through social media (e.g. because an Instagram post includes a geo-reference)
Today it is becoming more and more difficult to control the risk a dataset poses, because ever more information is openly available online that can potentially be linked to it.
In session 2 we discussed potential threats. A potential perpetrator could be, for example, a jealous ex-husband or a criminal. With some technical knowledge, they might be able to re-identify individuals in the data and learn something about their victim.
However, in a growing number of cases it will rather be an algorithm that automatically crosses and analyses datasets, for example to re-identify an individual across datasets, build a profile of that person, and sell it to an advertiser.
This is why anonymisation is a heavily context-dependent process: only by considering the data and its environment as a total system can one come to a well-informed decision about whether and what kind of anonymisation is needed. Good techniques are important, but without a full understanding of the context, the application of complex disclosure control techniques is useless.
When modifying datasets, important information might get lost. This is called the privacy-utility trade-off, and it is another reason why understanding the context of the data is so important. Generally, as a rule of thumb, it is not recommended to share or open up datasets that contain direct identifiers. But there can always be exceptions: for example, if a dataset serves to detect corruption or misspending, its value for the public would be very high, and sharing this information might not only be justified but even necessary. The sketch below illustrates the trade-off.
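A minimal sketch of the trade-off, reusing the hypothetical quasi-identifier columns from above: generalising values (truncating ZIP codes, bucketing birth years into decades) raises the smallest group size k, but costs precision for any analysis that needs the exact values.

```python
import pandas as pd

df = pd.DataFrame({
    "zip_code":   ["02138", "02139", "02141", "02142"],
    "birth_year": [1950, 1952, 1961, 1968],
    "sex":        ["F", "F", "M", "M"],
})

def smallest_group(frame, cols):
    """k-anonymity level: size of the smallest quasi-identifier group."""
    return frame.groupby(cols).size().min()

# Every record is unique on the raw quasi-identifiers: k = 1.
print("k before:", smallest_group(df, ["zip_code", "birth_year", "sex"]))

# Generalise: keep only the first three ZIP digits, group birth years by decade.
df["zip3"] = df["zip_code"].str[:3]
df["birth_decade"] = (df["birth_year"] // 10) * 10

# Privacy improves (k = 2), but exact ZIP codes and years are lost.
print("k after:", smallest_group(df, ["zip3", "birth_decade", "sex"]))
```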
In some cases you might realise that de-identification is very difficult, and that the data would need to be changed to such an extent that it becomes useless. In these cases a better option might be to:
- Release the data only to certain groups
- Give restricted access to data banks
- Change the granularity and provide only aggregated data (see the sketch after this list)
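A minimal sketch of the aggregation option, assuming hypothetical record-level columns `district` and `benefit_level`: instead of publishing individual rows, publish counts per group and suppress cells below a minimum size.

```python
import pandas as pd

# Hypothetical record-level data that is too risky to publish as-is.
df = pd.DataFrame({
    "district":      ["North", "North", "North", "South", "South"],
    "benefit_level": ["High", "High", "Medium", "Low", "Low"],
})

# Publish only counts per district and benefit level, not individual records.
table = df.groupby(["district", "benefit_level"]).size().reset_index(name="count")

# Suppress small cells: a count of 1 can still single out an individual.
MIN_CELL = 2  # illustrative threshold; real thresholds depend on context
table = table[table["count"] >= MIN_CELL]

print(table)
```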