This post is also available in: Español
The Spanish Data Protection Agency (AEPD – Agencia Española de Protección de Datos) published a note in June summarizing a series of recommendations for data controllers and processors who perform anonymization processes with data sets.
The Article 29 Working Party, now replaced by the European Data Protection Board (EDPB), had already published an Opinion in April 2014 analyzing a variety of methods and techniques for anonymizing personal data (including k-anonymity). In 2016, the AEPD also published its Guidelines and Guarantees for Personal Data Anonymization Procedures, which examined all the steps to be taken before, during and after data anonymization.
With this note, the AEPD establishes the need to implement a series of safeguards to preserve data privacy and protection, mainly as a result of the increasingly common use of techniques related to big data, artificial intelligence and machine learning. It does so by developing one of the possible anonymization techniques used to manage the risk of re-identification, known as k-anonymity.
First, the AEPD classifies the various types of personal data based on the degree of association between the data and the data subject:
1. Key identifiers or attributes: data that uniquely identifies each subject, for example, the National Identification Document number.
2. Quasi-identifiers or indirect identifiers: data that on their own do not uniquely identify an individual but that, combined with other data of the same category, may do so, for example, age when related to address.
3. Sensitive attributes: data that could have a major impact on an individual’s privacy, such as special data categories (for example, health-related data) that should not be linked to the data subject.
While there are various tools for carrying out the anonymization process, all attempt to achieve the same legal result: dissociating the identifiers, quasi-identifiers and sensitive attributes to prevent a data subject from being re-identified.
Within this context, there are times when data that was previously anonymized is at risk of being “de-anonymized” through re-identification of the individual by grouping the data in specific ways and by crossing it with other sources of information, even in relation to special data categories. Together with other techniques, K-anonymity is intended to reduce this risk. It is defined by the AEPD as follows:
“K-anonymity is a property of anonymized data that quantifies the extent to which the anonymity of the data subjects present in a data set has been preserved, even after the identifiers have been removed. In other words, it is a measure of the risk that some external agent could obtain personal information from anonymized data.” (AEPD)
This is, therefore, a method that does not alter the anonymized data itself, but that can help quantify, by means of an algorithm, the risk of a third party “re-identifying” data subjects using data that had initially been anonymized.
The underlying rule is deceptively simple. Applying this technique, an individual is k-anonymous when, for any combination of quasi-identifier attribute values, there are at least k-1 other individuals who share the same values for those attributes. K-anonymity works by focusing on the quasi-identifier attributes that allow such a link to be made.
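In practical terms, the k value of a data set can be computed by grouping records by their quasi-identifier values and taking the size of the smallest group. A minimal sketch in Python, where the records, attribute names and values are invented purely for illustration:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the k value of a data set: the size of the smallest
    group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical records: age and postal code act as quasi-identifiers.
records = [
    {"age": 34, "postcode": "28001", "diagnosis": "A"},
    {"age": 34, "postcode": "28001", "diagnosis": "B"},
    {"age": 41, "postcode": "28002", "diagnosis": "A"},
    {"age": 41, "postcode": "28002", "diagnosis": "C"},
]

print(k_anonymity(records, ["age", "postcode"]))  # 2
```

Here each quasi-identifier combination appears twice, so the data set is 2-anonymous: any attacker who knows an individual's age and postal code can narrow them down to, at best, a group of two records.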
Therefore, a higher k value corresponds to a higher degree of anonymity, so a high value for k reduces the risk of re-identification. The most widely used k-anonymization methods are:
• K-anonymization by generalization: transforming or generalizing the value of the quasi-identifier attributes so that they lose precision (by creating ranges in the case of numeric attributes or establishing hierarchies for nominal attributes). For example, instead of establishing a specific age, an age range would be introduced.
• K-anonymization by suppression: deleting or removing quasi-identifier attributes, so that they do not “contaminate” or otherwise affect the data set and distort the results (mainly with regard to the records that fall outside the range established by generalization or records with very unusual values).
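The two methods above can be combined: generalize the quasi-identifiers first, then suppress the records that remain outliers. A hedged sketch, again with invented attribute names and values:

```python
from collections import Counter

def generalize_age(age, width=10):
    """Generalization: replace an exact age with a range, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def k_anonymize(records, quasi_identifiers, k):
    """Generalize the age attribute, then suppress any record whose
    quasi-identifier combination still occurs fewer than k times."""
    generalized = [dict(r, age=generalize_age(r["age"])) for r in records]
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in generalized)
    return [r for r in generalized
            if counts[tuple(r[q] for q in quasi_identifiers)] >= k]

records = [
    {"age": 34, "postcode": "28001", "diagnosis": "A"},
    {"age": 36, "postcode": "28001", "diagnosis": "B"},
    {"age": 41, "postcode": "28002", "diagnosis": "A"},
]
result = k_anonymize(records, ["age", "postcode"], k=2)
# The first two records are generalized to the age range "30-39" and kept;
# the third remains unique after generalization, so it is suppressed.
```

Both steps trade data utility for privacy: wider age ranges and more suppressed records mean a higher k, but also a less precise data set.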
Both methods involve varying degrees of alteration or distortion during anonymization of the data; however, they may still fail to protect the privacy of sensitive data or to prevent links from being established between data sets. For this reason, refinements of k-anonymity and other privacy techniques are available (e.g., p-sensitive k-anonymity, l-diversity, t-closeness and δ-disclosure). Despite the application of these methods and techniques, the AEPD also emphasizes that data controllers have a duty to safeguard the privacy of data subjects and that, to the extent that risks remain, anonymization cannot be limited to the routine, passive application of rules. Instead, the risks in each case must be analyzed, and the quasi-identifier attributes must be selected appropriately to minimize the likelihood of cross-referencing with external data sources where this could pose a risk, thereby ensuring that the anonymization is genuine and not a “disguised” pseudonymization.
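One of those refinements, l-diversity, illustrates why k-anonymity alone can fall short: if every record in a k-anonymous group shares the same sensitive value, an attacker who knows the quasi-identifiers learns that value even without re-identifying a specific record. A minimal check, with invented data:

```python
from collections import defaultdict

def l_diversity(records, quasi_identifiers, sensitive):
    """Return the l value: the smallest number of distinct sensitive
    values within any group of records sharing the same
    quasi-identifier combination."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(values) for values in groups.values())

# A 2-anonymous data set that is only 1-diverse: both members of the
# first group share diagnosis "A", so knowing someone's age range and
# postcode reveals their diagnosis anyway.
records = [
    {"age": "30-39", "postcode": "28001", "diagnosis": "A"},
    {"age": "30-39", "postcode": "28001", "diagnosis": "A"},
    {"age": "40-49", "postcode": "28002", "diagnosis": "A"},
    {"age": "40-49", "postcode": "28002", "diagnosis": "B"},
]
print(l_diversity(records, ["age", "postcode"], "diagnosis"))  # 1
```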
By Alejandro Negro Sala, Adaya Esteban Ruíz and Raúl Pérez Terol