Best Practices and Techniques for Pseudonymization
Pseudonymization is a de-identification process that has gained traction due to the adoption of GDPR, where it is referenced as a security and data protection by design mechanism. The application of pseudonymization to electronic healthcare records aims at preserving the patient’s privacy and data confidentiality.
In the US, HIPAA provides guidelines on how healthcare data must be handled, and data de-identification or pseudonymization is considered a way to simplify HIPAA compliance. Under GDPR, properly applied pseudonymization can relax, to a certain degree, data controllers’ legal obligations.
Even though pseudonymization is a core technique for both GDPR and HIPAA, there are significant differences in the legal status of the generated data. Under GDPR, pseudonymous data is still personal data, while under HIPAA it can be shared provided that the correct data fields are pseudonymized.
Definition of Pseudonymization
Article 4(5) of the GDPR defines pseudonymization as “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information.” However, the Regulation makes clear that this de-identification is reversible, and it attaches the condition that “such additional information is kept separately and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.”
It must be noted that pseudonymization is different from anonymization, which is defined in ISO/TS 25237:2017 as the “process by which personal data is irreversibly altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party.”
The distinction between these two terms is best depicted in the image below.
Figure 1: Pseudonymization vs Anonymization. Image courtesy of CHINO.IO
Benefits of Pseudonymization
The most obvious benefit of pseudonymization is to hide the identity of the data subjects from any third party in the context of a specific data processing operation.
The proliferation of digitalized healthcare services and patient care facilitates the implementation of beneficial research and studies that combine large, complex data sets from multiple sources. The process of de-identification presents the potential of mitigating privacy risks to individuals and therefore can be utilized to support the secondary use of data for comparative analysis, policy assessment, scientific research, personalized medicine, and other health-related endeavors.
In addition, GDPR considers properly pseudonymized data as:
- A safeguard to help ensure the compatibility of new data processing [Article 6(4)]
- A technical and organizational measure to help enforce data minimization principles and compliance with Data Protection by Design and by Default obligations (Article 25)
- A security measure helping to make data breaches “unlikely to result in a risk to the rights and freedoms of natural persons”, reducing liability and notification obligations for data breaches (Articles 32, 33 and 34)
Pseudonymization Techniques
A recent report by the EU Agency for Cybersecurity (ENISA) explores technical solutions that can support the implementation of pseudonymization in practice.
In principle, pseudonymization maps identifiers (e.g. names, IP addresses, email addresses) to pseudonyms. For a pseudonymization function to be effective there is only one fundamental requirement: two different identifiers must never receive the same pseudonym. That is, the pseudonym pseudo1 corresponding to identifier id1 must be different from the pseudonym pseudo2 corresponding to identifier id2. Otherwise, the recovery of the identifier would be ambiguous, because we could not tell whether a shared pseudonym corresponds to id1 or id2. The reverse is allowed: a single identifier can be associated with multiple pseudonyms, as long as each pseudonym can still be mapped back to that one identifier.
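The requirement above can be checked mechanically. The following is a minimal sketch, not part of the ENISA report: the helper `find_collisions` and the deliberately flawed `first_letter` function are hypothetical names introduced here for illustration.

```python
from typing import Callable, Dict, Iterable, Set


def find_collisions(identifiers: Iterable[str],
                    pseudonymize: Callable[[str], str]) -> Dict[str, Set[str]]:
    """Return every pseudonym that is shared by more than one distinct identifier."""
    seen: Dict[str, Set[str]] = {}
    for identifier in identifiers:
        seen.setdefault(pseudonymize(identifier), set()).add(identifier)
    return {pseudo: ids for pseudo, ids in seen.items() if len(ids) > 1}


def first_letter(identifier: str) -> str:
    """Hypothetical, deliberately bad function: keeps only the first letter."""
    return identifier[:1].upper()


# "alice" and "adam" both map to "A", so recovery would be ambiguous.
collisions = find_collisions(["alice", "adam", "bob"], first_letter)
```

An empty result means the function kept the tested identifiers distinct. It does not prove the function is collision-free for identifiers outside the test set.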
In all cases, the association of pseudonyms with the original identifiers relies on what is called a pseudonymization secret, such as a key or a mapping table. Because the effectiveness of the pseudonymization operation depends on it, this secret must be protected by adequate technical and organizational measures. The pseudonymization secret must be isolated from the dataset, or it will be too easy for an adversary to recover the identifiers. In addition, strong access control policies must ensure that only authorized personnel have access to this secret. Finally, the pseudonymization secret must be encrypted if it is digitally stored, which necessitates proper key management and storage.
Counter
Counter is the simplest pseudonymization technique. The identifiers are substituted by a number chosen by a monotonic counter. It is critical that the values produced by the counter never repeat, to prevent any ambiguity. The biggest advantage of this technique is its simplicity. However, the solution may present implementation and scalability issues in large and sophisticated datasets, as the complete pseudonymization mapping table needs to be stored.
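A minimal sketch of a counter-based scheme follows. The class name `CounterPseudonymizer` and its methods are hypothetical, and the sketch assumes the deterministic variant in which a repeated identifier always receives the same pseudonym. The in-memory dictionary plays the role of the mapping table, which is the pseudonymization secret here.

```python
import itertools
from typing import Dict


class CounterPseudonymizer:
    """Assigns each new identifier the next value of a monotonic counter."""

    def __init__(self, start: int = 1) -> None:
        self._counter = itertools.count(start)  # never repeats a value
        self._table: Dict[str, int] = {}        # mapping table (the secret)

    def pseudonymize(self, identifier: str) -> int:
        # Reuse the stored pseudonym for identifiers seen before.
        if identifier not in self._table:
            self._table[identifier] = next(self._counter)
        return self._table[identifier]

    def recover(self, pseudonym: int) -> str:
        # Recovery is only possible with access to the mapping table.
        for identifier, assigned in self._table.items():
            if assigned == pseudonym:
                return identifier
        raise KeyError(pseudonym)


counter_demo = CounterPseudonymizer()
first = counter_demo.pseudonymize("alice@example.com")
second = counter_demo.pseudonymize("bob@example.com")
```

The table grows by one entry per distinct identifier, which is the storage cost described above.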
Random Number Generator (RNG)
RNG is a mechanism that produces values that have an equal probability of being selected from the total population of possibilities. These unpredictable values are then assigned to an identifier. There are two options to create this mapping: a true random number generator or a cryptographic pseudo-random generator. RNG provides strong data protection since it is difficult to extract information regarding the initial identifier unless the mapping table is compromised. However, scalability might be an issue depending on the de-identification scenario because the complete pseudonymization mapping table must be stored.
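As a sketch of the cryptographic pseudo-random option, the example below uses Python's standard `secrets` module. The class name `RandomPseudonymizer` and the 16-byte pseudonym length are assumptions made for illustration, not values taken from the ENISA report.

```python
import secrets
from typing import Dict


class RandomPseudonymizer:
    """Assigns each identifier an unpredictable random pseudonym."""

    def __init__(self, n_bytes: int = 16) -> None:
        self._n_bytes = n_bytes
        self._forward: Dict[str, str] = {}  # identifier -> pseudonym
        self._reverse: Dict[str, str] = {}  # pseudonym -> identifier

    def pseudonymize(self, identifier: str) -> str:
        if identifier in self._forward:
            return self._forward[identifier]
        pseudonym = secrets.token_hex(self._n_bytes)
        # Redraw in the unlikely event that the value was already assigned,
        # so that two identifiers never share a pseudonym.
        while pseudonym in self._reverse:
            pseudonym = secrets.token_hex(self._n_bytes)
        self._forward[identifier] = pseudonym
        self._reverse[pseudonym] = identifier
        return pseudonym

    def recover(self, pseudonym: str) -> str:
        return self._reverse[pseudonym]


rng_demo = RandomPseudonymizer()
pseudo_alice = rng_demo.pseudonymize("alice@example.com")
pseudo_bob = rng_demo.pseudonymize("bob@example.com")
```

Because the pseudonym carries no information about the identifier, both dictionaries together form the secret that must be stored and protected.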
Cryptographic Hash Function
A cryptographic hash function takes input strings of arbitrary length and maps them to fixed-length outputs. The hashing function is applied directly to the identifier to obtain the corresponding pseudonym, whose size depends on the length of the digest produced by the function. A hashing function contributes towards strong data privacy; however, it is considered a weak pseudonymization technique because it is prone to brute-force and dictionary attacks: anyone who can guess candidate identifiers can hash them and compare the results with the pseudonyms.
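The weakness is easy to demonstrate. The sketch below assumes SHA-256 as the hash function (the article does not name one), and `hash_pseudonym` and `dictionary_attack` are hypothetical helper names.

```python
import hashlib
from typing import Iterable, Optional


def hash_pseudonym(identifier: str) -> str:
    """Pseudonym = unkeyed SHA-256 digest of the identifier (hex encoded)."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def dictionary_attack(pseudonym: str, candidates: Iterable[str]) -> Optional[str]:
    """Hash each guessed identifier and compare it with the pseudonym."""
    for candidate in candidates:
        if hash_pseudonym(candidate) == pseudonym:
            return candidate
    return None


leaked = hash_pseudonym("alice@example.com")
guesses = ["bob@example.com", "alice@example.com", "carol@example.com"]
recovered = dictionary_attack(leaked, guesses)
```

No secret is involved, so the attacker only needs a plausible list of identifiers, such as known email addresses or the full range of possible IP addresses.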
Message Authentication Code (MAC)
MAC is considered a keyed-hash function because a secret key is required to generate the pseudonym. Without knowledge of this key, it is not possible to map the identifiers to the pseudonyms. HMAC is the most popular MAC design used in Internet protocols. MAC is generally considered a robust pseudonymization technique in terms of data protection, since reverting the pseudonym is infeasible, provided that the key has not been compromised. Different variations of the method may apply, with different utility and scalability requirements.
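A minimal sketch with Python's standard `hmac` module is shown below. HMAC-SHA256 and the 32-byte key length are assumptions chosen for illustration, and `hmac_pseudonym` is a hypothetical helper name.

```python
import hashlib
import hmac
import secrets


def hmac_pseudonym(key: bytes, identifier: str) -> str:
    """Pseudonym = HMAC-SHA256 of the identifier under a secret key."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# The key is the pseudonymization secret and must be stored apart from the data.
key_a = secrets.token_bytes(32)
key_b = secrets.token_bytes(32)

mac_a = hmac_pseudonym(key_a, "alice@example.com")
mac_b = hmac_pseudonym(key_b, "alice@example.com")
```

The same key always yields the same pseudonym for a given identifier, which keeps records linkable within a dataset. A different key yields unrelated pseudonyms, so the dictionary attack that works against a plain hash fails without the key.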
Encryption
Encryption is another robust pseudonymization technique, provided that the encryption key has not been compromised. Although many think of encryption as an anonymization technique, mapping an identifier to a pseudonym requires a “secret”, the encryption key, which makes the ciphertext a pseudonym and therefore personal data. The length of the identifier that can be de-identified using encryption is limited by the block size of the cipher used.
Advances in cryptography, such as Fully Homomorphic Encryption (FHE), may render encrypted data as anonymized since they permit operations on encrypted data without decrypting them. Unfortunately, due to high computing overhead, FHE is at present highly inefficient and not a practical alternative for the processing of personal data.
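The sketch below illustrates the two properties described above: the pseudonym can be reversed only with the key, and the identifier must fit in one block. Python's standard library does not include a block cipher such as AES, so this example swaps in a toy Feistel construction built from HMAC-SHA256. It is for illustration only and is not a vetted cipher. `BLOCK_SIZE`, `ROUNDS` and all function names are hypothetical.

```python
import hashlib
import hmac
import secrets

BLOCK_SIZE = 32          # bytes; assumed block size of this toy cipher
HALF = BLOCK_SIZE // 2
ROUNDS = 8               # assumed round count, chosen for illustration


def _round(key: bytes, round_no: int, half: bytes) -> bytes:
    """Keyed round function: truncated HMAC-SHA256 of the round number and half-block."""
    return hmac.new(key, bytes([round_no]) + half, hashlib.sha256).digest()[:HALF]


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_identifier(key: bytes, identifier: str) -> str:
    data = identifier.encode("utf-8")
    if len(data) >= BLOCK_SIZE:
        raise ValueError("identifier does not fit in one cipher block")
    # Pad to a full block: a 0x80 marker followed by zero bytes.
    block = data + b"\x80" + b"\x00" * (BLOCK_SIZE - len(data) - 1)
    left, right = block[:HALF], block[HALF:]
    for i in range(ROUNDS):
        left, right = right, _xor(left, _round(key, i, right))
    return (left + right).hex()


def decrypt_pseudonym(key: bytes, pseudonym: str) -> str:
    block = bytes.fromhex(pseudonym)
    left, right = block[:HALF], block[HALF:]
    for i in reversed(range(ROUNDS)):  # undo the rounds in reverse order
        left, right = _xor(right, _round(key, i, left)), left
    data = (left + right).rstrip(b"\x00")
    return data[:-1].decode("utf-8")   # drop the 0x80 padding marker


enc_key = secrets.token_bytes(32)
ciphertext = encrypt_identifier(enc_key, "alice@example.com")
plaintext = decrypt_pseudonym(enc_key, ciphertext)
```

In practice a reviewed implementation of a standard cipher would take the place of the Feistel functions. The key handling and the length check would stay conceptually the same.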
What is the Best Technique?
The choice of a pseudonymization technique depends on the data protection level and the utility of the pseudonymized dataset. In terms of data protection, RNG, message authentication codes and encryption are stronger techniques as they can mitigate all known attack vectors. However, utility requirements might lead to a combination of different approaches or variations.
Risk-Based Pseudonymization
The challenge of the proper application of pseudonymization to personal data is a highly debated topic in many different industries and sectors, ranging from research and academia to justice and law enforcement and to compliance management in healthcare.
Data pseudonymization in complex information infrastructures, such as in healthcare, is challenging, with high interdependencies of context, involved entities, data types, background information, and implementation details. The ENISA report highlights that there is no single, easy solution to pseudonymization that works for all approaches in all possible scenarios.
On the contrary, it requires highly skilled and competent security and privacy professionals to apply a robust pseudonymization process, minimizing the threat of discrimination or re-identification attacks, while maintaining the degree of utility necessary for the processing of the pseudonymized data.
The security and privacy professional needs to adopt a risk-based approach with respect to the choice of the proper pseudonymization technique, to properly assess and mitigate the relevant privacy threats, taking into account the purpose of the personal data processing, and the utility and scalability levels they wish to achieve.
How the HCISPP Certification Can Help You Succeed
If you are currently a security practitioner working in the healthcare field, or you are looking to enter the area of healthcare security, the HealthCare Information Security and Privacy Practitioner (HCISPP) certification offered by (ISC)2 is the perfect vehicle to enhance your knowledge and skills. Not only does this credential give you the skills you need to function at the highest levels of a healthcare organization, but it shows your employer that you possess specialized knowledge and dedication specific to the healthcare profession.
Download our white paper, Not All Life Savers Wear White Coats, to learn more about pseudonymization techniques.