or: How cross-company data collaboration can be enabled in a GDPR-compliant way
This blog post was co-authored by David Sturzenegger and Florian Marty from decentriq.
Mergers and acquisitions (M&A) totaled almost $2.49 trillion in the first three quarters of 2019. For any M&A, due diligence is essential to confirm information about the two partners and to estimate the future value of the potentially combined companies.[1] Identifying the size of the shared customer base is an essential task done early in the process. At this stage, however, the two companies are usually not legally allowed to exchange their customer databases.
So far, the only workaround has been to hire a trusted third party that gathers the data from both companies and compares the two datasets. This comes with a lengthy legal process in itself, especially when the companies are in different jurisdictions and different privacy regulations apply. What if the two companies could determine the size of their shared customer base without having to share their data with anyone?
We recently worked together with decentriq to examine whether processing in the avato platform, which enables decentriq’s clients and partners to securely match and analyze sensitive data via a browser application, is in line with the GDPR. In this blog post, we will first introduce the concept of private set intersection as a generalization of the shared customer base problem introduced above. We then show how its solution has been implemented in decentriq’s avato platform in a way that removes the need for a trusted third party (not even decentriq has to be trusted). With avato, both parties submit their customer databases to the platform and receive very particular security and privacy guarantees: provably, nobody (not even decentriq) can access their unencrypted data, and only the size of the shared customer base is output.
Private set intersection
A private set intersection (PSI) is the process of determining the intersection of two or more datasets (think lists of customer names) without revealing any of the data to anyone. In the M&A case described above, this means calculating the number of shared customers of two companies without disclosing any customer information to any of the companies or any third party.
This is not a trivial task. Using a trusted third party comes with the lengthy processes and costs discussed above. To avoid third parties, hashing approaches have traditionally been applied. Unfortunately, none of them is really satisfactory:
- Naïve approaches apply the same hashing function to the names in both databases, exchange the results and compare the hashes.[2] Identical names produce the same hash and can thus be identified as shared. As each party knows the hashes of its own customers, it can also infer the names of the shared customers. This can already represent a violation of local privacy laws.
- More involved approaches use double-hashing techniques. These are more complicated, are susceptible to privacy attacks and, most importantly, still fail in the common case of slight differences in the names (think “Freddy Mercury” in one database vs “Fred Mercury” in the other).
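The weaknesses of the naïve approach can be illustrated with a minimal sketch. It assumes unsalted SHA-256 over lowercased names and uses made-up customer lists; the helper `hash_name` and all variable names are hypothetical and only serve the illustration.

```python
import hashlib


def hash_name(name: str) -> str:
    """Hash a normalized name with unsalted SHA-256 (the naive scheme)."""
    return hashlib.sha256(name.strip().lower().encode("utf-8")).hexdigest()


# Made-up customer lists of the two companies (illustrative only).
company_a = ["Anna Example", "Freddy Mercury", "Paul Sample"]
company_b = ["Paul Sample", "Fred Mercury", "Anna Example"]

# Company A keeps a local mapping from hash to name for its own customers.
hashes_a = {hash_name(name): name for name in company_a}

# Company B sends only hashes; this set is all that company A receives.
hashes_from_b = {hash_name(name) for name in company_b}

# Comparing hashes yields the size of the shared customer base ...
shared_hashes = set(hashes_a) & hashes_from_b
shared_count = len(shared_hashes)

# ... but company A can also look up which of its customers are shared.
leaked_names = sorted(hashes_a[h] for h in shared_hashes)

print(shared_count)
print(leaked_names)
```

In this sketch, company A learns not only the count but also the identities of the shared customers, and the pair “Freddy Mercury” / “Fred Mercury” is not counted because the two hashes differ.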
New developments come to the rescue. Recent advances in hardware-based cryptography enable new, strictly superior solutions to the private set intersection problem.
The key to privacy-preserving PSI is encrypted in an enclave
Decentriq’s avato platform leverages Intel’s Software Guard Extensions (Intel SGX) technology to create so-called secure enclave programs. These are isolated computer programs which can provide additional security and privacy guarantees even when running on public cloud infrastructure.
Let’s assume that Anna and Paul work at the two merging companies and are tasked with computing the size of their shared customer base in a privacy-preserving and GDPR-compliant way. They decide to use an avato secure enclave. After receiving the relevant security proofs, they locally encrypt their customer databases and submit them to the secure enclave. Provably, this particular secure enclave is the only program in the world that can decrypt this data. In the enclave, the identifiers are matched, and the number of shared customers is sent back to Anna and Paul.
The use of an avato secure enclave gives Anna and Paul the following security and privacy guarantees:
- Only the particular enclave program Anna and Paul are connected to can decrypt their customer databases.
- Nobody can access the decrypted data, including decentriq and potential infrastructure providers running avato.
- The secure enclave only outputs privacy-preserving aggregate statistics such as the number of shared customers.
Using avato provides Anna and Paul with a simple and safe way of performing the private set intersection. Compared to other approaches, it does not require a trusted third party or complicated algorithms while making it possible to use more sophisticated matching algorithms (fuzzy matching) and outputting additional privacy-preserving statistics. Crucially, as long as the above guarantees hold and the output is non-personal data (e.g. the number of shared customers), the described use of avato is in line with GDPR.
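To make the idea of fuzzy matching with an aggregate-only output more concrete, here is a minimal sketch in plain Python. It is not avato’s API or its actual matching algorithm: the functions `similar` and `shared_customer_count`, the similarity threshold of 0.85 and the sample names are all assumptions made for illustration, and in a real deployment such logic would run inside the enclave on the decrypted data.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two names as the same customer if their similarity ratio is high."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def shared_customer_count(list_a, list_b, threshold: float = 0.85) -> int:
    """Return only the number of matched customers, never the names themselves."""
    unmatched_b = list(list_b)  # each record in list_b can be matched at most once
    count = 0
    for name_a in list_a:
        for name_b in unmatched_b:
            if similar(name_a, name_b, threshold):
                unmatched_b.remove(name_b)
                count += 1
                break
    return count


# Made-up customer lists (illustrative only).
company_a = ["Anna Example", "Freddy Mercury", "Paul Sample"]
company_b = ["Paul Sample", "Fred Mercury", "Someone Else"]

fuzzy_count = shared_customer_count(company_a, company_b)
exact_count = len(set(company_a) & set(company_b))

print(fuzzy_count, exact_count)
```

Because the function returns a single integer, the only information leaving the computation is the aggregate statistic. Unlike exact hash comparison, the fuzzy match also counts the “Freddy Mercury” / “Fred Mercury” pair.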
Why using avato is in line with GDPR
The term ‘personal data’ is the entryway to the application of the GDPR. ‘Personal data’ is defined in Article 4 (1) GDPR as any information relating to an identified or identifiable natural person. Such a person is referred to as a data subject. The data subjects are identifiable if they can be directly or indirectly identified. The definition of personal data is based on the realistic risk of identification, and the applicability of data protection rules should be based on risk of harm and likely severity.[3]
According to Recital 26 (5) GDPR, the principles of data protection should not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.
In contrast to anonymous information, there is no mention of the qualification of encrypted information in the GDPR, and so far, no EU/EEA court has explicitly decided whether encrypted data is personal or not. However, the highest agency for data protection regulation in Bavaria (Landesamt für Datenschutzaufsicht) has concluded that encrypted data does not fall under the category of personal data, under the premise that it is encrypted with strong state-of-the-art cryptographic methods.[4]
Whether encrypted data are personal data therefore depends on the circumstances, particularly on the means reasonably likely to be used (fair or foul) to re-identify individuals.[5] Factors affecting encrypted data’s security against decryption include the following:
- strength of encryption method (the algorithm’s cryptographic strength)
- key management, such as security of decryption key storage, and key access control.[6]
Under WP136, ‘anonymised’ data may be considered anonymous in a provider’s hands if ‘within the specific scheme in which those other controllers (e.g. providers) are operating, reidentification is explicitly excluded and appropriate technical measures have been taken in this respect’.[7]
According to the UK Information Commissioner’s Office (ICO), if (i) a party has encrypted personal data itself and (ii) is responsible for managing the key, it is processing data covered by the GDPR, since it has the ability to re-identify individuals through decryption of that dataset.[8] On that basis, Hon/Millard/Walden suggest that if a party cannot view data, it cannot identify data subjects, and therefore identification may be excluded by excluding others from being able to access or read data.[9] By analogy with key-coded data, to the person encrypting personal data, such as a cloud user with the decryption key, the data remain ‘personal data’.[10] However, in another person’s hands, such as a cloud-based platform provider storing encrypted data without access to the key and no means ‘reasonably likely’ to be used for decryption, the data may be considered anonymous.[11] This removes cloud providers from the scope of data protection legislation, at least where data have been strongly encrypted by the controller before transmission, and the provider cannot access the key.
With encryption, many of the parties who are processing the data do not have the encryption key. The encryption key stays with the generator of the data. This is the case with avato, meaning that encryption in this case bears similarities to the effects of anonymization, as decentriq has no means of reversing the process to access the raw data. In fact, decentriq has no way of knowing whether personally identifiable information is contained in the sets transferred to avato, and as such it would be impossible to define the scope of processing within a data processing agreement with its clients. decentriq also has no greater chance of accessing the data than anyone else who might find the key by accident. avato’s strong encryption therefore has effects similar to anonymization, i.e. it turns personal data in the sense of the GDPR into non-personal data from the point of encryption.
As a result of the above, for all intents and purposes, avato as a host of encrypted data is not processing personal data under the definition of the GDPR. decentriq cannot access that data, and even if its servers were breached, data subjects would be at little risk from a privacy standpoint since the data would also be unintelligible to the wrongdoers.
In this blog post, we have introduced the private set intersection problem and motivated it with the use case of a potential merger of two companies where the number of shared customers should be computed privately. We have argued that traditional approaches to this problem are not satisfactory and that new technologies such as Intel SGX enable strictly superior solutions. One such solution is decentriq’s avato platform, which enables provably privacy-preserving computation on data. We argued that the use of avato is in line with the GDPR, even when the computation is performed on personally identifiable data, as in the outlined case. Even though we have used the example of private set intersection, this generalizes to the many other confidential computing use cases supported by avato.
[3] Ustaran E, European Data Protection Law and Practice, 44.
[4] Tätigkeitsbericht 2017/18 – Bayerisches Landesamt für Datenschutzaufsicht, 89.
[5] Mourby M, Are pseudonymized data always personal data? Implications of the GDPR for administrative data research in the UK, in Computer Law & Security Review, 2018, Vol. 34, 224.
[7] Opinion 4/2007 on the concept of personal data, WP136 (2007).
[9] Hon/Millard/Walden, The problem of ‘personal data’ in cloud computing: what information is regulated? – the cloud of unknowing, in International Data Privacy Law, 2011, Vol. 1, No. 4, 219.