Data Protection

From concept to compliance: Legal Bases for AI training explained

Last Updated 13/03/2024

Whether you are a founder, a decision-maker in the AI industry, or just starting to develop your first AI pilot, it is important to think about privacy rights from the start. This topic becomes especially important when training AI systems, because the legal basis you choose for your training purposes affects both how practical the process is and the rights of the people whose data is used. In this blog post, we’ll explain the key points in line with the EU’s General Data Protection Regulation (GDPR) and the Swiss Federal Act on Data Protection (FADP).

Understanding data sources for AI training

A key first step in training AI systems in compliance with the GDPR and the FADP is figuring out whether you are using personal data and, if so, where it comes from. This matters because data protection laws apply only to personal data; anonymized data, for example, falls outside their scope.

The difference between personal data and anonymized / synthetic data in AI training 

Personal data is data that identifies individuals either directly or indirectly. Your AI model may be trained on personal data directly, or, even if it is not, there may still be a risk of indirectly identifying people through techniques such as membership inference and model inversion. In either case, GDPR and/or FADP rules could apply to your model. One way to avoid this in the first place is to use only anonymized data to train your AI systems. By making data anonymous, you break its connection to any specific person, so GDPR and FADP rules no longer apply, as the data is no longer considered personal. This is especially relevant when particularly sensitive information (e.g., medical data) is to be used as training data. Keep in mind, though, that anonymizing data is a complex task: it is crucial that the anonymized information cannot be linked back to anyone specific. You also need to regularly check whether the anonymization can still be considered secure against current re-identification methods.
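Whether a dataset is genuinely anonymized often turns on whether individuals remain identifiable through combinations of seemingly harmless attributes (so-called quasi-identifiers). As a purely illustrative sketch (not legal advice; field names are made up), a minimal k-anonymity check in Python shows the idea: if any combination of quasi-identifier values is unique, someone may still be re-identifiable.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the smallest group size when records are grouped by the
    given quasi-identifier fields. A low value (e.g. 1) means some
    individuals are unique on those attributes and may be re-identifiable."""
    groups = Counter(
        tuple(r[field] for field in quasi_identifiers) for r in records
    )
    return min(groups.values())

# Hypothetical, already "de-identified" records: names were removed, but
# (zip_code, birth_year) may still single people out.
records = [
    {"zip_code": "8001", "birth_year": 1980, "diagnosis": "A"},
    {"zip_code": "8001", "birth_year": 1980, "diagnosis": "B"},
    {"zip_code": "8002", "birth_year": 1975, "diagnosis": "A"},
]
print(k_anonymity(records, ["zip_code", "birth_year"]))  # 1 -> not safe
```

A result of 1 signals that at least one person is unique on those attributes; real anonymization assessments use far richer methods, but even this simple check illustrates why anonymization must be re-validated over time.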

An additional solution may be the use of synthetic data, meaning data created through algorithms to mimic real-world data. While it is a more privacy-friendly approach that may allow you to escape the GDPR and the FADP in some circumstances, it is not a bulletproof solution: synthetic data may still be traced back to particular individuals, depending on how it was generated or on the underlying data used to generate it.
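To make this concrete, here is a deliberately naive synthetic-data sketch (hypothetical field names, illustration only): it samples each column independently from the real data's values. This breaks exact row-level links, but rare or unique values from the source data can still reappear verbatim, which is one way synthetic data can remain traceable to individuals.

```python
import random

def synthesize(rows, n, seed=0):
    """Generate n synthetic rows by sampling each column independently
    from the real data's observed values (a naive marginal sampler).
    Rare values can still leak: a unique attribute belonging to one
    person may reappear verbatim in the synthetic output."""
    rng = random.Random(seed)
    columns = {key: [r[key] for r in rows] for key in rows[0]}
    return [
        {key: rng.choice(values) for key, values in columns.items()}
        for _ in range(n)
    ]

real = [
    {"age": 34, "city": "Zurich"},
    {"age": 29, "city": "Geneva"},
    {"age": 51, "city": "Basel"},
]
fake = synthesize(real, 5)
```

Production-grade synthetic-data tools model joint distributions and add privacy safeguards, but the residual-risk point holds for them too: the guarantee depends on how the generator was built and trained.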

Identifying data sources for AI systems: Ensuring GDPR and FADP compliance 

If you have to use personal data for your AI training, or simply cannot exclude that your data set includes personal data, the next step is to examine your data source. Personal data for AI training can be acquired either directly from data subjects or indirectly.

Direct collection involves data provided by users. This can occur in a pre-deployment phase – e.g. when patients allow their medical exams to be used to train an AI system – or after deployment, when users interact with AI systems (e.g. user-submitted prompts in generative AI systems). Indirect sources include web-scraped data and data provided by third parties (e.g. data brokers).

In both cases, you must ensure that the data processing aligns with data protection principles and practices; the concrete way of doing so, however, will vary from case to case. In simple terms, how you collect personal data can create various compliance issues under the GDPR and the FADP. Not having direct contact with the people whose data you use can make it harder to obtain their consent and to respect their rights, such as informing them about the data processing. We’ll look into this in more detail below.

Choosing the right legal basis for AI Training 

If data protection law, especially the GDPR, is applicable, one of the first steps is to ensure that you have a legal basis to process personal data for the training of your AI system. In this section, we will explore the main legal bases applicable in this ecosystem, providing concrete input on the benefits and hurdles of each one.

Generally applicable legal bases 

While the FADP does not automatically require a pre-determined legal basis for a processing activity unless the processing violates the data subject’s personality rights (e.g. when it is done against their express will, or when sensitive personal data is shared with third parties), the GDPR requires you to have one. This means that you will need to determine an appropriate legal basis for AI training before using personal data.

For the processing of personal data that does not pertain to a special category of personal data (e.g. health data), the following legal bases may be applicable, depending on the context: 

  • Execution of a contract: This legal basis can be used if the training of the AI system and the related processing of personal data are indispensable to fulfilling a contract with the data subject. However, the fact that the contract needs to be with the data subject itself makes its applicability very limited (e.g. the development of personalised AI systems, tailored to the data subject to whom the system is provided);
  • Consent: While consent is a logical choice, obtaining it can be challenging, especially when dealing with web-scraped data. These hurdles are due to the lack of direct contact with data subjects. For that reason, when personal data is not directly obtained from data subjects, it may not be an appropriate legal basis to use, as it is arguably quite difficult to obtain an informed, unambiguous, freely given, and specific consent in these circumstances;
  • Legitimate interests: Legitimate interests emerge as a versatile basis. In practice, you will need to perform a balancing of interests, where the necessity of training your AI on personal data, examined against alternatives, must outweigh the potential impacts on data subjects’ rights. You must also ensure transparency by informing data subjects of your legitimate interests and establish opt-out mechanisms for the right to object (more on that below). 

Picking the right legal basis requires careful thought. Even though choosing “legitimate interests” might seem practical at first, there are still legal grey areas. You have to carefully weigh different needs to make sure you’re respecting people’s rights. This includes paying close attention to how much and what types of personal data you need, who the data is about (especially sensitive groups like children), and the nature of the data, whether it’s real, synthetic, or pseudonymized. A good first step could be adopting technical measures that ensure adequate levels of security for the personal data that is processed (e.g. differential privacy), as well as curating the training data to limit the amount of sensitive information present (e.g. identifying and filtering personal information, and de-duplicating data). 
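The curation measures mentioned above can be sketched in code. The following is a simplified illustration (the regex patterns are toy examples, not production-grade PII detectors): it masks obvious e-mail addresses and phone numbers and drops exact duplicate documents, since duplicated records are known to increase the chance that a model memorizes, and later leaks, personal data.

```python
import re

# Toy patterns for illustration; real pipelines use far more robust detectors.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text):
    """Replace obvious e-mail addresses and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def prepare_corpus(documents):
    """Scrub simple PII and drop exact duplicates, two of the curation
    steps mentioned above (filtering personal information, de-duplication)."""
    seen, out = set(), []
    for doc in documents:
        cleaned = scrub(doc)
        if cleaned not in seen:
            seen.add(cleaned)
            out.append(cleaned)
    return out

docs = [
    "Contact Jane at jane.doe@example.com or +41 44 123 45 67.",
    "Contact Jane at jane.doe@example.com or +41 44 123 45 67.",
    "General product description with no personal data.",
]
print(prepare_corpus(docs))
```

Note that such filtering reduces, but does not eliminate, the presence of personal data, so it supports a legitimate-interests argument rather than replacing the legal analysis.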

Change of processing purpose 

When AI training uses personal data for a new purpose that’s different from why it was originally collected, such as fulfilling a contract, it’s important to remember that GDPR has rules for changing the purpose of data use. If the new use, like AI training, doesn’t match the original purpose, GDPR requires you to find a new legal basis for this new use. This means that AI systems cannot be trained by merely “recycling” previously processed data without proper legal justification. You must ensure that if you reuse data for AI training, you establish a new, appropriate legal basis for this specific purpose and inform data subjects of it. 

To streamline this process, you should mention in your documents, like privacy policies, that you’ll be using personal data for training purposes and explain the legal reason for it. This approach helps prevent the need to inform individuals again about any changes in how their data is used. 

Special categories of personal data 

There is also an extra layer of complexity if the AI system is trained on special categories of personal data. When training AI on special categories such as health data, political views, or religious beliefs, developers must adhere to stricter rules, as processing such data is generally prohibited except under narrowly defined exceptions.

The best legal basis for using sensitive data in AI training, according to both GDPR and FADP, is usually getting consent. This is because there’s a big difference between using the AI system and training it. For example, a doctor might use AI to work with health data under a contract, keeping things confidential. But when it comes to training the AI, getting proper consent is often the only way to go. 

Getting this consent can be tricky. Also, while the FADP does not always require a legal basis for processing data, you need one if the processing could infringe someone’s personality rights, such as sharing sensitive information with third parties. Here, consent is often the only choice for AI training. Because of these challenges with sensitive data, it is best to use anonymized data for training if you can: anonymized data does not count as personal data, so it does not fall under GDPR or FADP rules.

Data subject rights in AI training: Challenges under GDPR and FADP

In this last section, we will explore the main challenges that AI training poses when fulfilling data subjects’ rights, both under the GDPR and the FADP. Under both data protection laws, data subjects have a set of rights. However, in the AI realm, their practical implementation is not as easy as it seems. In particular, it faces the following hurdles: 

Right to be informed

You are required to inform data subjects about the use of their personal data for training purposes. However, this proves difficult: given the vast amount of information needed to train these systems and the use of indirect data sources (e.g. web scraping and data suppliers), it is very difficult to individually identify and contact data subjects.

The GDPR and the FADP exempt you from informing data subjects of the processing activities when doing so proves disproportionately difficult or impossible. Even if this exemption under the GDPR tends to be interpreted narrowly by some EU supervisory authorities, there seem to be good arguments for its application when the data comes from indirect sources.

Nevertheless, you still need to adopt other appropriate measures to convey the relevant information regarding your processing activities, for example by providing a privacy policy on your website or app. 

Withdrawal of consent and the right to object

If the legal basis for the AI training was consent, data subjects have the right to revoke it with effect for the future. This means that the personal data can no longer be processed (unless another legal basis is applicable) and should be deleted. 

The right to object plays a similar role but regarding legitimate interests: if you used legitimate interests as a legal basis for the processing of personal data to train your AI system, data subjects have the right to object to that processing activity and – if you do not have a preponderant interest – its usage must stop. 

This means that although consent and legitimate interests may seem the most likely legal bases for AI training, they come with a handicap: under some circumstances, you will need to stop processing specific data points if requested by the data subject. This will not only affect the training of the system but may also be nearly impossible to do in practice.

Right to erasure and to rectification

In the context of training AI systems, ensuring data subjects’ rights to erasure and rectification is complex. This difficulty stems from several factors:

  • ‘Continuous learning’ – Some AI systems are regularly updated based on user interactions (e.g. the interactions and prompts of the system’s user). This means that personal data is continuously processed and added, making it progressively harder to track all user data that is processed;
  • ‘Non-retrievability’ – Since the information used to train these systems comes from a variety of sources, it is very difficult to trace individual data subjects. This is made harder due to the deep embedding of personal data with the underlying algorithms, where isolating personal data proves to be challenging. 

Lastly, there are challenges that relate to the system itself. Altering or removing personal data from the training set may impact the model’s validation and accuracy, which may imply the necessity to retrain it. This is a costly, time-consuming and technically difficult endeavour. 
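One practical mitigation sometimes discussed (sketched here as a hypothetical illustration, with made-up identifiers) is to keep a provenance index linking data-subject identifiers to the training records they contributed. This does not remove anything from an already-trained model, but it at least makes an erasure request actionable: you can locate the affected records and exclude them before the next (re)training run.

```python
from collections import defaultdict

class ProvenanceIndex:
    """Map data-subject IDs to the training records they contributed,
    so an erasure request can locate the affected data. This does not
    delete data from a trained model by itself; it only tells you which
    records to exclude from the training set going forward."""

    def __init__(self):
        self._records = defaultdict(set)

    def register(self, subject_id, record_id):
        self._records[subject_id].add(record_id)

    def erase(self, subject_id):
        """Return the record IDs to drop from the training set."""
        return self._records.pop(subject_id, set())

index = ProvenanceIndex()
index.register("subject-42", "doc-001")
index.register("subject-42", "doc-007")
index.register("subject-99", "doc-002")

to_drop = index.erase("subject-42")  # the records to exclude before retraining
```

Maintaining such lineage metadata from the start is far cheaper than trying to reconstruct it once an erasure request arrives.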

Even if technical ways of facilitating the removal of data from some types of AI systems are currently being explored (e.g. machine unlearning), the research done so far, while not entirely conclusive, suggests that it may come at a cost to the accuracy of the system, as well as requiring time and deep technical know-how.

Best practices for aligning AI training with GDPR and FADP: A conclusion

In conclusion, harmonizing AI training with data protection requirements entails navigating data origins, legal bases, and data subject rights. You should align your practices with the applicable principles, fostering responsible and compliant AI development, namely by: 

  • Prioritizing the anonymization of personal data used for training AI systems, as well as the use of synthetic data, to mitigate risks of non-compliance with the GDPR and the FADP; 
  • Assessing the appropriate legal basis for processing personal data in AI training, with a preference for legitimate interests where feasible, and ensuring all processing activities are transparent and fair; 
  • Refraining from processing special categories of data, opting for anonymizing them instead; 
  • Implementing measures to inform data subjects about the use of their personal data in AI training; 
  • Developing and maintaining clear mechanisms for data subjects to exercise their rights, including withdrawal of consent and the right to object; 
  • Investing in technological solutions that enable compliance with data subjects’ rights, such as data erasure and rectification, even in complex AI training environments. 

Following these guidelines helps build trust and ensures that your AI projects are both creative and in line with data protection laws, leading to a more ethical and lasting tech future. 

By Sebastian Schneider

Head of Privacy & Digital Regulation, Legal Expert

Co-authored by: Francisco Arga e Lima

