OpenAI’s data disaster. Is this the end for ChatGPT?

April 19, 2023

Following a temporary suspension in Italy and a growing number of investigations in other EU countries, OpenAI has just over a week to comply with European data protection rules. If it fails, it could be fined, forced to delete data, or even banned.

However, experts have told MIT Technology Review that OpenAI will not be able to comply with the rules. The reason is the way the data used to train its AI models was gathered: by scraping content from the internet.

The prevailing view in AI development is that the more training data there is, the better. OpenAI's GPT-2 model was trained on a data set of 40 gigabytes of text. GPT-3, the foundation of ChatGPT, was trained on 570 GB of data. OpenAI has not disclosed the size of the data set for its latest model, GPT-4.

However, the company's hunger for ever-larger models is now coming back to haunt it. In recent weeks, several Western data protection authorities have opened investigations into how OpenAI collects and processes the data that powers ChatGPT. They suspect it scraped people's personal information, such as names and email addresses, and used it without their consent.

As a precaution, the Italian authorities have restricted the use of ChatGPT, while data regulators in France, Germany, Ireland, and Canada are all investigating how the OpenAI system collects and uses data. The European Data Protection Board, the umbrella body for data protection authorities, is also setting up an EU-wide task force to coordinate investigations and enforcement around ChatGPT.

The Italian data regulator has given OpenAI until April 30 to comply with the rules. That means OpenAI would need to obtain people's consent before scraping their data, or demonstrate that it has a "legitimate interest" in collecting it. OpenAI will also have to explain to users how ChatGPT uses their data and give them the ability to correct any errors the chatbot makes about them, to have their data deleted if they wish, and to object to the program using it.

If OpenAI cannot convince the authorities that its data practices are legal, it could be banned in individual countries or even the entire European Union. According to Alexis Leautier, an AI expert at the French data protection regulator CNIL, the company could also face heavy fines and might even be required to delete models and the data used to train them.

According to Lilian Edwards, an internet law expert at Newcastle University, OpenAI's violations are so flagrant that the case will almost certainly end up at the Court of Justice of the European Union, the EU's top court. It could be years before the questions raised by the Italian data regulator are answered.

High-stakes game

The stakes for OpenAI could hardly be higher. The EU's General Data Protection Regulation is the strictest data protection regime in the world, and it has been widely copied elsewhere. Regulators from Brazil to California will be watching closely to see what happens next, and the outcome could fundamentally change the way AI companies collect data.

In addition to being more transparent about its data practices, OpenAI will have to show that it relies on one of two legal bases for collecting training data for its algorithms: consent or "legitimate interest."

It is unlikely that OpenAI will be able to argue that it obtained people's consent when it scraped their data. That leaves the argument that it had a "legitimate interest" in doing so. According to Edwards, this will likely require the company to make a compelling case to regulators about how essential ChatGPT is in order to justify collecting data without consent.

OpenAI told MIT Technology Review that it believes it complies with privacy laws, and that it works to remove personal information from training data upon request "where feasible."

According to the company, its models are trained on publicly available content, licensed content, and content created by human reviewers. But that bar is too low for the GDPR.

"The United States has a doctrine that once something is in public, it is no longer private, which is not at all how European law works," Edwards explains. As "data subjects," people have rights under the GDPR, such as the right to be informed about how their data is collected and used, as well as the right to have their data erased from systems, even if it was previously public.

Finding a needle in a haystack

OpenAI faces another problem. According to the Italian regulator, OpenAI is not being transparent about how it collects users' data during the post-training phase, for example from chat logs of their interactions with ChatGPT.

"What's really concerning is how it uses the data that you give it in the chat," Leautier explains. People frequently share intimate, private information with the chatbot, telling it about their emotional state, health, or personal opinions. According to Leautier, it is a problem if ChatGPT regurgitates this sensitive material to others. Under European law, users must also be able to have their chat log data deleted, he says.

According to Margaret Mitchell, an AI researcher and chief ethics scientist at startup Hugging Face who was previously Google's AI ethics co-lead, identifying individuals' data and removing it from its models will be nearly impossible for OpenAI.

She says that the company could have saved itself a major headache by building in rigorous data record-keeping from the outset. Instead, it is common in the AI industry to build data sets for AI models by scraping the web indiscriminately and then outsourcing the work of removing duplicates or irrelevant data points, filtering unwanted content, and fixing errors. Because of these methods, and the sheer size of the data sets, tech companies tend to have a very limited understanding of what went into training their models.

According to Nithya Sambasivan, a former Google research scientist and entrepreneur who has studied AI's data practices, tech companies don't document how they collect or annotate AI training data, and don't even know what is in the data set.

Finding Italian data in ChatGPT's massive training data set will be like looking for a needle in a haystack. Even if OpenAI does manage to delete users' data, it is unclear whether the deletion would be permanent. Studies have shown that data sets remain available on the internet long after they have been deleted, because copies of the original continue to circulate.

"The state of the art around data collection is very, very immature," Mitchell explains. That is because an enormous amount of work has gone into developing cutting-edge techniques for AI models, whereas data collection methods have not changed in the past decade.

According to Mitchell, the AI community overemphasizes work on AI models at the expense of everything else: "Culturally, there's this issue in machine learning where working on data is seen as silly work and working on models is seen as real work."

"As a whole, data work requires significantly more legitimacy," Sambasivan says.