The New York Times sued OpenAI and I agree with them

A full explanation of the lawsuit and what it means for the future of AI

Hope you all had some refreshing holidays! I unplugged completely, spending time with my family in Italy (selfie with our dog Martin at the bottom of this email).

There was absolutely NO unplugging in the world of AI though. The big news is the New York Times suing OpenAI over a bunch of things. This is potentially a defining moment for AI, so we need to understand it. Since there has been a lot of oversimplification floating around, I want to take a moment to explain what the NYT’s claims are and what they may mean, so buckle up.

Understanding NYT’s claims

The first claim is that OpenAI used copyrighted material to train their AI models. This is what everyone is focusing on at the moment, and while I don’t think it’s the most important part of the lawsuit, we need to understand it because it’s where everything starts.

When OpenAI was actually “open”, they used to reveal the datasets they trained on. One of them is called “Common Crawl”, a free repository of internet data. The NYT’s articles are the third-biggest data source in this dataset, right behind Wikipedia and a database of US patents.
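If you want to check for yourself that NYT pages sit in Common Crawl, you can query its public CDX index. A minimal sketch - the snapshot ID below is just an example I picked, so grab a current one from https://index.commoncrawl.org/:

```python
# Sketch: list a few nytimes.com captures in a Common Crawl snapshot via
# the public CDX index API. CRAWL_ID is an example snapshot name
# (assumption); pick a current one from https://index.commoncrawl.org/.
import requests

CRAWL_ID = "CC-MAIN-2023-50"  # example snapshot (assumption)
INDEX_URL = f"https://index.commoncrawl.org/{CRAWL_ID}-index"

resp = requests.get(
    INDEX_URL,
    params={"url": "nytimes.com/*", "output": "json", "limit": 5},
    timeout=30,
)
resp.raise_for_status()

# The API returns one JSON object per line, one per captured page.
for line in resp.text.splitlines():
    print(line)
```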

To the best of my knowledge, the NYT articles in Common Crawl are open articles that can legally be scraped. However, it looks like one reason OpenAI stopped disclosing the data used to train their models may be that they’ve also added copyrighted material.

OpenAI has protected that list by claiming it’s a trade secret, but ChatGPT betrayed them.

The NYT has proved that if you ask ChatGPT to spit out an entire copyrighted or even paywalled article, it will happily do so. How is that possible? OpenAI must have stolen this data from the NYT, breaking the law [side note: OpenAI has since patched this behavior, so ChatGPT will now tell you to go check the NYT directly. Microsoft has been slower, and Copilot will still do it].

This leads me to the second point of the lawsuit, which is even more interesting to me than the use of copyrighted material (there are already 13958 other lawsuits on that problem). OpenAI has historically defended itself by saying that GPT models transform the data they’ve been trained on, so they are not infringing copyright. The NYT has proved that’s not true: when instructed, GPT models will happily reproduce, word for word, content that is supposed to be protected.
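To make this concrete, here’s roughly the kind of regurgitation test you could run yourself: feed a model the opening of an article and measure how much of its continuation matches the original verbatim. A minimal sketch, assuming the official openai Python client; the model name and article text are placeholders, and this is not the NYT’s actual methodology:

```python
# Sketch of the kind of regurgitation test behind the lawsuit's exhibits:
# prompt a model with the opening of an article, then measure the longest
# run of text it reproduces verbatim from the rest of the original.
# The model name and article text are placeholders (assumptions).
from difflib import SequenceMatcher

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

article_opening = "First paragraph of a protected article..."  # placeholder
article_rest = "The remaining text of the same article..."     # placeholder

response = client.chat.completions.create(
    model="gpt-4",  # example model name
    messages=[{
        "role": "user",
        "content": f"Continue this article:\n\n{article_opening}",
    }],
)
continuation = response.choices[0].message.content or ""

# Longest stretch of characters the model reproduced word for word.
match = SequenceMatcher(None, continuation, article_rest).find_longest_match(
    0, len(continuation), 0, len(article_rest)
)
print(f"Longest verbatim run: {match.size} characters")
```

If the longest verbatim run spans whole paragraphs rather than a few words, “transformation” becomes a very hard argument to make.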

This means the NYT suffers substantial monetary damages, because people can read its articles either by paying or by simply asking ChatGPT. This is very close to stealing.

But that’s not all. The last point of the lawsuit is that OpenAI is also causing reputational harm to the NYT. GPT models are known to hallucinate often: they invent random stuff that is plain wrong. This also happens when quoting the NYT’s work. From the lawsuit: a GPT model “completely fabricated that The New York Times published an article on January 10, 2020, titled ‘Study Finds Possible Link between Orange Juice and Non-Hodgkin’s Lymphoma.’” The Times never published such an article.

The NYT says they pay hundreds of people to do proper journalism and make sure everything they publish is truthful, and GPT models hurt that reputation. I think that’s a very fair reason to be pissed.

So what’s the NYT asking? Two things:

  • Tons of money

  • The deletion of every GPT model and the copyrighted training data (!!!)

Oh, and they sued both OpenAI and Microsoft.

Who wins? Who loses?

First of all: I think the NYT is right and they’ll win.

I’m not sure whether OpenAI will have to delete their GPT models, but I’m very confident they’ll at least be forced to disclose the training data.

The most important question is: “so what?” Even if the worst-case scenario for OpenAI plays out and they need to delete their models, what happens then? There are a few scenarios worth playing out.

Technically, it might change very little. While the NYT’s data is the third-biggest data source in the Common Crawl dataset, it still constitutes just ~0.0083% of it. OpenAI could re-train their models without that data with limited impact on quality. However, forcing OpenAI to reveal their models’ training data may expose other skeletons in their closet - so there’s a chance more data will have to be removed (with a larger impact on performance, and more lawsuits).
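To see how “third-biggest source” and “~0.0083%” can both be true, here’s a quick back-of-the-envelope sketch. The token counts are hypothetical, chosen only to reproduce the order of magnitude - in a web-scale corpus, the long tail of small domains dominates, so even a top-ranked single domain is a tiny slice:

```python
# Back-of-the-envelope check of the ~0.0083% figure. Both token counts
# are hypothetical assumptions, not numbers from the lawsuit; they are
# chosen only to show how a top-3 domain can still be a tiny fraction.
total_tokens = 1_200_000_000_000  # hypothetical corpus size: ~1.2T tokens
nyt_tokens = 100_000_000          # hypothetical NYT contribution: ~100M tokens

share = nyt_tokens / total_tokens
print(f"NYT share of the corpus: {share:.4%}")  # -> 0.0083%
```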

The real winner may not be the NYT, however. The real winners may be two others: Apple and open source. Apple has been criticized for being slow at the GenAI game, but I question whether that pace was dictated by laziness, by a lack of skills, or by the fact that they were trying to do things right.

I reckon it may be the latter. While OpenAI was probably stealing data from the NYT, Apple has been trying to sign deals with publishers to use their data to train AI models, paying up to $50M for it. So there’s a scenario in which OpenAI/Microsoft arrived early to the game by cheating, only to get sued and take ten steps back, while Apple took the long (legal) route and wins in the long term. Open source may also pick up, especially if the training data behind these models is disclosed and legal.

This conversation brings back the biggest headache generalist AI products give me: the business model. To the best of my knowledge, OpenAI is not profitable yet - we don’t know how many people pay for ChatGPT, or whether those revenues make up for the insane costs of running the models (and of subsidizing free users). So technically they could get out of this whole shitshow by paying the NYT fairly, but can they afford it?

OpenAI has made a product people want. They’ve not made (yet) a product that is profitable, or legal to sell.

Now let’s talk about what this means for the rest of the world. The real losers will be all of us working on AI adoption. Put yourself in the shoes of some of my clients: large organizations betting millions on AI, integrating it into their products, re-training their teams to use it, and changing their processes to embed it. And now they find out that the entire engine powering their investments may be illegal and could disappear overnight?

I would not blame them for doing a 180 and saying “Never mind, I’ll come back once I can trust this.” Would you build your new house on a foundation the government may take away from you? I wouldn’t.

I find myself once again criticizing the “move fast and break things” approach of Silicon Valley, because the “things” they break often ruin the party for everyone else. AI has the potential to revolutionize the way we work for the better, and I can’t wait to get there and see that value. However, taking shortcuts may end up making all of us waste time and energy to get what we want.

My company AI Academy is working on the course curriculum for 2024 and we’ve opened waiting lists for each program. Please join the list for the programs you find interesting - your submission will inform what courses we prioritize.

Now as promised, here’s my Martin pic 🐶