- cross-posted to:
- technology@lemmy.zip
A number of suits have been filed regarding the use of copyrighted material during training of AI systems. But the Times' suit goes well beyond that to show how the material ingested during training can come back out during use. "Defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.
The suit alleges—and we were able to verify—that it's comically easy to get GPT-powered systems to offer up content that is normally protected by the Times' paywall. The suit shows a number of examples of GPT-4 reproducing large sections of articles nearly verbatim.
The suit includes screenshots of ChatGPT being given the title of a piece at The New York Times and asked for the first paragraph, which it delivers. Getting the ensuing text is apparently as simple as repeatedly asking for the next paragraph.
The suit is dismissive of attempts to justify this as a form of fair use. "Publicly, Defendants insist that their conduct is protected as 'fair use' because their unlicensed use of copyrighted content to train GenAI models serves a new 'transformative' purpose," the suit notes. "But there is nothing 'transformative' about using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it."
The suit seeks nothing less than the erasure of any GPT instances that the parties have trained using material from the Times, as well as the destruction of the datasets that were used for the training. It also asks for a permanent injunction to prevent similar conduct in the future. The Times also wants money, lots and lots of money: "statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity."
Using the material as training data is one thing, but that's not all that's being claimed. Merely using it isn't enough to make it infringement, because fair use can be a defense, and quite likely a viable one if the model weren't spitting articles out verbatim. People already use copyrighted data from news sites verbatim to make new products that do something different, like search engines, or for other things of significant public interest, like archival. People also republish articles from news sites, with or without attribution. So for the basic case of copyright infringement by training, NYT has to show that what ChatGPT is doing is more akin to republishing than to what a search engine does in order to get something that sticks in court.
They are, among other things, effectively asking for compensation as if people were using ChatGPT as an alternative to buying a NYT subscription, which is just the type of clown shit that only a lawyer could come up with. At the same time, they are also asking for compensation for defamation when it fails to reproduce an article and makes shit up instead. If this case keeps going, those claims are going to end up getting dismissed like a lot of the claims in the Andersen v. Midjourney/Stability AI/DeviantArt case did. The lawyers involved know this. They're probably expecting the infringement-by-training claim to stick and consider the others to be bonuses that would be nice to have. It's probably also the door-in-the-face technique: open with an outsized demand so the real one looks reasonable by comparison.
A settlement is probably more likely still, because at the end of the day OpenAI would much rather avoid going through what this case will require of them during discovery. The most significant claim NYT has against them is also literally a demonstration of a failure mode of the model, which OpenAI will want to fix whether or not there are copyright issues involved (maybe by not embodying the "STACK MORE LAYERS" meme so much next time). After that's fixed, the rest of what NYT has against them will be much more difficult to argue in court.