- cross-posted to:
- technology@lemmy.zip
A number of suits have been filed regarding the use of copyrighted material during training of AI systems. But the Times' suit goes well beyond that to show how the material ingested during training can come back out during use. "Defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.
The suit alleges—and we were able to verify—that it's comically easy to get GPT-powered systems to offer up content that is normally protected by the Times' paywall. The suit shows a number of examples of GPT-4 reproducing large sections of articles nearly verbatim.
The suit includes screenshots of ChatGPT being given the title of a piece at The New York Times and asked for the first paragraph, which it delivers. Getting the ensuing text is apparently as simple as repeatedly asking for the next paragraph.
The suit is dismissive of attempts to justify this as a form of fair use. "Publicly, Defendants insist that their conduct is protected as 'fair use' because their unlicensed use of copyrighted content to train GenAI models serves a new 'transformative' purpose," the suit notes. "But there is nothing 'transformative' about using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it."
The suit seeks nothing less than the erasure of both any GPT instances that the parties have trained using material from the Times, as well as the destruction of the datasets that were used for the training. It also asks for a permanent injunction to prevent similar conduct in the future. The Times also wants money, lots and lots of money: "statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity."
At best it will be a slight setback. I think the cat's just outta the bag now.
There's really no cat. We've been using algorithms to do stuff for a very long time (thousands of years), and there's literally no intelligence behind what people are calling "artificial intelligence". It's just another algorithm. This is just another increment in automation like all the rest (plow, printing press, loom, assembly line, computer, etc.), except the marketing is making it sound even more fundamental, when in reality it's really less impressive (e.g. a spell-checking feature added to a word-processing program!).
Will capitalists still use the term "artificial intelligence" to try to justify whatever BS they're pulling—against other capitalists in the market, but especially against workers? Of course. Just like they're likely to keep using the term "sharing" to bypass labor protections and other regulation having to do with taxis, hotels, etc.
Anyway, we really don't have a horse in this race. Whether the capitalists wanting to preserve "intellectual property" win and Napsters and Pirate Bays keep getting taken down, or the SPAM engine capitalists win and everything we try to do gets flooded with so much barely camouflaged marketing junk that we can't sort through it all. Heads they win/tails we lose. Or whatever boring dumbassery winds up getting settled on in the middle to maximally both preserve and enhance our exploitation, which is the most likely result.
The cat, in this case, isn't necessarily an actual artificial intelligence but is instead a cursed abuse of linear algebra smashed into a shape beyond human comprehension using an unimaginable amount of data and computational power.
A real "let them fight" moment, I'll laugh at whoever loses but I hope that this and other copyright suits strangle the plagiarism machine to death.
The only good outcome is this ending in the abolishment of copyright
That’ll never happen, so this is just another chapter in our boring dystopia. It will probably end with the most boring agreement or some sort of slow burn that ends up making life worse for everyone. AI bullshit is here to stay imo.
Lathe: Copyright remains, but big companies are allowed to use AI to squeeze labour regardless.
NYT is totally going to win this
Honestly I hope NYT goes bankrupt, they have always given me annoyingly smug elitist vibes. Where NY Post does transphobia overtly, NYT does transphobia 'respectfully'
In April 2020, the New York Times ran a story with a strong headline about the situation of press freedom in India: “Under Modi, India’s Press Is Not So Free Anymore.” In that story, the reporters showed how Modi met with owners of the major media houses in March 2020 to tell them to publish “inspiring and positive stories.”
When the case against NewsClick appeared to go cold, the New York Times – in August 2023 – published an enormously speculative and disparaging article against the foundations that provided some of NewsClick’s funds. The day after the story appeared, high officials of the Indian government went on a rampage against NewsClick, using the story as “evidence” of a crime. The New York Times had been warned previously that this kind of story would be used by the Indian government to suppress press freedom.
Is this just posturing before the Times has a large round of layoffs and gets an OpenAI subscription?
Cynicism aside, I would love if this actually hurts AI.
Consider this capitalist rivalry. OpenAI is basically undermining the writing sector. To cut costs, NYT would eventually have to adopt this, but it cheapens the content produced. It is mass-produced stuff that will eventually just be really dull. Sports Illustrated just recently fired their CEO for having already used AI-generated stuff.
Buzzfeed tried to replace their news department with ChatGPT and failed. They shut down the division instead. Even though Buzzfeed was already creating nothing original and just publishing a mashup of shit from other sources. Still too complicated for the dumbass plagiarism algorithms, which are basically incapable of producing anything that humans find interesting for more than a few seconds and a couple of "oohs and ahs". LOL.
Honestly?
Good news.
Though it didn't come from the right people.
How do we realistically feel this is gonna play out long term? I think there's no shot the old guard of the ruling class that wants to prevent AI slop from ruining everything wins over the very real incentive to embrace the AI gold rush. Feels like that ship sailed once AI generated articles kept getting lots of clicks even when they're devoid of content. This is very much vibes based though, I hope someone here can illuminate some other factors in their cost benefit analysis.
The cat's already out of the bag. I would be extremely surprised if the NYT gets what they want instead of a "win" where OpenAI pinky promises to stop using NYT content and pays $30 million in damages.
That's what I see as the most likely outcome, yeah, but isn't there gonna be a point where other corporations step in because cheap AI slop is a genuine threat to their bottom line?
In this case, NYT most likely is actually just looking for a cut of the money. Their claims in this are too absurd to actually hold up under scrutiny. Nobody is using ChatGPT to bypass NYT's paywall on whatever years-old content they actually have in their training data; people are using browser extensions for that. I would also want to know who is the target of the claims that ChatGPT hallucinating about NYT articles is damaging NYT's reputation.
One of the more significant things that could happen is that OpenAI could be forced to disclose details of their training data as part of discovery, which they really will not want to do. It would then be pretty easy to gauge exactly how overfit ChatGPT is. Supposedly GPT-4 has 1.1 trillion parameters, which would be around a terabyte or more in size depending on what precision they run it at, and I think 3.5 is closer to 350B. If the dataset has less entropy than the model parameters, it is effectively guaranteed to start spitting out exact copies of training data. The same details would also be very useful info for OpenAI's competitors, so OpenAI will try to get the suit dismissed or settle before then. Deleting the dataset like NYT is demanding is absolutely not going to happen, since at most NYT has standing to make OpenAI delete NYT articles from the training dataset. Finetuning the model to not comply with NYT-related requests would also be enough to get the model to no longer infringe on NYT's copyrights.
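For anyone who wants to check the size arithmetic in the comment above, here is a rough sketch. The parameter counts are just the figures quoted in that comment (unconfirmed, OpenAI has not published them), and the function name and precision table are made up for illustration. It only computes parameters times bytes per parameter, nothing about how the models are actually stored or served.

```python
# Back-of-the-envelope weight size: parameter count x bytes per parameter.
# ASSUMPTION: the parameter counts are the commenter's unconfirmed figures.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def model_size_tb(num_params: float, precision: str) -> float:
    """Approximate size of the weights in terabytes (1 TB = 10**12 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e12


if __name__ == "__main__":
    claimed = [("GPT-4 (claimed)", 1.1e12), ("GPT-3.5 (claimed)", 350e9)]
    for name, params in claimed:
        for precision in BYTES_PER_PARAM:
            size = model_size_tb(params, precision)
            print(f"{name} at {precision}: {size:.2f} TB")
```

So "around a terabyte or more" matches the 8-bit case for the claimed GPT-4 figure; at 16-bit precision the same count would be roughly double that.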
They might also be angling for government regulation with a lawsuit making bold claims that they expect to catch headlines and shape public opinion but don't completely expect to stick in court, since that's a recurring pattern in a lot of lawsuits against AI firms, like the Stable Diffusion lawsuit which contained absolute bangers like the claim that it stores compressed images just like JPEG compression and that the text-prompt interface "creates a layer of magical misdirection that makes it harder for users to coax out obvious copies of training images" (this is actually in the announcement for that lawsuit, I'm not making this shit up. It's really not surprising that most of that suit got thrown out).
There's no real endgame for them where they get anything further than a cut. AI companies can still train on copyright-free or licensed data and over time will get similar results, so there's not really anything that can be done to stop that in general. Copyright-reliant industries can certainly secure themselves a better position within that, though, where they might be able to gain either a steady income from licensing fees or exclusive use of their content for models under their control.
Their claims aren’t that absurd; their articles likely were all used for training data. You could make an argument that that is copyright violation anyway.
I don’t AGREE with copyright but I don’t think the concept is absurd, especially when you’ve already established that legally protecting information behind paywalls is allowed (also stupid).
Using it for training data is one thing, but that's not all that's being claimed. Merely using it isn't enough for it to be infringement because fair use can be a defense, and quite likely a viable one if it wasn't spitting articles out verbatim. People already do use copyrighted data from news sites verbatim for making new products that do something different, like search engines, or for other things that are of significant public interest, like archival. People also do republish articles from news sites, with or without attribution. So for the basic case of copyright infringement by training, NYT has to show that what ChatGPT is doing is more akin to that than it is akin to what a search engine does in order to get something that sticks in court.
They are, among other things, effectively asking for compensation as if people were using ChatGPT as an alternative to buying a NYT subscription, which is just the type of clown shit that only a lawyer could come up with. At the same time, they are also asking for compensation for defamation when it fails to reproduce an article and makes shit up instead. If this case keeps going, those claims are going to end up getting dismissed like a lot of the claims in the Andersen v. Midjourney/Stability AI/Deviantart case did. The lawyers involved know this. They're probably expecting the infringement for training to stick and consider the others to be bonuses that would be nice to have. Probably a door-in-the-face technique as well.
A settlement is probably more likely still, because at the end of the day OpenAI would much rather avoid going through what this case will require of them during discovery, and the most significant claim NYT has against them is literally demonstrating a failure mode of the model, which OpenAI will want to fix whether or not there are copyright issues involved (maybe by not embodying the "STACK MORE LAYERS" meme so much next time). After that's fixed, the rest of what NYT has against them will be much more difficult to argue in court.
How do we realistically feel this is gonna play out long term?
Which side has the most money to hire lawyers and bribe politicians?
That's a good question, right? You'd think that the established media tycoons like Murdoch would have the kind of pull to have killed this baby in the womb, but they didn't. Is that because they're confident they can adapt to it?
The more I think about this, the more I wonder if it's all an elaborate play by the media companies to get the tech companies to buy them out. The tech companies have ridiculously huge cash reserves, and media companies' stocks aren't nearly as valuable as people think. For example, the New York Times has a market cap of $8 billion USD, and made a profit of $90 million USD in their July/August/September 2023 quarter. Apple made $23 billion USD in profit in that same quarter, has a market cap of $3 trillion USD, and has cash reserves that would make Scrooge McDuck envious.
Imagine if all these legal fights over AI scraping are the media industry's way to say to the tech companies "Hey, the data we have the rights to is incredibly valuable to your AI work. We could tie you up in court for years, setting you well behind your competitors. Wanna make a bid?"
That's totally valid, but what about the Disneys, the Universals, and the Sonys? Not all media companies are made equal, and there's a lot of inertia behind those giants despite the falling rate of profit.
Have you SEEN what Disney has been making lately? They'd gladly pivot to AI slop the second it matches their declining quality standards.
Of course it's just an idea. It's probably also a plan that would appeal more to print media companies that have doubts about long-term profitability and stand to lose a lot from text-generation AIs.
You can say what you want about AI art, I think it’s inherently designed to reproduce societal attitudes in its current form, but it’s arguably still an art form despite that.
But, AI writing is basically irredeemable. Only case I can think of for it to have any purpose is in helping disabled people communicate or express themselves. Other than that, it’s literally just a magic lying machine (I don’t care if they’ve decreased the amount of lies, that just makes it more unexpected when it does)
NYT sucks though, fuck em