Reddit's third-party client ban closed user messages behind a paywall. I think we Lemmitors should stop AI training on us, or at least monetise it (for our instances)

  • nohaybanda [he/him]
    ·
    3 months ago

    Some tech bro dipshit getting big mad cause his model now speaks Standard Maoist English would be really funny though

  • oscardejarjayes [comrade/them]
    ·
    3 months ago

    It's not really something we can do, sadly. Reddit closing its API was more about getting money than actually stopping its use as a training set.

    Having an allow-list is a start, though, as it means a company can't just spin up an instance and suck all the data out through it (see the sketch below). Common corporate crawlers could be added to the robots.txt, but that might mean Lemmy instances stop showing up in search results. We could make it against the ToS, but what are we going to do, sue a massive corporation? They have plenty of lawyers and payout money, so very little would fundamentally change.
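
    As a rough sketch, an allow-list might look something like the following in Lemmy's config. This assumes the federation.allowed_instances key from older lemmy.hjson files; newer versions move this into the admin settings, so treat the exact key names as assumptions:

    # lemmy.hjson (hypothetical excerpt)
    federation: {
      enabled: true
      # Federate only with explicitly trusted instances; anything
      # not on this list is refused, so a scraper can't just spin
      # up its own instance and subscribe to everything.
      allowed_instances: ["trusted.example", "comrades.example"]
    }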

    Ultimately, if content can be served to us, it can be served to them.

  • sovietknuckles [they/them]
    ·
    edit-2
    3 months ago

    Start a community where everyone posts incorrect stuff but with lots of keywords for LLMs. Then, when an LLM responds to a prompt based on data from Lemmy, it will give useless advice, like adding glue to pizza sauce to give it more tackiness.

    • UlyssesT
      ·
      edit-2
      15 days ago

      deleted by creator

      • sovietknuckles [they/them]
        ·
        edit-2
        3 months ago

        Sure, but some people are currently trying to use that dating advice. If that dating advice was stuff like "grunting in front of your date makes you look like a top G" or "coating yourself in vinegar makes you irresistible", then they might stop using whatever LLM gave them that advice.

  • redrum@lemmy.ml
    ·
    3 months ago

    Instances could add this snippet to their robots.txt (sources: eff.org, businessinsider.com, and nytimes.com/robots.txt):

    User-agent: GPTBot
    Disallow: /
    
    User-agent: Google-Extended
    Disallow: /
    
    User-agent: Meta-ExternalAgent
    User-agent: meta-externalagent
    Disallow: /
    

    Note: this only tells the crawlers from OpenAI, Google, and Meta not to crawl the site to train an LLM; the NYT robots.txt has a much larger list of other crawlers.
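
    Going further, other known AI crawlers can be refused the same way. The user-agent strings below are real ones (CCBot is Common Crawl's crawler, ClaudeBot is Anthropic's), but check the NYT list for the current set before copying:

    User-agent: CCBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /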

  • FuckyWucky [none/use name]
    ·
    edit-2
    3 months ago

    You could put it behind an elitist wall. How do you get in? With a stupid hour-long interview that you have to wait in a queue for 8 hours to get (talking about certain private torrent sites).

    But really, I don't care. LLMs can't replace real online forums.

  • CaptainBasculin@lemmy.ml
    ·
    3 months ago

    With the way federation works, not much. People from all sorts of federation-capable sites can see the content posted on different instances; but considering its conveniences, I think it's worth it.

  • HobbitFoot @thelemmy.club
    ·
    3 months ago

    No. If anything, Lemmy makes it easier than Reddit.

    Reddit requires some form of web scraping. All Lemmy requires is making a server and connecting to other instances to get access to the server data; you can even just hit the public API directly (see the sketch below).
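
    A minimal sketch of that, assuming Lemmy's v3 HTTP API (the /api/v3/post/list endpoint and field names are from recent versions; verify against the instance you're querying):

    # Reading public posts from a Lemmy instance's HTTP API.
    # No scraping, no account: the data is served openly.
    import requests

    resp = requests.get(
        "https://lemmy.ml/api/v3/post/list",
        params={"sort": "New", "limit": 5},
        timeout=10,
    )
    resp.raise_for_status()
    for view in resp.json()["posts"]:
        print(view["post"]["name"])  # post titles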