Some context on this: https://arstechnica.com/information-technology/2023/08/openai-details-how-to-keep-chatgpt-from-gobbling-up-website-data/
The robots.txt would be updated with this entry:
User-agent: GPTBot
Disallow: /
Obviously this is meaningless against non-OpenAI scrapers or anyone who just doesn't give a shit, since robots.txt is purely advisory and nothing enforces it.
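For anyone who wants to sanity-check the entry, here is a rough sketch using Python's standard-library `urllib.robotparser`. It only models what a well-behaved crawler would conclude from the file. The `example.org` URL and the `SomeOtherBot` user agent are placeholders I made up for illustration, and `is_allowed` is a hypothetical helper, not part of any library.

```python
from urllib.robotparser import RobotFileParser

# The proposed robots.txt entry, verbatim.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that honors robots.txt may fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL and a made-up second user agent, for illustration only.
print(is_allowed("GPTBot", "https://example.org/docs/"))        # False
print(is_allowed("SomeOtherBot", "https://example.org/docs/"))  # True
```

The second result is the point made above: the rule names only GPTBot, so every other user agent is still allowed, and a scraper that never reads robots.txt is not affected at all.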
I can understand privacy concerns, but I feel like it's inevitable that LLMs will be used to make lots of decisions, some possibly important, so wouldn't you want some content included in their training? For instance, would you want an LLM to be ignorant of FOSS because all the FOSS sites blocked it, and then a child asks an LLM for advice on software and gets recommended only Microsoft and Apple products?