Consider https://arstechnica.com/robots.txt or https://www.nytimes.com/robots.txt and how they block all the stupid AI models from being able to scrape for free.
No, robots.txt doesn’t solve this problem. Scrapers just ignore it. The idea behind robots.txt was to be nice to the poor Google web crawlers and steer them away from useless stuff that was a waste to index.
A crawler could still be fastidious and follow every link; it would just be ignoring the “nothing to see here” signs.
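To make the "it's only a sign" point concrete, here is a rough sketch of what honoring robots.txt looks like from the crawler's side, using Python's standard urllib.robotparser. The rules in RULES are made up for illustration and are not copied from either site's actual file; example.com and the allowed() helper are likewise just placeholders.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only, NOT the real contents of any site's robots.txt.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

def allowed(user_agent, url, rules=RULES):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)

# A polite crawler runs this check before every request and skips the URL
# when it returns False. Nothing on the server side enforces the answer.
if __name__ == "__main__":
    print(allowed("GPTBot", "https://example.com/article"))
    print(allowed("SomeBrowser", "https://example.com/article"))
```

The whole mechanism lives in the client. A scraper that never calls anything like allowed() sees exactly the same pages as one that does.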
You beat scrapers with recursive loops of links that start from 4pt black-on-black divs and lead to pages whose content isn’t easily told apart from useful human-created content.
Traps and poison, not asking nicely.
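A minimal sketch of that kind of trap, assuming a hypothetical /maze/ URL prefix and a toy word list (both invented here, as are the names maze_page and ENTRY_LINK). Each page is generated from a hash of its own path, so the maze is endless but needs no storage, and every page links only to more maze pages.

```python
import hashlib
import random

# Toy vocabulary, an assumption for the sketch. Real filler text would have
# to be far more convincing to pass for human writing.
WORDS = ["server", "review", "update", "network", "policy", "market",
         "report", "design", "signal", "release", "battery", "launch"]

# The hidden way in: a tiny black-on-black link a human never sees or clicks.
ENTRY_LINK = ('<div style="font-size:4pt;color:#000;background:#000">'
              '<a href="/maze/">archive</a></div>')

def maze_page(path, prefix="/maze/", n_links=5, n_words=40):
    """Return (html, links) for a maze URL, derived deterministically from path."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    text = " ".join(rng.choice(WORDS) for _ in range(n_words))
    # Every outgoing link points back into the maze, so a crawler never leaves.
    links = [f"{prefix}{rng.randrange(16 ** 8):08x}" for _ in range(n_links)]
    anchors = "\n".join(
        f'<a href="{href}">{rng.choice(WORDS)}</a>' for href in links
    )
    html = f"<html><body><p>{text}</p>\n{anchors}\n</body></html>"
    return html, links

if __name__ == "__main__":
    html, links = maze_page("/maze/")
    print(links)
```

A web server would call maze_page() for any request under the prefix. The random word salad here is trivially distinguishable from real writing, so this only illustrates the link structure, not the "poison" part.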