CSAM exists on the internet, and the web crawlers that hoover up data aren't particular about their source of material. It's just one big vacuum absorbing every image or text file that isn't behind some kind of security.
Consequently, there's a ton of training data on how to reproduce and replicate these images.
That data goes through a lot of filtering and classification before it actually gets set up as a training dataset, though. LAION, for instance, has an NSFW classifier score for each item, and people training foundation models will usually drop items that score high on that. The makers of the dataset almost certainly ran it against the NCMEC database at the very least, and anything in it that was discovered and taken down since then won't be accessible to anyone pulling images, since the dataset is just a set of links. People who make foundation models do NOT want to have it get out that they may have had that in their training sets and tend to aggressively scrub NSFW content of all kinds.
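(To make that filtering step concrete: below is a minimal sketch of what "drop items that score high" can look like in practice, assuming a LAION-style metadata parquet with a punsafe score column -- the exact file layout and column names vary between releases, so treat this as illustrative rather than anyone's actual pipeline.)

    # Minimal sketch: drop rows whose NSFW classifier score exceeds a cutoff.
    # Assumes a LAION-style metadata shard with a "punsafe" column; real
    # releases spread metadata across many parquet files and the column
    # names differ between versions.
    import pandas as pd

    CUTOFF = 0.98  # keep only items the classifier isn't flagging as unsafe

    meta = pd.read_parquet("laion_metadata_shard.parquet")
    kept = meta[meta["punsafe"] <= CUTOFF]

    print(f"kept {len(kept)} of {len(meta)} rows; "
          f"dropped {len(meta) - len(kept)} with punsafe > {CUTOFF}")
    kept.to_parquet("laion_metadata_shard_filtered.parquet")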
It's a lot more likely that people are just relying on the AI to combine concepts to make that stuff (contrary to what everyone is saying, you don't actually need something to be in the dataset if it can be produced as a combination of two concepts the model already knows), or that they are training it in themselves with a smaller dataset, which sounds like what might have happened here. If there were anything like that in any of the common datasets used for training a large model from scratch, it would almost certainly have been found and reported by now; the only way that could conceivably not happen is if its captions have nothing to do with the content, which would not only get the image dropped on its own but would also make it entirely useless for training.
people training foundation models will usually drop items that score high on that
Not if they're freaks. And it isn't as though there's a shortage of that in the tech world.
People who make foundation models do NOT want to have it get out that they may have had that in their training sets and tend to aggressively scrub NSFW content of all kinds.
I doubt the bulk of them give it serious thought. The marketing and analytics guys, sure. But when that department is overrun with 4chan-pilled Musklovites, even that's a gamble. Maybe someone in Congress will blow a gasket and drop the hammer on this sooner or later, like Mark Foley did back in the '00s under Bush. But... idk, it feels like all the tech giants are untouchable now anyway.
they are training it in themselves with a smaller dataset which sounds like what might have happened here
Also definitely possible, which would put us much closer to "this is obviously criminal misconduct because of how you obtained the dataset" than to "this is obviously morally repugnant but lives in a legal gray area".
I mean, I've seen what OpenAI did for their most recent models: they spent about half of the development time cleaning the dataset, and yes, a large part of their reason for that is that they want to present themselves as "the safe AI company" so they can hopefully get their competitors regulated to death. Stable Diffusion openly discloses what they use for their dataset and what threshold they use for NSFW filtering, and for SD2.0 they used such an aggressive filter (dropping everything with a score above 0.1, where most would use something like 0.98 -- for example, here is an absolutely scandalous photo that would be filtered under that threshold, taken from a dataset I assembled) that it actually made the model significantly worse at rendering humans in general, and they had to train it for longer to fix it. Like... all the evidence available points to them being, if anything, excessively paranoid.
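(For a rough sense of why a 0.1 cutoff is so much more destructive than a 0.98 one, here's a small sketch with simulated scores -- the distribution is made up, since the real one depends on the classifier and the crawl, so the exact percentages are illustrative only.)

    # Rough sketch: how much of a crawl survives different punsafe cutoffs.
    # The scores are simulated with an arbitrary skewed distribution as a
    # stand-in for a real NSFW classifier; the point is only that a 0.1
    # cutoff also throws away lots of borderline-but-benign images of people.
    import numpy as np

    rng = np.random.default_rng(0)
    scores = rng.beta(0.5, 8.0, size=1_000_000)  # hypothetical classifier scores

    for cutoff in (0.98, 0.9, 0.5, 0.1):
        kept = (scores <= cutoff).mean()
        print(f"cutoff {cutoff:>4}: {kept:6.1%} of images kept")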
Cool, you learn a new horrible truth about the corrupt underbelly of our civilization every day