
  • drhead [he/him] · 9 months ago

    Alright, Hexbear resident AI poster here, and person who read the paper:

    This works by attacking the part of a model that extracts features (think high-level concepts you can describe with words) from images. For Stable Diffusion, this is CLIP (SD 1.5), OpenCLIP (SD 2.1), or both (SDXL, for some reason); other models can use different extractors, like DeepFloyd IF using T5 (which was also tested in this paper). The attack takes a source image and a generated image whose features are very different from it. It extracts the features from that second image, then perturbs the first image so that it still looks close to what it already looked like, applying the minimum alterations necessary to make the feature extractor think it looks like something else.
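    As a minimal sketch of the general idea: gradient descent pulls the extractor's features toward a decoy image's features, while a hard per-pixel budget keeps the picture looking unchanged. Everything below is illustrative -- a random linear map stands in for CLIP, and `eps`, the step size, and the iteration count are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a feature extractor like CLIP's image encoder:
# a fixed random linear map from a 64-"pixel" image to 16-dim features.
W = rng.normal(size=(16, 64))

def extract(img):
    return W @ img

source = rng.uniform(0, 1, size=64)                # the image to be protected
target_feat = extract(rng.uniform(0, 1, size=64))  # features of a very different image

eps = 0.05    # max per-pixel change, so the image still looks the same
step = 0.01
x = source.copy()

for _ in range(200):
    # gradient of ||extract(x) - target_feat||^2 w.r.t. x (analytic for a linear map)
    grad = 2 * W.T @ (extract(x) - target_feat)
    x = np.clip(x - step * np.sign(grad),          # signed-gradient step
                source - eps, source + eps)        # stay inside the perturbation budget
    x = np.clip(x, 0, 1)                           # stay a valid image

dist_before = np.linalg.norm(extract(source) - target_feat)
dist_after = np.linalg.norm(extract(x) - target_feat)
```

    The real attack uses a perceptual similarity budget rather than a plain per-pixel box, but the structure is the same: minimize feature distance to the decoy subject to a visibility constraint.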

    The resulting image will look like it has artifacting on it, so it should be noticeable to anyone manually looking for adversarial noise -- but on larger datasets, nobody has time for that. It may ruin the look of images that are supposed to be flat-colored (while also being easier to remove from those images by thresholding). This was a very common complaint with Glaze, and it seems that things have not improved much on that front.
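    A toy illustration of why flat-colored images are the easy case for cleanup: if the art only uses a few known color levels, snapping each pixel to the nearest level removes any perturbation smaller than half the level spacing. The palette and noise amplitude here are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(4)

flat = np.full((32, 32), 0.25)    # a flat-colored region of the artwork
# bounded perturbation, smaller than half the palette spacing below
perturbed = flat + 0.06 * rng.uniform(-1, 1, size=flat.shape)

palette = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # the artwork's known color levels
# snap every pixel to the nearest palette level
idx = np.argmin(np.abs(perturbed[..., None] - palette), axis=-1)
restored = palette[idx]
```

    On a textured photo there is no known palette to snap back to, which is why this kind of trivial cleanup only works on flat regions.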

    As for how effectively this can be filtered or counteracted, I have some doubts about what the paper says about countermeasures. A lot of this is going to be speculation until they actually release a model, because currently nobody besides the authors can generate Nightshade-poisoned images to test with. One of the methods they tested is checking CLIP similarity scores between captions and images in the dataset -- this makes the most sense as a countermeasure, since the caption/image mismatch is the vector of the attack and similarity filtering is already done to drop poor-quality matches. They compared an attack where the wrong captions are simply given ("dirty-label") to one using their method. They claim that CLIP filtering on the dirty-label attack has 89% recall at a 10% false positive rate, with the control data being a clean LAION dataset.

    I have worked with LAION. I filter and deduplicate any data that I pull before actually using it, and I can say from experience that that 10% is most likely not all false positives -- LAION contains a lot of low-quality caption/image matches, plus plenty of images that have been replaced with placeholders. The last threshold I used ended up dropping about 25% of the dataset. So when they report 47% recall at 10% FPR for filtering Nightshade images the same way, I am inclined to believe they used a threshold that is too lenient. Notably, they do not disclose the threshold used, and they clearly tested only one.
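    For context on why the undisclosed threshold matters: the filter is just a cutoff on caption-image embedding similarity, and recall and FPR both move with that cutoff. A toy sketch with random vectors standing in for real CLIP embeddings (the dimensions, noise level, and `threshold` are all invented, and the resulting numbers are not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(a, b):
    # row-wise cosine similarity between two batches of embeddings
    return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

n, dim = 1000, 32
# simulated caption/image embedding pairs: clean pairs point roughly the same
# way; poisoned pairs are images whose features match a *different* concept
caption = rng.normal(size=(n, dim))
clean_img = caption + 0.5 * rng.normal(size=(n, dim))   # well-matched pairs
poison_img = rng.normal(size=(n, dim))                  # mismatched features

sim_clean = cosine(caption, clean_img)
sim_poison = cosine(caption, poison_img)

threshold = 0.3   # assumed cutoff; the paper does not disclose theirs
recall = np.mean(sim_poison < threshold)   # fraction of poisoned images caught
fpr = np.mean(sim_clean < threshold)       # fraction of clean images wrongly dropped
```

    Sweeping `threshold` traces out the whole recall/FPR curve, which is why reporting a single undisclosed operating point tells you very little.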

    A second concern is that no form of image transformation attempting to remove the adversarial noise is covered. It's difficult to test things on this front without them releasing the model or any sort of public demo, but I know some people who have had success making AI-generated images pass as not AI-generated by using an Open Image Denoise pipeline (where you add some amount of Gaussian noise to the image and then let a deep-learning-based filter remove it). I strongly suspect this would also work for removing the adversarial noise, and I (and probably others) will try to test that. There was a widely publicized 12-line Python program that removed Glaze, so it's actually somewhat concerning that the authors wouldn't want to get ahead of speculation on this front. The result also doesn't need to look pretty if we're limiting the scope to filtering the dataset: probably one of the better ways to counteract it would be to find some sloppy transformation that wipes out the noise while leaving the rest of the image recognizable, then check how much the transformed image differs from the original (potentially poisoned) one.
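    The add-noise-then-denoise idea can be sketched with a mean filter standing in for the learned denoiser (Open Image Denoise is a deep-learning filter, and the amplitudes and kernel size here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

def box_blur(img, k=3):
    # crude denoiser stand-in: k x k mean filter over the image
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    h, w = img.shape
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

clean = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))   # smooth base image
# high-frequency adversarial-style perturbation (structure invented here)
adversarial = clean + 0.08 * rng.choice([-1.0, 1.0], size=clean.shape)

# step 1: drown the carefully structured perturbation in fresh Gaussian noise
noised = adversarial + 0.05 * rng.normal(size=clean.shape)
# step 2: denoise; high-frequency content, perturbation included, is smoothed away
recovered = box_blur(noised, k=3)

err_before = np.mean((adversarial - clean) ** 2)
err_after = np.mean((recovered - clean) ** 2)
```

    The fresh noise swamps the structured perturbation, and the denoiser has no way to reconstruct it -- it only knows how to recover plausible image content.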

    Third, it doesn't seem that they've covered what happens if CLIP is unfrozen during training. This isn't something you can always do (training from scratch, for example, requires CLIP to be frozen so the diffusion component can get aligned to it, and I have noticed that CLIP can suffer some pretty severe damage if you make certain drastic changes to the model with it unfrozen), but it's common practice for people training LoRAs or full finetunes of SD to unfreeze CLIP so it can learn to interpret text differently. If you unfreeze CLIP and start passing it images of cats that, to the current model, look like "picture of a dog" (with probably some aspects of "picture of a cat"), then training tells the model it is wrong whenever it treats the picture of a cat like a picture of a dog, and the weight updates differentiate those concepts better -- which should, in theory, render Nightshade ineffective over time. Again, this is not explored in the paper. It also isn't guaranteed to be free of side effects, because I have seen CLIP damaged by certain training regimens without anyone actively trying to damage it.
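    A toy model of the frozen-vs-unfrozen difference, with a linear map standing in for CLIP and logistic regression standing in for everything downstream -- all invented for illustration. The "cat" and "dog" images are built so the current extractor gives them identical features (the poisoned situation); a frozen extractor then can never separate them, while an unfrozen one receives gradient updates that re-differentiate the concepts:

```python
import numpy as np

rng = np.random.default_rng(3)

dim, feat = 8, 4
W = rng.normal(size=(feat, dim))           # toy "CLIP" image encoder

# Craft cat/dog images whose difference lies in W's null space, so the
# current extractor assigns them *identical* features (the poisoned state).
null_dir = np.linalg.svd(W)[2][-1]         # a direction W cannot see
base = rng.normal(size=dim)
x_cat, x_dog = base + null_dir, base - null_dir

def train(unfreeze, steps=2000, lr=0.1):
    Wt, v = W.copy(), 0.1 * np.ones(feat)  # extractor weights + small head
    X = np.stack([x_cat, x_dog])
    y = np.array([1.0, 0.0])               # the *correct* labels
    for _ in range(steps):
        f = X @ Wt.T                       # extracted features
        p = 1.0 / (1.0 + np.exp(-(f @ v))) # head's prediction
        g = (p - y) / len(y)               # logistic-loss gradient
        v -= lr * (f.T @ g)
        if unfreeze:                       # let gradients reach the extractor
            Wt -= lr * np.outer(v, g @ X)
    f = X @ Wt.T
    return 1.0 / (1.0 + np.exp(-(f @ v)))

p_frozen = train(unfreeze=False)
p_unfrozen = train(unfreeze=True)
```

    With the extractor frozen, the two images are literally indistinguishable downstream, so no amount of training on the head separates them; once unfrozen, the extractor itself learns to tell them apart, which is the mechanism that should erode the poison over time.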

    As a final note -- part of what enabled this attack to be developed was open-source AI models that can be run locally, where researchers can actually examine the individual components in some detail. Any attack like this is going to be far less effective against models like Midjourney, DALL-E 3, or Imagen, because we only know what those companies disclose about them and have no way of even running them locally, let alone figuring out what something will do to training. I would be cautious about declaring this a victory: developments like this have more potential to tilt things in favor of large AI companies than to slow down development of the whole field, and a lot of the larger companies are aware of this.