
initial thoughts on stable diffusion's dataset

marrow #5 | posted 07.09.2022 | last updated 19.12.2022

i've come across this little article about stable diffusion's training dataset. unlike openAI with dall-e, stability is rather transparent about this stuff, which is great. so. i wanna talk about the dataset — or rather, the fraction of the dataset that's been organized and can be browsed. read the article first and then come back so i don't have to paraphrase it all lmao


the fraction we can browse is composed of 12 million image-caption pairs, or "2% of the 600 million images used to train the most recent three checkpoints". so it's a lot but doesn't even scratch the surface.

the way this data was collected (web scraping for img tags with alt text; captions are mostly in english; there's a little sketch of what that looks like below) absolutely shows; you can notice a few things:

the article says the largest number of images comes from pinterest, and yeah, you can see that. shopping sites, stock photos, and stuff like blogspot and flickr are also heavy contributors. but since even the non-pinterest stuff is the kind of stuff that ends up on pinterest anyway, you could honestly just say stable diffusion is trained on pinterest soup. it's hyperbole, but ehh, not by much? that's just my opinion though!
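just to make the collection method concrete, here's a tiny illustrative sketch of what scraping img tags with alt text looks like. to be clear, the url is made up, and LAION's actual pipeline parses huge Common Crawl dumps rather than fetching live pages one by one, so take this as a toy version of the idea:

# illustrative only: how alt-text scraping produces image-caption pairs
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/some-random-page", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

pairs = []
for img in soup.find_all("img"):
    src = img.get("src")
    alt = (img.get("alt") or "").strip()
    if src and alt:  # keep only images that actually have alt text
        pairs.append((src, alt))  # (image url, "caption"), sludge and all

print(pairs[:5])

and that's it: whatever a website happened to put in its alt attribute becomes the "caption", which explains a lot about the quality of the text in this dataset.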


well so now! what do i think about this? it's... kinda tricky. on the one hand, the idea of web scraping itself can seem rather scary (but it's also what makes search engines work!). it also makes for a bit of a shit dataset, i'll get there. on the other... well let's talk about copyright and permission?

for starters, here's an interesting video about copyright abolition. if you've been around art social media for longer than 24 hours, and especially if you had to endure the height of NFTs, you know damn well copyright won't do anything for you. it's there to protect massive media monopolies' profits; no one gives a shit if someone's reposting your art on twitter, and then that gets pinned to pinterest, and that in turn gets reposted somewhere else, etc. hell, a bunch of stuff in this dataset is just that. maybe i'm just a nobody online artist, but what are you losing? money? clout? sure, i wouldn't like my art to be shared around without its context either, but it's interesting to interrogate why that is.

but in any case, that's neither here nor there, because that's not what stable diffusion does. you see, a model doesn't store image data. here's a great non-technical explanation, but essentially the images become mathematical mush, and it's a lossy process, meaning the original images comprising the training data aren't exactly in the model at all. wanna see that in practice? looking at the dataset, you can find two images captioned Two Young Women Kissing, Leopold Boilly 1790-1794 (and a few extra words). here are two images generated with simple stable from this prompt (as 50% quality jpegs for compression):

AI-generated image; profile view of two white women in grey dresses, kissing against a dark background, styled like a classical oil painting AI-generated image; profile view of two white women, bare shoulders, kissing against a dark background, styled like a classical oil painting
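(for the curious: a minimal sketch of what generating these looks like with the diffusers library. this isn't exactly what i ran at the time, and the model id and parameters are just the usual defaults, but it's the same idea: caption in, images out.)

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# caption lifted straight from the dataset, used as the prompt
prompt = "Two Young Women Kissing, Leopold Boilly 1790-1794"
images = pipe(prompt, num_images_per_prompt=2, guidance_scale=7.5).images

for i, img in enumerate(images):
    img.save(f"kissing_{i}.jpg", quality=50)  # 50% quality jpegs, like above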

as you can see, these images fit the prompt rather well, but are far from being copies of the original! you see, the dataset also has several other image-text pairs including "women", "kissing", "1790", so of course the image gets mixed with other stuff inside the black box. the prompt doesn't include anything about the background / room, so it just focuses on the kiss instead, changing the composition accordingly. and this was not the only prompt i've tried! it's basically impossible to pluck a single image out of the model. the only way to modify a specific stolen image with one of these generators is by directly feeding it in as an initial image — and that's got nothing to do with the generator itself, and has been done through tracing and photoshopping for as long as there's been art online.
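(that "initial image" workflow is the img2img pipeline, where you explicitly hand the model the specific image you want it to riff on. another hedged diffusers sketch; the filename and strength value are just placeholders:)

from PIL import Image
import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# the specific image being "modified" has to be supplied explicitly
init = Image.open("some_specific_image.jpg").convert("RGB").resize((512, 512))

out = pipe(
    prompt="two young women kissing, oil painting",
    image=init,        # older diffusers versions call this init_image
    strength=0.4,      # lower = output stays closer to the init image
    guidance_scale=7.5,
).images[0]
out.save("img2img_result.png")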

so like. dataset ethics is its own can of worms, as is web scraping and the collection of all this "publicly" available data! there's a huge discussion there: archiving is acceptable use, but then can't we use the archive? do we need to use the archive? how can we find alternatives to have easily accessible generators without resorting to massive and indiscriminate data harvesting? the technological cat is out of the bag, where do we go from here? there's a lot of stuff that i'm just glossing over here, because my point right now is that these image generators are not automated art thieves. all this without even having to discuss art history! (but here's a comic by jaakko pallasvuo that touches on that!)


this release is the culmination of many hours of collective effort to create a single file that compresses the visual information of humanity into a few gigabytes.

this is what the stability team says in the stable diffusion public release. as i've said before, AI researchers looooove to attempt to make a map that's the size of the territory (i'm pretty sure that's the borges story about the empire's 1:1 map, on exactitude in science, though i always want to go looking for it in invisible cities, and i don't have a physical copy. boo). we all know that's impossible. this model is trained on a snapshot of a particular section of the world: internet images, captioned in english, supposedly after a filter attempts to get rid of the lowest-fidelity captions. it's an intrinsically flawed dataset, because all generalist datasets have to be.

still, i think this is a way better model / dataset than the heavily censored dall-e stuff, especially when it comes to artistic freedom. the world is messy, it includes nudity and sex and celebrities' faces and blood and trademarked characters! i don't want to defend making vile stuff or whatever but you get what i mean, right? its potential will always be shaped by the fact it relies on random ass internet images that were not captioned with machine learning in mind (many of them aren't even written with accessibility in mind!). it is pinterest goop! it carries biases as well, as they all do, and as the bulk of the internet does... it's a very white dataset, for starters, and focusing on english-language captions also has its impact. this stuff always needs to be addressed when it comes to machine learning, lest we tech-wash harmful worldviews. but still, nice looking art can be cajoled out of it (let's please not get into a "what is art" discussion).

i don't know how to close this off. i recommend you poke around the dataset, it's pretty interesting. i just wanted to talk about it, especially because it's allowed me to expand and rethink things i've said before about plagiarism and ML. um yeah. i might come back to this later but for now these are my thoughts on the subject!


update: diffusion art or digital forgery?

ok so today (december 19) (right after waking up, it was one of the first things i read ← nerd) i came across this paper called diffusion art or digital forgery? investigating data replication in diffusion models, which is basically what would happen if someone took this article of mine and made it Science, which is very cool for me! please give it a read (it's fairly short; though they should have compressed it with ilovepdf beforehand, it's 30 whole megabytes). i'll be quoting from the paper as well, but i don't want to mischaracterize what they're saying (please don't think i'm trying to "debunk" it holy shit), and i won't go into the more technical matters (because i'm out of my depth in that regard), which is why i encourage you to read it yourself. also it has the images, of course.

anyway, from the abstract:

In this work, we study image retrieval frameworks that enable us to compare generated images with training samples and detect when content has been replicated. Applying our frameworks to diffusion models trained on multiple datasets including Oxford flowers, Celeb-A, ImageNet, and LAION, we discuss how factors such as training set size impact rates of content replication. We also identify cases where diffusion models, including the popular Stable Diffusion model, blatantly copy from their training data.

they don't attempt to answer the titular question, which i think is a good call because that's its own can of worms etc. (although their wording choices throughout kiinda give away their opinions i think, but that's neither here nor there). i probably won't either (and if i do, remember i am Just Some Guy)! i want to 1. talk about their findings and how they relate to my experiences from september and 2. reflect on how this new information impacts my thoughts on the matter of art specifically. the Current Discourse Zeitgeist makes this a little uncomfortable — do i have to address That fundraiser (tangent but i've for a while now distrusted a certain big name deviantart artist and it's kinda sad that i was right to) — but it is fun stuff and i won't turn it into a rant, i prommy ^v^

the paper

Replicants are either a benefit or a hazard; there may be situations where content replication is acceptable, desirable, or fair use, and others where it is “stealing.” While these ethical boundaries are unclear at this time, we focus on the scientific question of whether replication actually happens with modern state-of-the-art diffusion models, and to what degree.

what the team does is: figure out how to detect content replication, and then search for that behavior in a few different models and datasets. they find more replication in models trained on small and medium datasets, and not as much in the large ImageNet, but stable diffusion is trained on a large dataset and still exhibits a lot of replication. like myself, they use the smaller laion aesthetics subset, because the whole dataset is just that massive. here's their definition of replication:

We say that a generated image has replicated content if it contains an object (either in the foreground or background) that appears identically in a training image, neglecting minor variations in appearance that could result from data augmentation.

they're focusing on what would be likely to become subject to a copyright / intellectual property dispute; it's the "automated collage machine" stuff, basically. they don't quite address the artstation-digital-concept-artists-from-california elephant in the room, showcasing mostly photos rather than artwork (possibly because product photographs are way way more frequent / duplicated in the dataset, possibly because of ethical questions? idk), which is a bit of a shame for our purposes of discussing art, and makes some of the images kinda hard for me to judge. a close up photo of a daisy is a photo of a daisy is a photo of a daisy, and these datasets (celebrity faces, the flowers) were never really meant for art or whatever. anyway!

i'm gonna focus specifically on their case study of stable diffusion here because. you know, it's what ties in with my article, it's the most interesting part of the paper and it's the hot new thing.

We attempt to answer the following questions in this analysis. 1) Is there copying in the generations? 2) If yes, what kind of copying? 3) Does a caption sampled from the training set produce an image that matches its original source? 4) Is content replication behavior associated with training images that have many replications in the dataset?

so! figure 7 of this paper is a fun little table! it's a tiny sample of the 170 generated images with a similarity score above 0.5 (wish they'd release more of the images, i'm curious!), and you mostly can't really deny these are copies. the thing is, they're not copies of the images they "should" be copying: "While all synthetic images were generated using captions sourced from LAION, none of the generations match their respective source image." which is fascinating, but also: if you've looked at the organized fraction of the dataset on that link above, the captions are kinda abysmal sludge instead of proper image descriptions, because of the nature of how this dataset was constructed.
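(quick aside on what a "similarity score" means here: the paper embeds generated images and training images with a copy-detection feature extractor, SSCD, and takes the cosine similarity between the vectors. here's a tiny sketch of the same idea with CLIP image features standing in for SSCD, since i won't pretend i know their exact setup; the filenames are just illustrative.)

# hedged sketch: cosine similarity between image embeddings, CLIP standing in for SSCD
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(images):
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)  # normalize so dot product = cosine

generated = embed([Image.open("generated.png")])
training = embed([Image.open(p) for p in ["laion_00001.jpg", "laion_00002.jpg"]])

scores = generated @ training.T  # one similarity score per training image
print(scores)  # anything suspiciously close to 1.0 is a likely (partial) copy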

In those 170 images, we find instances where replication behavior is highly dependent on key phrases in the caption. We show two examples in Figure 10 and highlight the key phrase in red. For the first row, the presence of the text Canvas Wall Art Print frequently (≈ 20% of the time) results in generations containing a particular sofa from LAION (also see Fig 1). [...]

idk just wanted to mention that. again it just makes a ton of sense to me, considering the source of these image-caption pairs, that there'd be a lot of automatically generated mockup product photos from sites like redbubble et al. these key phrases (like the infamous "[trending on] artstation") are probably repeated enough that they become overpowering, whereas simply copying a caption might not do much (see my extremely unscientific experiments above).

Surprisingly, we see that a typical random image from the dataset is duplicated 11.6 times, which is more often than a typical matched image, which is duplicated 3.1 times. However, if we look only at very close matches (> .5 SSCD), these match images are replicated on average 34.1 times – far more often than a typical image. It seems that replicated content tends to be drawn from training images that are duplicated more than a typical image.

this ties in with the previous quote. there are a lot of duplicate images in this dataset, that's just empirical data. nice to see it backed up with Numbers, wow. anyway, we / they don't really have an answer as to why stable diffusion has so much replication, but they speculate it's a mix of 1. being text (rather than class) conditioned, 2. the skewed distribution of repetitions in the training set and 3. overfitting. i don't really know enough about any of this to comment though.

some conclusions

The goal of this study was to evaluate whether diffusion models are capable of reproducing high-fidelity content from their training data, and we find that they are. While typical images from large-scale models do not appear to contain copied content that was detectable using our feature extractors, copies do appear to occur often enough that their presence cannot be safely ignored; Stable Diffusion images with dataset similarity ≥ .5, as depicted in Fig. 7, account for approximate 1.88% of our random generations.

Note, however, that our search for replication in Stable Diffusion only covered the 12M images in the LAION Aesthetics v2 6+ dataset. The model was first trained on over 2 billion images, before being fine-tuned on the 600M LAION Aesthetics V2 5+ split. The dataset that we searched in our study is a small subset of this fine-tuning data, comprising less than 0.6% of the total training data.

Examples certainly exist of content replication from sources outside the 12M LAION Aesthetics v2 6+ split – see Fig 12. Furthermore, it is highly likely that replication exists that our retrieval method is unable to identify. For both of these reasons, the results here systematically underestimate the amount of replication in Stable Diffusion and other models.

well, not really gonna comment much on that. i agree, the presence of copies in the output shouldn't be ignored. regardless of my thoughts on copyright, i think it's a problem that your algorithm can so directly spit out a scraped image like that. to be fair, especially after reading the paper, i also think stable diffusion is kind of a shit model due to the pinterest-sludge nature of its dataset; can't build a house on sand. not that it can't be used to create good art, any tool can be useful in the right hands and so on.

of course, the artist's anxiety isn't really about whether or not their work can be plucked out of the dataset, or at least i don't think it should be. because there's always good old art theft for that? no need to go through all the extra effort. i guess the anxiety is mostly about style, subject matter, some intangible Vibe, some intangible So-And-So-Ness. i don't think the paper substantially addresses any of that, sadly. it'd be really interesting!

i think the paper stands on its own, kinda "beside" discussions of the art world. it has confirmed some of my feelings / hunches on stable diffusion, and the mismatched source-generated pairs are just fascinating, but um. what about The Disc Horse? my current thoughts on the matter are as follows:

if the output of a generator is effectively the same as the input (ie what's in the dataset), then it doesn't matter that it was "uncontrollable" or "accidental", it's a copy. claiming otherwise is just kinda silly, although the lines determining what exactly Is a copy get fuzzy. "i know it when i see it" etc just... complicates matters.

copyright / intellectual property law is Not Gonna Save Us. it's meant for the big corporations, it always has been. which means: a blanket tightening of ip laws can only fuck small artists over, rather than "protect us"... from what exactly? it certainly wouldn't stop the people making generative copies of your work from profiting.

"good artists copy, great artists steal" or whatever. i don't think art and ownership can quite live in the same house. it's complicated goddamit! it's a whole thing! so. i think it's fair to not want your work in a dataset. i think the way datasets currently function is quite fucked, i think the way research is mostly in the hands of for-profits is quite fucked, i think everything is quite fucked if you ask me.

there's bad art / bad artists everywhere. trying to come up with a definition of art that includes just the good stuff and excludes all the bad stuff is impossible and pointless, it's a lot more worthwhile to think about what a certain image / etc is doing, its themes and aesthetics and so on. i don't think "is the 'blonde woman, big boobs, trending on artstation' guy an artist?" is a fruitful discussion at all. it's so easy to ignore bad art, i think that's what we should be doing ^v^ (of course this is impossible if you get most of your art / artist interaction from the website that makes numbers go up if you're annoying and combative. just disengage instead of making a dozen posts and comics mocking them etc. god i hope twitter implodes and dies soon).

speaking of twitter, i think a lot of artists seem to equate clout / visibility with profit which... up to an extent i guess, if you rely on social media to get clients / freelance work but. hmmm... idk can a potential follower that didn't see your work / your profile (because they saw a repost of it instead) be considered lost money? can this be considered theft? it kinda sounds like "piracy costs corporations $12974 trillion in lost revenue every second"

uh yeah i think that was all for now? mostly wanted to share the paper i guess. might revisit in the future etc etc

