
Data Laundering: How Stability AI managed to get millions of copyrighted art works without paying artists

August 22, 2023 · musaiio

Understanding the very creative ways tech companies get around copyright to train large models.

In my collaboration with The Gradient, an amazing publication started at Stanford, I wrote the article "Artists enable AI art – shouldn't they be compensated?" In terms of audience numbers, we've come a long way since then, and many newer readers are unaware of the ideas discussed in that piece. Here, I will revisit one of its major ideas: data laundering, and how tech companies use it to get around copyright when building their large datasets (quoting from the original where relevant). We have seen many companies and organizations build their own language models, and given the trend toward multi-modality, we will likely see more of them use such techniques to work around copyright (I'm sure many already have). Understanding how this works is important for building better regulations and procedures that ensure everyone is paid fairly.

Key Highlights

What is Data Laundering: Data laundering is the conversion of stolen data so that it may be sold or used by ostensibly legitimate databases. "As with other forms of data theft, data harvested from hacked databases is sold on darknet sites. However, instead of selling to identity thieves and fraudsters, data is sold into legitimate competitive intelligence and market research channels." (Source: ZDNet, "Cyber-criminals boost sales through 'data laundering'".)

How Stability AI got around artist copyright: In the case of Stability AI and AI art, the process plays out like this:

1. Create or fund a non-profit entity to create the datasets for you. The non-profit, research-oriented nature of these entities allows them to use copyrighted material more easily.

2. Then use this dataset to create commercial products, without offering any compensation for the use of copyrighted material.

Think I'm making things up? Consider Stable Diffusion, Stability's text-to-image generator. Who created it? Many people assume it was Stability AI; they're wrong. It was created by researchers at the Ludwig Maximilian University of Munich, with a compute donation from Stability. Look at the Stable Diffusion GitHub repository to see for yourself:

Stable Diffusion is a latent text-to-image diffusion model. Thanks to a generous compute donation from Stability AI and support from LAION, we were able to train a Latent Diffusion Model on 512×512 images from a subset of the LAION-5B database.

So the non-profit created the dataset/model, and the company then worked to monetize it. As noted in AI Data Laundering: How Academic and Nonprofit Researchers Shield Tech Companies from Accountability:

“A federal court could find that the data collection and model training was infringing copyright, but because it was conducted by a university and a nonprofit, falls under fair use…

Meanwhile, a company like Stability AI would be free to commercialize that research in their own DreamStudio product, or however else they choose, taking credit for its success to raise a rumored $100M funding round at a valuation upwards of $1 billion, while shifting any questions around privacy or copyright onto the academic/nonprofit entities they funded.”

To their credit, Stability has started trying to use more licensed datasets since the release of Stable Diffusion v1. But they are far from the only ones relying on such tactics.
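To make "using licensed data" concrete, here is a minimal sketch of one possible first step: filtering a LAION-style metadata file down to permissively licensed entries before any images are downloaded. The file path, column names ("URL", "TEXT", "LICENSE"), and license tags are assumptions for illustration, not the real LAION-5B schema.

```python
import pandas as pd

# Hypothetical LAION-style metadata file; path and columns are assumed.
meta = pd.read_parquet("laion_subset_metadata.parquet")

# Keep only rows whose license tag is on a permissive allow-list
# (these tag values are illustrative, not LAION's actual labels).
PERMISSIVE = {"CC0", "CC-BY", "CC-BY-SA"}
licensed = meta[meta["LICENSE"].isin(PERMISSIVE)]

print(f"kept {len(licensed)} of {len(meta)} rows")
licensed[["URL", "TEXT", "LICENSE"]].to_parquet("laion_subset_licensed.parquet")
```

A filter like this still needs human review, but it shifts the default from "scrape everything" to "include only what you can license".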

Why this is not Fair Use: In traditional fair use, attribution is given to the original creator. Using work without attribution has another, less flattering name: plagiarism. Adjusting models to acknowledge their "sources" would be an imperfect but great start (one possible mechanism is sketched below). My article with The Gradient discusses how that can be implemented for AI art (since then, various organizations have done something similar, so yippee!). However, we need to look at extending that beyond AI art (the way DeepMind trained Gato might provide some promising insights). I plan to do a detailed dive into multi-modal joint embeddings soon.
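Here is a minimal sketch of one way attribution could work (an illustration, not the method from my Gradient article): embed every training image and each generated image into a shared vector space with a CLIP-style encoder, then surface the nearest training images as candidate "sources". The embeddings and URLs below are random stand-ins; a real system would run an actual image encoder over the training set.

```python
import numpy as np

# Stand-in embeddings: in a real system these would come from an image
# encoder (e.g. CLIP) run over every training image and the generation.
rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(10_000, 512))
train_urls = [f"https://example.com/img_{i}.jpg" for i in range(10_000)]  # hypothetical
generated = rng.normal(size=512)

def top_k_sources(query, bank, urls, k=5):
    """Rank training images by cosine similarity to the generated image."""
    bank_norm = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    sims = bank_norm @ query_norm
    top = np.argsort(-sims)[:k]
    return [(urls[i], float(sims[i])) for i in top]

# Surface the most similar training images as candidate "sources".
for url, score in top_k_sources(generated, train_embeddings, train_urls):
    print(f"{score:.3f}  {url}")
```

Nearest-neighbor lookups like this are imperfect (similarity is not provenance), but they offer a tractable starting point for surfacing whose work a generation leans on.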

What can be done until then: Until we come up with refined attribution systems, I'm going to suggest something heretical: just pay the damn people whose work you use to create solutions. We already have writers striking against Hollywood's exploitation. In our sister publication, AI Made Simple, I've covered multiple pieces of research showing that high-quality data offers the highest ROI for developing performant models (for the most recent example, see my breakdown of the excellent LIMA paper).

Antagonizing the people whose work provides us with high-quality data just sounds like a very stupid business decision. As Adam Smith wrote in his seminal work, An Inquiry into the Nature and Causes of the Wealth of Nations (a very interesting and surprisingly insightful book, fyi), wealth creation is not a zero-sum game. We can all eat.
