The Future of AI Relies on a High School Teacher’s Free Database

(Bloomberg) -- In front of a suburban house on the outskirts of the northern German city of Hamburg, a single word — “LAION” — is scrawled in pencil across a mailbox. It’s the only indication that the home belongs to the person behind a massive data-gathering effort central to the artificial intelligence boom that has seized the world’s attention.

That person is high school teacher Christoph Schuhmann, and LAION, short for “Large-scale Artificial Intelligence Open Network,” is his passion project. When Schuhmann isn’t teaching physics and computer science to German teens, he works with a small team of volunteers building the world’s biggest free AI training dataset, which has already been used in text-to-image generators such as Google’s Imagen and Stable Diffusion.

Databases like LAION are central to AI text-to-image generators, which rely on them for the enormous amounts of visual material used to deconstruct and create new images. The debut of these products late last year was a paradigm-shifting event: it sent the tech sector’s AI arms race into hyperdrive and raised myriad ethical and legal issues. Within a matter of months, lawsuits had been filed against the generative AI companies Stability AI and Midjourney for copyright infringement, and critics were sounding the alarm about the violent, sexualized, and otherwise problematic images in their training datasets, which they say introduce biases that are nearly impossible to mitigate.

But these aren’t Schuhmann’s concerns. He just wants to set the data free.

Large Language

The 40-year-old teacher and trained actor helped found LAION two years ago after hanging out on a Discord server for AI enthusiasts. The first iteration of OpenAI’s DALL-E, a deep learning model that generates digital images from language prompts — say, creating an image of a pink chicken sitting on a sofa in response to such a request — had just been released, and Schuhmann was both inspired and concerned that it would encourage big tech companies to make more data proprietary.

“I instantly understood that if this is centralized to one, two or three companies, it will have really bad effects for society,” Schuhmann said.

In response, he and other members of the server decided to create an open-source dataset to help train text-to-image diffusion models, a months-long training process similar to teaching someone a foreign language with millions of flash cards. The group used raw HTML code collected by the California nonprofit Common Crawl to locate images around the web and associate them with descriptive text, with no manual human curation involved.
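The article doesn’t show LAION’s code, but the core idea it describes, pairing each web image with the descriptive text published alongside it, can be sketched by scanning raw HTML for img tags and keeping their src and alt attributes. Below is a minimal illustration under that assumption; the parser class, the sample page, and the alt-length filter are invented for the example, not taken from LAION’s actual pipeline.

    from html.parser import HTMLParser

    class ImageAltPairExtractor(HTMLParser):
        """Collects (image URL, alt text) pairs from <img> tags in raw HTML."""
        def __init__(self):
            super().__init__()
            self.pairs = []

        def handle_starttag(self, tag, attrs):
            if tag != "img":
                return
            attrs = dict(attrs)
            src = attrs.get("src")
            alt = attrs.get("alt")
            # Keep only images that arrive with non-trivial descriptive text;
            # the 5-character threshold is an arbitrary stand-in for real filtering.
            if src and alt and len(alt.strip()) > 5:
                self.pairs.append((src, alt.strip()))

    # Hypothetical page standing in for HTML pulled from a Common Crawl archive.
    sample_html = (
        '<html><body>'
        '<img src="https://example.com/chicken.jpg" alt="a pink chicken sitting on a sofa">'
        '<img src="https://example.com/spacer.gif" alt="">'
        '</body></html>'
    )

    extractor = ImageAltPairExtractor()
    extractor.feed(sample_html)
    print(extractor.pairs)
    # [('https://example.com/chicken.jpg', 'a pink chicken sitting on a sofa')]

At web scale the same idea runs over billions of pages, with automated filters in place of the toy threshold above; per the article, none of it involves manual curation.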