The hidden reason AI costs are soaring—and it’s not because Nvidia chips are more expensive

Building today’s massive AI models can cost hundreds of millions of dollars, with projections suggesting it could hit a staggering billion dollars within a few years. Much of that expense is for computing power from specialized chips—typically Nvidia GPUs, of which tens of thousands may be required, costing as much as $30,000 each.

But companies training AI models, or fine-tuning existing models to improve performance on specific tasks, also struggle with another often overlooked and rising cost: data labeling. This is a painstaking process in which the data used to train generative AI models is affixed with tags so that the model can recognize and interpret patterns.

Data labeling has long been used to develop AI models for self-driving cars, for example. A camera captures images of pedestrians, street signs, cars, and traffic lights, and human annotators label the images with words like “pedestrian,” “truck,” or “stop sign.” The labor-intensive process has also raised ethics concerns. After releasing ChatGPT in 2022, OpenAI was widely criticized for relying on outsourced workers in Kenya, who earned less than $2 an hour, to do the data labeling that helped make the chatbot less toxic.

Today’s generic large language models (LLMs) go through an exercise related to data labeling called reinforcement learning from human feedback (RLHF), in which humans provide qualitative feedback or rankings on what the model produces. That is one significant source of rising costs, as is the effort involved in labeling private data that companies want to incorporate into their AI models, such as customer information or internal corporate data.

In addition, labeling highly technical, expert-level data in fields like law, finance, and healthcare is driving up expenses. That’s because some companies are hiring high-cost doctors, lawyers, PhDs, and scientists to label certain data, or outsourcing the work to third-party companies such as Scale AI, which recently secured a jaw-dropping $1 billion in funding as its CEO predicted strong revenue growth by year-end.

“You now need a lawyer to label stuff, [which is] a crazy use of legal hours,” said William Falcon, CEO of AI development platform Lightning AI. “Anything high stakes” requires expert-level labeling, he explained. “A chat with a ‘virtual BFF’ is not high stakes, providing legal advice is.”

Alex Ratner, CEO of data labeling startup Snorkel AI, says corporate customers can spend millions of dollars on data labeling and other data tasks, which can eat up 80% of their time and AI budget. Over time, data must also be relabeled to remain up to date, he added.

Matt Shumer, CEO and cofounder of AI assistant startup Otherside AI, agreed that fine-tuning LLMs has gotten expensive. “Over the past couple of years, we’ve gone from middle-school-level data being okay to needing high school, college, and now expert,” he said. “That obviously doesn’t come cheap.”

That can create budget woes for tech startups building in important areas like healthcare. Neal Shah, CEO of CareYaya, a platform for elder caregivers, says his company received a grant from Johns Hopkins University to build “the world’s first AI caregiver trainer for dementia patients,” but that data labeling costs are “eating us alive.” The cost, he said, has skyrocketed 40% over the past year because of the specialized information needed from gerontologists, dementia experts, and veteran caregivers. He’s working to reduce those costs by enlisting healthcare students and college professors to do the labeling.

Bob Rogers, CEO of Oii.ai, a data science company specializing in supply chain modeling, said he has seen data labeling projects that cost millions. Platforms like BeeKeeper AI, he said, can help lower costs by letting multiple companies share experts, data, and algorithms without exposing their private data to the others.

Kjell Carlsson, head of AI strategy at Domino Data Lab, added that some companies are lowering costs by using “synthetic” data (data generated by the AI itself) to at least partially automate data collection and labeling. In some cases, models can fully automate data labeling. For example, biopharma companies are training generative AI models to develop synthetic proteins for conditions like colorectal cancer, diabetes, and heart disease. The companies automatically conduct experiments based on the outputs of generative AI models, and those experiments provide new training data that comes with labels.

The bottom line, however, is that data labeling, while costly and time-intensive, may be well worth it. “Data labeling’s a beast,” said CareYaya’s Shah. “But the potential payoff is massive.”

Sharon Goldman

This story was originally featured on Fortune.com
