Data to train AI models is becoming restricted. Here's why.

A recent Data Provenance study reveals that online sources are increasingly restricting access to their data for AI model training purposes. This trend raises questions about the future development and capabilities of AI language models. Shayne Longpre and Robert Mahari, co-leads of the paper "Consent in Crisis: The Rapid Decline of the AI Data Commons," join Asking for a Trend to discuss the implications of this development.

Longpre notes that AI models require "very tremendous capabilities" to perform their tasks effectively. To reach that level of performance, these models rely heavily on web-based data, which provides the necessary scale of information. However, Longpre points to a growing challenge: access to this data is increasingly restricted. While these restrictions are "not currently legally enforceable," he suggests that ethical practices within the AI community have the potential to reshape data access policies.

In response to increasing data access restrictions, some companies have attempted to train AI models using outputs from other AI models. However, Mahari points out that this approach has proven to be "a challenge." He emphasizes that the issue goes beyond AI challenges and creates legal challenges as well. "Generally, AI providers restrict the use of their services to create competing services, so we're seeing some tensions there," Mahari explains.

Despite these obstacles, Mahari notes that companies are increasingly turning to synthetic data for training large language models.

For more expert insight and the latest market action, click here to watch this full episode of Asking for a Trend.

This post was written by Angel Smith