
Synthetic Data for AI Training: Mistake or Panacea?
Artificial intelligence has hit a ceiling: the supply of data available for training is finite and running out fast. In response, startups are turning to synthetic data, that is, information generated by other neural networks.
AI startup Anthropic used synthetic data to train one of its flagship models, Claude 3.5 Sonnet. Meta fine-tuned its Llama 3.1 models on AI-generated data, and OpenAI uses synthetic information to train o1, its “reasoning” model.
TechCrunch highlighted the advantages and disadvantages of this approach.
Annotation
Artificial intelligence systems are statistical machines: trained on many examples, they learn the patterns in that data in order to make predictions.

Annotations, the textual labels that indicate what data means or what its parts are, are a key element of those examples. They serve as guideposts, “teaching” a model to distinguish objects, places, and ideas.

For instance, if a neural network is shown many photographs of kitchens labeled with the word “kitchen,” it will eventually associate the label with typical features such as refrigerators and countertops. After training, the model can recognize a kitchen in a photo it has never seen before.

This only works if the annotations are correct. If images of kitchens are instead labeled “cow,” the model will learn to associate refrigerators with cows.
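To make the stakes of correct labeling concrete, here is a minimal sketch (our illustration, not from the article) using scikit-learn's bundled digits dataset: the same classifier is trained once on correct labels and once on shuffled labels, with predictably different results.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A small labeled dataset: images of handwritten digits with their annotations.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on correct annotations.
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("correct labels:", clf.score(X_test, y_test))   # typically ~0.95

# Train the same model on shuffled annotations (the "kitchen labeled as cow" case).
rng = np.random.default_rng(0)
bad = LogisticRegression(max_iter=5000).fit(X_train, rng.permutation(y_train))
print("shuffled labels:", bad.score(X_test, y_test))  # near chance, ~0.10
```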
The need for labeled data has created an entire market for annotation services, currently valued at $838.2 million and projected to reach $10.34 billion within the next decade.
Some labeling tasks require specialized knowledge, for example in mathematics, and entire firms now specialize in data annotation. Pay varies widely: some annotation work is well compensated, while workers in developing countries earn less than $2 per hour.
Replacing Humans
Human labelers are expensive to pay and prone to errors. The data itself can also be costly: Shutterstock charges AI vendors tens of millions of dollars for access to its archives, and Reddit has earned hundreds of millions licensing its data to Google, OpenAI, and others.

Data is also getting harder to obtain. More than 35% of the top 1,000 websites now block OpenAI's scraper. If the trend continues, developers could exhaust all publicly available training data between 2026 and 2032.

All of this, along with the risk of lawsuits over copyrighted material, has made generating synthetic data a necessity.
Synthetic Alternatives
If data is the new oil, synthetic data is pitched as a biofuel that can be produced without the negative externalities, observed Os Keyes, a PhD candidate at the University of Washington.
“You can take a small starter dataset and model and extrapolate new information from it,” he noted.
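As a deliberately simplified illustration of Keyes's point, the sketch below fits a tiny “model” (a bigram chain, standing in for the far larger generative models actually used) to a small starter corpus and extrapolates new, previously unseen sentences from it. The corpus and function names are our own, purely illustrative choices.

```python
import random
from collections import defaultdict

# A tiny "starter dataset" (illustrative; real pipelines use large LLMs, not bigrams).
seed_corpus = [
    "the model learns patterns from labeled examples",
    "the model predicts labels for unseen examples",
    "synthetic examples extend a small labeled dataset",
]

# "Train": collect bigram transitions from the seed text.
transitions = defaultdict(list)
for sentence in seed_corpus:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        transitions[a].append(b)

def generate(start="the", max_len=10):
    """Extrapolate a new synthetic sentence from the learned transitions."""
    out = [start]
    while len(out) < max_len and transitions[out[-1]]:
        out.append(random.choice(transitions[out[-1]]))
    return " ".join(out)

random.seed(1)
for _ in range(3):
    print(generate())  # novel word sequences not present verbatim in the seed
```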
The AI industry has embraced the technique. In December, Writer introduced Palmyra X 004, a model trained almost entirely on synthetic data; its development cost $700,000, versus the $4.6 million OpenAI reportedly spent on a comparably sized model.

Microsoft's open Phi models were partially trained on synthetic data, as was Google's Gemma. This summer, Nvidia introduced a family of models designed to generate synthetic training data, and AI startup Hugging Face released what it called the “largest” dataset of artificial text for AI fine-tuning.

Synthetic data generation has become a business in its own right, one expected to be worth $2.34 billion by 2030.
Synthetic Risks
Synthetic data carries risks of its own. If the information used to generate it is biased or limited, the output will inherit those flaws.

Overreliance on synthetic data during training reduces a model's quality and diversity, according to a study by researchers at Rice University and Stanford.

Large models like o1 can also produce hallucinations that are harder to detect, which lowers the accuracy of any AI trained on their output.

A study published in July showed that models trained on flawed data generate even more inaccurate information, creating a degradation loop for each subsequent generation of models. Eventually, such a model may give answers entirely unrelated to the question.
Another study demonstrated the same decline in model quality visually, using generated images as an example.
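The degradation loop is easy to reproduce in miniature. The toy simulation below (our sketch, not the cited studies' methodology) fits a Gaussian “model” to data, then trains each subsequent generation only on samples drawn from the previous one; with no fresh real data, the fitted spread tends to collapse, mirroring the loss of diversity the studies describe.

```python
import numpy as np

# Toy degradation loop: each generation of a "model" (here, a fitted Gaussian)
# is trained only on samples produced by the previous generation.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=20)  # small real dataset, std = 1.0

for generation in range(51):
    mu, sigma = data.mean(), data.std()
    if generation % 10 == 0:
        print(f"gen {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")
    # The next generation sees only synthetic samples from the fitted model.
    data = rng.normal(loc=mu, scale=sigma, size=20)

# With no real data re-injected, the fitted spread tends to drift toward zero:
# the model progressively "forgets" the tails of the original distribution.
```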
Luca Soldaini, a senior research scientist at the Allen Institute for AI, believes synthetic data should only be used if it is thoroughly checked, filtered, and compared against real information.

Skipping that step risks model collapse, in which a model becomes less “creative” and more biased in its outputs, ultimately losing much of its functionality.
“Synthetic data pipelines are not self-improving machines. Their results must be carefully checked and improved before being used for training,” he noted.
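One way to picture the checking and filtering Soldaini describes is a simple gate between generation and training. The function below is hypothetical (the checks and thresholds are our choices, not his): it drops synthetic examples that are too short, too long, duplicated, or verbatim copies of real data.

```python
def filter_synthetic(synthetic, real, min_len=5, max_len=200):
    """Keep synthetic text examples that pass simple quality checks.

    Hypothetical checks: length bounds, no verbatim copies of real data,
    and no duplicates within the synthetic set itself.
    """
    real_set = set(real)
    seen, kept = set(), []
    for text in synthetic:
        words = text.split()
        if not (min_len <= len(words) <= max_len):
            continue  # too short or too long to be useful
        if text in real_set or text in seen:
            continue  # a verbatim copy or duplicate adds no diversity
        seen.add(text)
        kept.append(text)
    return kept

real = ["the cat sat on the mat while rain fell outside"]
synthetic = [
    "the cat sat on the mat while rain fell outside",   # verbatim copy
    "cats often nap",                                   # too short
    "a cat curled up on a warm mat as rain fell softly against the window",
]
print(filter_synthetic(synthetic, real))  # only the last example survives
```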
OpenAI CEO Sam Altman has previously said that AI will someday produce synthetic data good enough to effectively train itself.
In December, OpenAI co-founder Ilya Sutskever predicted the end of the AI pre-training era and foresaw the emergence of superintelligence.