AI development is moving at a rapid pace, but it risks running headlong into a wall. As websites increasingly place barriers on scraping (some of which are allegedly ignored), and as the remaining content is voraciously collected by scrapers to train AI models, concerns are growing that we may run out of usable training data.
The industry’s answer? Synthetic data.
“Recently in the industry, synthetic data has been talked about a lot,” said Sebastien Bubeck, a member of technical staff at OpenAI, in the company’s livestreamed release of GPT-5 last week. Bubeck stressed its importance for the future of AI models, an idea echoed by his boss, Sam Altman, who live-tweeted the event and said he was “excited for much more to come.”
The prospect of relying heavily on synthetic data hasn’t gone un