A team of more than two dozen AI researchers from MIT, Cornell University, the University of Toronto, and other institutions has trained a large language model using only data that was openly licensed or in the public domain, the Washington Post reports, providing a blueprint for ethically developing the technology.
But, as the creators readily admit, it was far from easy.
As they describe in a yet-to-be-peer-reviewed paper published this week, it quickly became apparent that the bottleneck wouldn't be computing power, but personpower.
That's because the text in the more than eight-terabyte dataset they assembled, which they're calling the Common Pile v0.1, had to be manually cleaned and reformatted to make it suitable for AI training, WaPo explains. Then there was the amazing a