If you've ever wondered whether that chatbot you're using has memorized the entire text of a particular book, answers are on the way. Computer scientists have developed a more effective way to coax memorized content from large language models, a development that may address regulatory concerns while helping to clarify copyright infringement claims arising from AI model training and inference.

Researchers affiliated with Carnegie Mellon University, Instituto Superior Técnico/INESC-ID, and AI security platform Hydrox AI describe their approach in a preprint paper titled "RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline."

The authors – André V. Duarte, Xuying Li, Bin Zeng, Arlindo L. Oliveira, Lei Li, and Zhuo Li – argue that the ongoing concerns about AI models being