Alternatively, they could train on synthetic data like summaries and QA pairs extracted from protected sources, so the model gets the ideas separated from their original expression. Since it never saw the originals, it can't regurgitate them.
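A minimal sketch of that pipeline, just to make the idea concrete. In practice the summaries and QA pairs would come from a "dirty" model that did see the originals; here a trivial extractive stand-in (`make_synthetic_record`, a hypothetical name) plays that role, and the key point is that only the derived record survives to the clean training set:

```python
# Toy sketch of the "clean via synthetic data" idea: derive a summary
# and a QA pair from a protected passage, then discard the passage so
# the clean model never sees the original expression.

def make_synthetic_record(passage: str) -> dict:
    """Derive a summary and QA pair; the passage itself is not kept."""
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    summary = sentences[0] + "."  # crude extractive stand-in for a real model
    qa = {
        "question": "What is the main claim of the passage?",
        "answer": summary,
    }
    # Only the derived record goes into the clean training set.
    return {"summary": summary, "qa": qa}

record = make_synthetic_record(
    "Copyright grants a temporary monopoly. It exists to incentivise creation."
)
print(record["summary"])
```

The first-sentence "summary" here is obviously a placeholder; the structural point is just that the clean model trains on `record`, not on `passage`.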
However, besides the fact that this sidesteps the issue of current copyright law violating the constitutional rights of US citizens, I imagine there is a very real risk that the clean model loses the fidelity of insight the dirty model develops by having access to the base training data.
I think most people must sidestep this, as it's the first I've heard of it! Which right do you think is being violated, and how?
Copyright is not “you own this forever because you deserve it”, copyright is “we’ll give you a temporary monopoly on copying to give you an incentive to create”. It’s transactional in nature. You create for society, society rewards you by giving you commercial leverage for a while.
Repeatedly extending copyright durations from the original 14+14 years to durations that outlast everybody alive today might technically be "limited times", but it obviously violates the spirit of the law and undermines its goal. The goal was to incentivise people to create, and being able to have one hit that you can live off for the rest of your life is the opposite of that. Copyright durations need to be shorter than a typical career so that the incentive for creators to keep creating for a living remains and the purpose of copyright is fulfilled.
In the context of large language models, if anybody successfully uses copyright to stop large language models from learning from books, that seems like a clear subversion of the law – it’s stopping “the progress of science and useful arts” not promoting it.
(To be clear, I’m not referring to memorisation and regurgitation like the examples in this paper, but rather the more commonplace “we trained on a zillion books and now it knows how language works and facts about the world”.)
It's late so I don't feel like repeating it all here, but I definitely recommend searching for Doctorow's thoughts on the DMCA, DRM and copyright law in general as a good starting point.
But generally, the idea that people are not allowed to freely manipulate and share data that belongs to them is patently absurd and has been a large topic of discussion for decades.
You've probably at least been exposed to how copyright law benefits corporations such as Disney, and private equity, much more than it benefits you or me. And how copyright law has been extended over and over by entities like Disney just to delay their beloved golden geese from entering the public domain as long as possible; far, far longer than intended by the original spirit of the copyright act.
Training on synthetic data is interesting, but how do you generate the synthetic data? Is it turtles all the way down?
In that case the model would lose the ability to provide relatively brief quotes from copyrighted sources in its answers, which is a really helpful feature when doing research. A brief quote from a copyrighted text, particularly for a transformative purpose like commentary, is perfectly fine under copyright law as fair use.