It’s no secret that the generative, sequacious artificial intelligences out there have copyright issues. I’ve written about it myself quite a bit.
It’s almost become cliche to mention copyright and AI in the same sentence, with Sam Altman having said that there would be no way to do generative AI without all that material – toward the end of this post, you’ll see that someone proved that wrong.
“Copyright Wars pt. 2: AI vs the Public“, by Toni Aittoniemi in January of 2023, is a really good read on the problem as the large AI companies have sucked in content without permission. If an individual did it, the large companies doing it would call it ‘piracy’, but now, it’s… not? That’s crazy.
The timing of me finding Toni on Mastodon was perfect. Yesterday, I found a story on Wired that demonstrates some of what Toni wrote last year, where he posed a potential way to handle the legal dilemmas surrounding creator’s rights – we call it ‘copyright’ because someone was pretty unimaginative and pushed two words together for only one meaning.
In 2023, OpenAI told the UK parliament that it was “impossible” to train leading AI models without using copyrighted materials. It’s a popular stance in the AI world, where OpenAI and other leading players have used materials slurped up online to train the models powering chatbots and image generators, triggering a wave of lawsuits alleging copyright infringement.
“Here’s Proof You Can Train an AI Model Without Slurping Copyrighted Content“, Kate Knibbs, Wired.com, March 20th, 2024
Two announcements Wednesday offer evidence that large language models can in fact be trained without the permissionless use of copyrighted materials.
A group of researchers backed by the French government have released what is thought to be the largest AI training dataset composed entirely of text that is in the public domain. And the nonprofit Fairly Trained announced that it has awarded its first certification for a large language model built without copyright infringement, showing that technology like that behind ChatGPT can be built in a different way to the AI industry’s contentious norm.
“There’s no fundamental reason why someone couldn’t train an LLM fairly,” says Ed Newton-Rex, CEO of Fairly Trained. He founded the nonprofit in January 2024 after quitting his executive role at image-generation startup Stability AI because he disagreed with its policy of scraping content without permission….
It struck me yesterday that a lot of us writing and communicating about the copyright issue didn’t address how it could be handled. It’s not that we don’t know that it couldn’t be handled, it’s just that we haven’t addressed it as much as we should. I went to sleep considering that and in the morning found that Toni had done much of the legwork.
What Toni wrote extends on the system:
…Any training database used to create any commercial AI model should be legally bound to contain an identity that can be linked to a real-world person if so required. This should extend to databases already used to train existing AI’s that do not yet have capabilities to report their sources. This works in two ways to better allow us to integrate AI in our legal frameworks: Firstly, we allow the judicial system to work it’s way with handling the human side of the equation instead of concentrating on mere technological tidbits. Secondly, a requirement of openness will guarantee researches to identify and question the providers of these technologies on issues of equality, fairness or bias in the training data. Creation of new judicial experts at this field will certainly be required from the public sector…
“Copyright Wars pt. 2: AI vs the Public”, Toni Aittoniemi, Gimulnautti, January 13th, 2023.
This is sort of like – and it’s my interpretation – a tokenized citation system built into a system. This would expand on what, as an example, Perplexity AI does by allowing style and ideas to have provenance.
This is some great food for thought for the weekend.
One thought on “Copyright, AI, And, It Doing It Ethically.”