Content used for training large language models and other AIs is something I have written about before, including how to opt out of being crawled by AI bots. The New York Times has updated its Terms and Conditions to disallow such scraping, which I'll get back to in a moment.
It's an imperfect solution for many reasons, and as I wrote when covering opting out of AI bots, it seems backwards.
In my opinion, they should allow people to opt in rather than force this nonsense of going through the motions to protect one's content from being used as part of a training model.
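For those who do want to opt out today, the mechanism is mostly a site's robots.txt: you add a block for each AI crawler you know about, and hope the bot honors it. As a sketch, here's what such a file looks like and how it behaves, checked with Python's standard library. GPTBot (OpenAI) and CCBot (Common Crawl) are published crawler names; everything else here (the example URL, the file contents) is illustrative.

```python
from urllib import robotparser

# A robots.txt that blocks known AI crawlers while allowing everyone else.
# The opt-out only works for bots that choose to honor robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The AI crawlers are refused; an ordinary browser user-agent is not.
for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
    print(agent, "allowed:", parser.can_fetch(agent, "https://example.com/post"))
```

Which is exactly the "backwards" part: the burden is on every site owner to enumerate every bot, one user-agent at a time, and compliance is voluntary.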
Back to the New York Times.
…The New York Times updated its terms of services Aug. 3 to forbid the scraping of its content to train a machine learning or AI system.
The content includes but is not limited to text, photographs, images, illustrations, designs, audio clips, video clips, “look and feel” and metadata, including the party credited as the provider of such content.
The updated TOS also prohibits website crawlers, which let pages get indexed for search results, from using content to train LLMs or AI systems…
“The New York Times Updates Terms of Service to Prevent AI Scraping Its Content”, Trishla Ostwal, Adweek.com, August 10th, 2023.
This article was then referenced by The Verge, which added a little more value.
…The move could be in response to a recent update to Google’s privacy policy that discloses the search giant may collect public data from the web to train its various AI services, such as Bard or Cloud AI. Many large language models powering popular AI services like OpenAI’s ChatGPT are trained on vast datasets that could contain copyrighted or otherwise protected materials scraped from the web without the original creator’s permission…
“The New York Times prohibits using its content to train AI models”, Jess Weatherbed, TheVerge.com, August 14th, 2023.
That’s pretty interesting considering that Google and the New York Times updated their agreement on News and Innovation on February 6th, 2023.
This all falls into a greater context where many media organizations called for rules protecting copyright in data used to train generative AI models in a letter you can see here.
Where does that leave us little folk? Strategically, bloggers have been a thorn in the side of the media for a few decades, driving down costs for sometimes pretty good content. Blogging is the grey area of the media, and no one really seems to want to tackle that.
I should ask WordPress.com what their stance is. People on Medium and Substack should ask those platforms the same question.
Speaking for myself – if you want to use my content for your training model so that you can charge money for a service, hit me in the wallet – or hit the road.