Content used for training large language models and other AIs is something I have written about before, including how to opt out of being crawled by AI bots. The New York Times has updated its Terms and Conditions to disallow such scraping, which I'll get back to in a moment.
It's an imperfect solution for many reasons, and as I wrote when covering opting out of AI bots, it seems backwards.
In my opinion, they should allow people to opt in rather than force this nonsense of going through the motions to protect one's content from being used as part of a training model.
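For those who do want to opt out today, the mechanism is mostly a site's robots.txt: you add a block for each AI crawler you know about, and hope the bot honors it. As a sketch, here's what such a file looks like and how it behaves, checked with Python's standard library. GPTBot (OpenAI) and CCBot (Common Crawl) are published crawler names; everything else here (the example URL, the file contents) is illustrative.

```python
from urllib import robotparser

# A robots.txt that blocks known AI crawlers while allowing everyone else.
# The opt-out only works for bots that choose to honor robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The AI crawlers are refused; an ordinary browser user-agent is not.
for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
    print(agent, "allowed:", parser.can_fetch(agent, "https://example.com/post"))
```

Which is exactly the "backwards" part: the burden is on every site owner to enumerate every bot, one user-agent at a time, and compliance is voluntary.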
Back to the New York Times.
…The New York Times updated its terms of services Aug. 3 to forbid the scraping of its content to train a machine learning or AI system.
The content includes but is not limited to text, photographs, images, illustrations, designs, audio clips, video clips, “look and feel” and metadata, including the party credited as the provider of such content.
The updated TOS also prohibits website crawlers, which let pages get indexed for search results, from using content to train LLMs or AI systems…
“The New York Times Updates Terms of Service to Prevent AI Scraping Its Content”, Trishla Ostwal, Adweek.com, August 10th, 2023.
This article was then referenced by The Verge, which added a little more value.
…The move could be in response to a recent update to Google’s privacy policy that discloses the search giant may collect public data from the web to train its various AI services, such as Bard or Cloud AI. Many large language models powering popular AI services like OpenAI’s ChatGPT are trained on vast datasets that could contain copyrighted or otherwise protected materials scraped from the web without the original creator’s permission…
“The New York Times prohibits using its content to train AI models”, Jess Weatherbed, TheVerge.com, August 14th, 2023.
That’s pretty interesting considering that Google and the New York Times updated their agreement on News and Innovation on February 6th, 2023.
This all falls into a greater context where many media organizations called for rules protecting copyright in data used to train generative AI models in a letter you can see here.
Where does that leave us little folk? Strategically, bloggers have been a thorn in the side of the media for a few decades, driving down costs for sometimes pretty good content. Blogging is the grey area of the media, and no one really seems to want to tackle that.
I should ask WordPress.com what their stance is. People on Medium and Substack should ask those platforms the same question.
Speaking for myself – if you want to use my content for your training model so that you can charge money for a service, hit me in the wallet – or hit the road.