Wikipedia, and Its Trouble with LLMs.

Wikipedia, a wonderful resource despite all the drama that comes with the accumulation of content, is having some trouble dealing with the large language models (LLMs) out there. There are two core problems – the input, and the output.

“…The current draft policy notes that anyone unfamiliar with the risks of large language models should avoid using them to create Wikipedia content, because it can open the Wikimedia Foundation up to libel suits and copyright violations—both of which the nonprofit gets protections from but the Wikipedia volunteers do not. These large language models also contain implicit biases, which often result in content skewed against marginalized and underrepresented groups of people

The community is also divided on whether large language models should be allowed to train on Wikipedia content. While open access is a cornerstone of Wikipedia’s design principles, some worry the unrestricted scraping of internet data allows AI companies like OpenAI to exploit the open web to create closed commercial datasets for their models. This is especially a problem if the Wikipedia content itself is AI-generated, creating a feedback loop of potentially biased information, if left unchecked…” 

“AI Is Tearing Wikipedia Apart“, Claire Woodcock, Vice.com, May 2nd, 2023.

The Input into Wikipedia.

Inheriting the legal troubles of companies that built AI models by taking shortcuts seems like a pretty stupid thing to do, but there are companies and individuals doing it. Fortunately, the Wikimedia Foundation is a bit more responsible, and is more sensitive to biases.

Using an LLM to generate content for Wikipedia is simply a bad idea. There are some tools out there (I wrote about Perplexity.ai recently) that do the legwork for citations, but with Wikipedia, not all citations are necessarily on the Internet. Some are in books, those dusty tomes where we have passed down knowledge over the centuries, and so it takes humans to not just find those citations, but to assess them and ensure that citations of other perspectives are involved1.

As they mention in the article, first drafts are not a bad idea, but they’re also not a great idea. If you’re not invested enough in a topic to do the actual reading, should you really be editing a community encyclopedia? I don’t think so. Research is an important part of any accumulation of knowledge, and LLMs aren’t even good shortcuts, probably because the companies behind them took shortcuts.

The Output of Wikipedia.

I’m a little shocked that Wikipedia might not have been scraped by the companies that own LLMs, considering just how much they scraped and from whom. Wikipedia, to me, would have been one of the first things to scrape to build a learning model, as would have been Project Gutenberg. Now that they’ve had the leash yanked, maybe they’re asking for permission, but it seems peculiar that they would not have scraped that content in the first place.

Yet, unlike companies that simply cash in on the work of volunteers, like Huffington Post, StackOverflow, and so on, Wikimedia has a higher calling – and cashing in on volunteer works would likely result in fewer volunteers. Any volunteer does so for their own reasons, but in an organization they collectively work toward something. The Creative Commons licensing Wikipedia uses requires attribution, and LLMs don’t attribute anything. I can’t even get ChatGPT to tell me how many books it’s ‘read’.

What makes this simple is that if all the volunteer work from Wikipedia is shoved into the intake manifold of an LLM, and that LLM is subscription based, and volunteers would have to pay to use it, it’s a non-starter.

We All Like The Idea of an AI.

Generally speaking, the idea of an AI being useful for so many things is seductive, from Star Trek to Star Wars. I wouldn’t mind an Astromech droid, but where science fiction meets reality, we are stuck with the informational economy and infrastructure we have inherited over the centuries. Certainly, it needs to be adapted, but there are practical things that need to be considered outside of the bubbles that a few billionaires seem to live in.

Taking the works of volunteers and works from the public domain2 to turn around and sell them sounds Disney in nature, yet Mickey Mouse’s fingerprints on the Copyright Act have helped push back legally on the claims of copyright. Somewhere, there is a very confused mouse.

  1. Honestly, I’d love a job like that, buried in books. ↩︎
  2. Disney started off by taking public domain works and copyrighting their renditions of them, which was fine, but then they made sure no one else could do it – thus the ‘fingerprints’. ↩︎

Free Book Resources (Digital, Audio).

Often in conversation or on social media, I’ll end up pointing out places where you can get books legally and at no cost. Today’s as good a day as any to point people at them.

First and foremost, I have to point to Project Gutenberg, which has been around for over 50 years (since before the Internet), and provides books in the public domain. You can read the books online (HTML), or in readers (Kindle & EPUB versions). I’ve been using Project Gutenberg since the early 2000s, and have found some great books in there for research or fun (which admittedly are the same for me).

To get an idea of what Gutenberg has – it can be intimidating – take a look at what has been downloaded most in a day.

Next – Wikibooks! There are over 3,500 open content textbooks available through the site. I’ll often head over there when I’m researching a topic.

FreeComputerBooks.com is exactly what it says, if you’re looking for online books related to many aspects of computing – I don’t use them too much, but I browse now and then.

Lastly for this post, I’ll point you at OpenCulture.com’s list of free audiobooks. I’m not someone who appreciates books in audio form too much (or podcasts, for that matter), so I haven’t used them – but the list includes many genres. The science fiction genre popped out immediately for me.

Happy reading!