Wikipedia, and Its Trouble with LLMs.

Wikipedia, a wonderful resource despite all the drama that comes with the accumulation of content, is having some trouble dealing with the large language model (LLM) AIs out there. There are two core problems – the input and the output.

“…The current draft policy notes that anyone unfamiliar with the risks of large language models should avoid using them to create Wikipedia content, because it can open the Wikimedia Foundation up to libel suits and copyright violations—both of which the nonprofit gets protections from but the Wikipedia volunteers do not. These large language models also contain implicit biases, which often result in content skewed against marginalized and underrepresented groups of people.

The community is also divided on whether large language models should be allowed to train on Wikipedia content. While open access is a cornerstone of Wikipedia’s design principles, some worry the unrestricted scraping of internet data allows AI companies like OpenAI to exploit the open web to create closed commercial datasets for their models. This is especially a problem if the Wikipedia content itself is AI-generated, creating a feedback loop of potentially biased information, if left unchecked…” 

Claire Woodcock, “AI Is Tearing Wikipedia Apart”, Vice.com, May 2nd, 2023.

The Input into Wikipedia.

Inheriting the legal troubles of companies that built AI models by taking shortcuts seems like a pretty stupid thing to do, but there are companies and individuals doing it anyway. Fortunately, the Wikimedia Foundation is a bit more responsible, and more sensitive to biases.

Using an LLM to generate content for Wikipedia is simply a bad idea. There are some tools out there (I wrote about Perplexity.ai recently) that do the legwork for citations, but with Wikipedia, not all citations are necessarily on the Internet. Some are in books, those dusty tomes through which we have passed down knowledge over the centuries, so it takes humans not just to find those citations, but to assess them and ensure that citations representing other perspectives are included¹.

As the article mentions, first drafts are not a bad idea, but they’re not a great idea either. If you’re not invested enough in a topic to do the actual reading, should you really be editing a community encyclopedia? I don’t think so. Research is an important part of any accumulation of knowledge, and LLMs aren’t even good shortcuts, probably because the companies behind them took shortcuts.

The Output of Wikipedia.

I’m a little shocked that Wikipedia might not have been scraped by the companies that own LLMs, considering just how much they scraped and from whom. Wikipedia, to me, would have been one of the first things to scrape to build a learning model, as would Project Gutenberg. Now that they’ve had the leash yanked, maybe they’re asking for permission, but it seems peculiar that they would not have scraped that content in the first place.

Yet, unlike companies that simply cash in on the work of volunteers, like Huffington Post, StackOverflow, and so on, Wikimedia has a higher calling – and cashing in on volunteer work would likely mean fewer volunteers. Every volunteer has their own reasons, but in an organization they collectively work toward something. The Creative Commons license Wikipedia uses requires attribution, and LLMs don’t attribute anything. I can’t even get ChatGPT to tell me how many books it’s ‘read’.

What makes this simple is that if all the volunteer work from Wikipedia is shoved into the intake manifold of an LLM, and that LLM is subscription-based, then volunteers would have to pay to use the product of their own work. That’s a non-starter.

We All Like The Idea of an AI.

Generally speaking, the idea of an AI being useful for so many things is seductive, from Star Trek to Star Wars. I wouldn’t mind an Astromech droid, but where science fiction meets reality, we are stuck with the informational economy and infrastructure we have inherited over the centuries. Certainly, it needs to be adapted, but there are practical things that need to be considered outside of the bubbles that a few billionaires seem to live in.

Taking the works of volunteers and works from the public domain² to turn around and sell them sounds very Disney in nature, yet Mickey Mouse’s fingerprints on the Copyright Act have, ironically, helped push back legally against that sort of claim. Somewhere, there is a very confused mouse.

  1. Honestly, I’d love a job like that, buried in books.
  2. Disney started off by taking public domain works and copyrighting their renditions of them, which was fine, but then they made sure no one else could do it – thus the ‘fingerprints’.

Wikipedia, AI, Oh My.

One of the most disruptive things to happen during Web 2.0 is Wikipedia – displacing the Encyclopedia Britannica as an online resource, forging strategic partnerships, and – for better and worse – building an editorial community.

It has become one of the more dependable sources of information on the Internet, and while it is imperfect, its editors have collectively been part of an evolution of verification and quality control that has made Wikipedia a staple.

It apparently has also been part of the training data of the large language models we have come to know over the past months, such as ChatGPT and Google’s Bard, which is interesting given how much volunteer work went into creating Wikipedia – something that makes me wonder whether Wikimedia could be a party to a lawsuit.

This is pure speculation on my part, but given how much collective effort has gone into the many projects of Wikimedia, and given that its mission is pretty clear about bringing free educational content to the world, large language models charging subscribers for that content is something that might be worth a bit of thought.

On a conference call in March that focused on A.I.’s threats to Wikipedia, as well as the potential benefits, the editors’ hopes contended with anxiety. While some participants seemed confident that generative A.I. tools would soon help expand Wikipedia’s articles and global reach, others worried about whether users would increasingly choose ChatGPT — fast, fluent, seemingly oracular — over a wonky entry from Wikipedia. A main concern among the editors was how Wikipedians could defend themselves from such a threatening technological interloper. And some worried about whether the digital realm had reached a point where their own organization — especially in its striving for accuracy and truthfulness — was being threatened by a type of intelligence that was both factually unreliable and hard to contain.

One conclusion from the conference call was clear enough: We want a world in which knowledge is created by humans. But is it already too late for that?

John Gertner, “Wikipedia’s Moment of Truth”, New York Times Magazine, July 18th, 2023, updated July 19th, 2023.

It is a quandary, that’s for sure. Speaking for myself, I prefer having citations on a Wikipedia page that I can research on my own – there seem to be at least some of us who trample our way through footnotes – and large language models don’t cite anything, which is the main problem I have with them.

In the facts category, I would say Wikipedia should win.

Unfortunately, time and again, the world has demonstrated that facts are sometimes a liability for selling a story, and so the concern I have is real.

Yet it could be useful to combine the two somehow.

Normalized Vice, AI-pedia?

When I read “AI Is Tearing Wikipedia Apart”, I immediately recalled all the personal issues I had with the never-to-return-because-I-said-so page on myself. It’s a long and involved story, but the short version is about dealing with some pretty different ways we all think of Wikipedia, and the different sects of volunteers involved. Yes, there are sects, and I had a run-in with the deletionist sect because of a profile I didn’t create, but some journalist had.
It’s not pretty when you set that many people loose organizing that much information on a volunteer basis. When Jimmy Wales and I shared the same geography, we planned to get coffee sometime, but we were both too busy to do it. I mentioned the page issue to him, and he rightly said something to the effect that it was for the editors to deal with. It was personal for me (how can a Wikipedia page about yourself not be?), and what I did influence were some new rules on dealing with biographies of living people.

But yes, Wikipedia using a large language model? The biases… well, that’s just a headache to discuss. I posted the article on my personal Facebook page, where I have a few friends who are editors at Wikipedia, and they didn’t bite. One person did, however, point out that Vice.com, the publisher of that article, is headed for bankruptcy.

The normalization of Web 2.0 coinciding with the new disruption of large language models reminds me of dominoes toppling onto each other. That’s an interesting and peculiar twist.

An ebb of disruption, a new wave of disruption. Much of tech isn’t about tech.