When You Can’t Trust Voices.

Generative AI lets people do all sorts of things, including imitating voices we have come to respect and trust over the years. In the most recent case, Sir David Attenborough objects strongly, finding the imitation of his voice ‘profoundly disturbing’.

His voice is being used in all manner of ways.

It wasn’t long ago that Scarlett Johansson suffered a similar insult, one that was quickly ‘disappeared’.

The difference here is that a man who has spent decades showing people the natural world now has his voice used in disingenuous ways, and it should give us all pause. I use generative artificial intelligence, as do many others, but I would never consider presenting what I write or work on in someone else’s voice.

Who would do that? Why? It dilutes the trust those voices carry. Sure, it can be funny to have a narration by someone like Sir David Attenborough, or Morgan Freeman, or… all manner of people… but to trot out their voices to misrepresent truth is a very grey area in an era of half-truths and outright lies being distributed on the Babel of the Internet.

Somewhere – I believe it was in Lessig’s ‘Free Culture’ – I read that the UK allows artists to control how their works are used. A quick search turned this up:

The Copyright, Designs and Patents Act 1988, is the current UK copyright law. It gives the creators of literary, dramatic, musical and artistic works the right to control the ways in which their material may be used. The rights cover: Broadcast and public performance, copying, adapting, issuing, renting and lending copies to the public. In many cases, the creator will also have the right to be identified as the author and to object to distortions of his work.

The UK Copyright Service

It would seem that something similar would have to be done for the voices and even the appearance of people around the world – yet in an age moving toward artificial intelligence, where content has been scraped without permission, the only people who can actually stop the scraping are the ones doing it.

The world of trusted humans is being diluted by untrustworthy humans.

Wikipedia, and Its Trouble with LLMs.

Wikipedia, a wonderful resource despite all the drama that comes with the accumulation of content, is having some trouble dealing with the large language models (LLMs) out there. There are two core problems: the input and the output.

“…The current draft policy notes that anyone unfamiliar with the risks of large language models should avoid using them to create Wikipedia content, because it can open the Wikimedia Foundation up to libel suits and copyright violations—both of which the nonprofit gets protections from but the Wikipedia volunteers do not. These large language models also contain implicit biases, which often result in content skewed against marginalized and underrepresented groups of people

The community is also divided on whether large language models should be allowed to train on Wikipedia content. While open access is a cornerstone of Wikipedia’s design principles, some worry the unrestricted scraping of internet data allows AI companies like OpenAI to exploit the open web to create closed commercial datasets for their models. This is especially a problem if the Wikipedia content itself is AI-generated, creating a feedback loop of potentially biased information, if left unchecked…” 

“AI Is Tearing Wikipedia Apart”, Claire Woodcock, Vice.com, May 2, 2023.

The Input into Wikipedia.

Inheriting the legal troubles of companies that built AI models by taking shortcuts seems like a pretty stupid thing to do, but there are companies and individuals doing it. Fortunately, the Wikimedia Foundation is a bit more responsible, and is more sensitive to biases.

Using an LLM to generate content for Wikipedia is simply a bad idea. There are some tools out there (I wrote about Perplexity.ai recently) that do the legwork for citations, but with Wikipedia, not all citations are necessarily on the Internet. Some are in books, those dusty tomes where we have passed down knowledge over the centuries, and so it takes humans not just to find those citations, but to assess them and ensure that citations representing other perspectives are involved1.

As they mention in the article, first drafts are not a bad idea, but they’re also not a great idea. If you’re not invested enough in a topic to do the actual reading, should you really be editing a community encyclopedia? I don’t think so. Research is an important part of any accumulation of knowledge, and LLMs aren’t even good shortcuts, probably because the companies behind them took shortcuts.

The Output of Wikipedia.

I’m a little shocked that Wikipedia might not have been scraped by the companies that own LLMs, considering just how much they scraped and from whom. Wikipedia, to me, would have been one of the first things to scrape to build a learning model, as would have been Project Gutenberg. Now that they’ve had the leash yanked, maybe they’re asking for permission, but it seems peculiar that they would not have scraped that content in the first place.

Yet, unlike companies that simply cash in on the work of volunteers, like Huffington Post, StackOverflow, and so on, Wikimedia has a higher calling – and cashing in on volunteer works would likely mean fewer volunteers. Volunteers all have their own reasons, but in an organization they collectively work toward something. The Creative Commons license Wikipedia uses requires attribution, and LLMs don’t attribute anything. I can’t even get ChatGPT to tell me how many books it’s ‘read’.

What makes this simple is that if all the volunteer work from Wikipedia is shoved into the intake manifold of an LLM, and that LLM is subscription-based so that the volunteers themselves would have to pay to use it, it’s a non-starter.

We All Like The Idea of an AI.

Generally speaking, the idea of an AI being useful for so many things is seductive, from Star Trek to Star Wars. I wouldn’t mind an Astromech droid, but where science fiction meets reality, we are stuck with the informational economy and infrastructure we have inherited over the centuries. Certainly, it needs to be adapted, but there are practical things that need to be considered outside of the bubbles that a few billionaires seem to live in.

Taking the works of volunteers and works from the public domain2 to turn around and sell them sounds Disney in nature, yet Mickey Mouse’s fingerprints on the Copyright Act have also helped push back legally on copyright claims like these. Somewhere, there is a very confused mouse.

  1. Honestly, I’d love a job like that, buried in books. ↩︎
  2. Disney started off by taking public domain works and copyrighting their renditions of them, which was fine, but then they made sure no one else could do it – thus the ‘fingerprints’. ↩︎

DHS Artificial Intelligence Safety And Security Board Has Some Odd Appointments.

Now that we’ve seen that generative artificial intelligence can be trained ethically, without breaking copyright laws, the list of people appointed to the DHS Artificial Intelligence Safety and Security Board seems less than ideal.

The Board is supposed to ‘advance AI’s responsible development and deployment’ (emphasis mine), yet some on that Board took shortcuts.

Shortcuts in relation to any national security issue seem like a bad thing.

Here’s the list.

There are some dubious companies involved. The argument can be made – and it probably will be – that these companies are part of the national infrastructure, but does the national infrastructure control the United States, or is it the other way around?

I don’t know whether these picks are good or bad. I will say that there are some that, at least in the eyes of others, have been irresponsible. That would fall under Demonstrated Unreliability.

Copyright, Innovation, and the Jailbreak of the Mouse.

Not produced by Disney, generated by DeepAI.

On one hand, we have the jailbreak of Steamboat Willie into the public domain despite the best efforts of Disney. I’m not worried about it either way; I generated the image using DeepAI. If Disney is upset about it, I have no problem taking it down.

There’s a great write-up on the 1928 version of Mickey over at the Duke Center for the Study of the Public Domain, and you can see what you can and cannot do with the character through some of the links there.

So we have that aspect, where the Mickey Mouse Protection Act in 1998 extended copyright protection further. As Lessig pointed out in Free Culture, much of the Disney franchise was built on the public domain, with Disney copyrighting its own versions of works already in the public domain.

Personally, it doesn’t matter too much to me. I’ve never been a fan of Mickey Mouse, I’m not a big fan of Disney, and I have read many of the original works that Disney built off of – and I like them better. You can find most of them at Gutenberg.org.

In other news, OpenAI has admitted that it can’t train its AIs without copyrighted works.

Arguably, if there were more content in the public domain, OpenAI could train its AIs on that. Then there’s Creative Commons-licensed content that could also be used, but… well, that’s inconvenient.

So on one hand, we have a corporation making sure people don’t overstep in using the public domain Mickey – which has happened – and on the other hand we have a corporation complaining that copyright is too restrictive.

On one hand, we have a corporation defending what it still holds under copyright (people think Mickey went into the public domain, but only that 1928 version did), and on the other hand we have a corporation defending its wanton misuse of copyrighted materials.

Clearly something is not right with how we view copyright or innovation. Navigating that with lawyers seems like a disservice to everyone, but here we are.

The Ongoing Copyright Issue with Generative AI.

It’s a strange time. OpenAI (and Microsoft) are being sued by the New York Times, and they’re claiming ‘fair use’ – as if they’re having some coffee and discussing what they read in the New York Times, or are about to write a blog post about the entire published NYT archive, on demand.

It’s not just the New York Times, either. More and more authors are doing the same, and some started before the NYT did.

IEEE Spectrum’s article, “Generative AI Has a Visual Plagiarism Problem”, demonstrates issues that back up the copyright claims. This is not regurgitation, and it is not fair use – there is line-by-line text from the New York Times, amongst other things.
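The text side of that claim is mechanical enough to sketch. One crude way to check for verbatim reproduction – a toy approximation of my own, not anyone’s actual tooling, and the 8-word window is an arbitrary choice – is to count how many word sequences from an original show up unchanged in a model’s output:

```python
def ngrams(text, n=8):
    """All n-word sequences in a text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(original, generated, n=8):
    """Fraction of the original's n-word sequences reproduced verbatim."""
    src = ngrams(original, n)
    if not src:
        return 0.0
    return len(src & ngrams(generated, n)) / len(src)

# Toy strings; a high score on long windows is hard to wave off as coincidence.
original = "the cat sat on the mat and looked out the window at the rain"
generated = "it said the cat sat on the mat and looked out the window at the rain today"
print(verbatim_overlap(original, generated))  # 1.0 -- every 8-word run copied
```

Paraphrase scores near zero on windows that long; lifted text doesn’t.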

As I noted yesterday, OpenAI is now making deals for content, and I only caught this morning that ‘The New York Times, too, had been in conversations with OpenAI to establish a “high-value” partnership involving “real-time display” of its brand in ChatGPT, OpenAI’s AI-powered chatbot.’

Clearly, discussions didn’t work out. I was going to link the New York Times article on it, but it seems I’ve used up my freebies so I can’t actually read it right now unless I subscribe.1 At this end of things, as a simple human being, I’m subject to paywalls for content, but OpenAI hasn’t been. If I can’t read and cite an article from the New York Times for free, why should they be able to?

On the other hand, when I get content that originated from news sites like the New York Times, there is fair use happening. People transform what they have read and regurgitate it, some more intelligibly than others, much like an artificial intelligence – but there is at least etiquette, like linking the source. This is not something OpenAI does. It doesn’t give credit. It just inhales large amounts of text, and the algorithms decide on the best ways to spit it out to answer prompts. Like blogging, only faster, and like blogging, sometimes it just makes stuff up.

This is not unlike a well-read person doing the same. Ideas, thoughts, even memes are experiences we draw upon. What makes these generative artificial intelligences different? Speed. They also consume a lot more water, apparently.

The line has to be drawn somewhere, and since OpenAI isn’t living up to the first part of its name and is not being transparent, people are left poking a black box to see if their copyright has been sucked in without permission, mention, or recompense.

That does seem a bit like unfair use. This is not to say that the copyright system couldn’t use an overhaul, but apparently so could how generative AIs get their content.

What does the ‘Open’ in OpenAI mean, anyway?

  1. They do seem to have a good deal right now; I did try to subscribe, but it failed for some obscure reason. I’ll try again later. $10 for a year of the New York Times is a definite deal – if only they could have processed my payment this morning. ↩︎

Strategic Deception, AI, and Investors.

‘Strategic deception’ in large language models is indeed a thing. It should be unsurprising. After all, people do it all the time when trying to give the answer that is wanted by the person asking the question.

Large Language Models are designed to… give the answer wanted by the person asking the question.

That there had to be a report on this is a little disturbing. It’s the nature of the Large Language Model algorithms.

Strategic deception is at the very least one form of AI hallucination, which potentially reinforces biases that we might want to think twice about. Like Arthur Juliani, I believe the term ‘hallucinate’ is misleading, and I believe we’re seeing a shift away from it. Good.

It’s also something I simply summarize as ‘bullshitting’. It is, after all, just statistics, but it’s statistics toward an end, which makes the statistics pliable enough for strategic deception.
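As a toy illustration of ‘statistics toward an end’ – a deliberately simplified sketch of my own, not how any production model actually works – a language model just samples the next token from a probability distribution, and anything that reweights that distribution steers what comes out:

```python
import random

# Toy next-token distribution after a prompt like "Is this plan good?"
next_token_probs = {"yes": 0.40, "no": 0.35, "maybe": 0.25}

def sample(probs):
    """Sample one token according to its probability."""
    return random.choices(list(probs), weights=list(probs.values()))[0]

def reweight(probs, bias):
    """Scale selected tokens' probabilities and renormalize.

    This stands in for anything that tunes a model toward the answer the
    asker wants: the statistics stay statistics, but they now point
    toward an end.
    """
    raw = {tok: p * bias.get(tok, 1.0) for tok, p in probs.items()}
    total = sum(raw.values())
    return {tok: p / total for tok, p in raw.items()}

agreeable = reweight(next_token_probs, {"yes": 3.0})  # favor agreement
print(sample(next_token_probs))  # base: "yes" about 40% of the time
print(sample(agreeable))         # reweighted: "yes" about 67% of the time
```

Nothing in the reweighted model ‘knows’ it is being agreeable; the numbers just lean that way.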

It’s sort of like AI investors claiming ‘Fair Use’ when not paying for copyrighted materials in the large language models. If they truly believe that, it’s a strategic deception on themselves. If they wanted to find a way, they could, and they still may.

Lawsuit Regarding ChatGPT

Anonymous individuals are claiming that ChatGPT stole ‘vast amounts of data’ in what they hope to become a class action lawsuit. It’s a nebulous claim about the nebulous data that OpenAI has used to train ChatGPT.

…“Despite established protocols for the purchase and use of personal information, Defendants took a different approach: theft,” they allege. The company’s popular chatbot program ChatGPT and other products are trained on private information taken from what the plaintiffs described as hundreds of millions of internet users, including children, without their permission.

Microsoft Corp., which plans to invest a reported $13 billion in OpenAI, was also named as a defendant…”

“Creator of buzzy ChatGPT is sued for vacuuming up ‘vast amounts’ of private data to win the ‘A.I. arms race’”, Fortune.com, Teresa Xie, Isaiah Poritz and Bloomberg, June 28, 2023.

I’ve had suspicions myself about where their training data came from, but with no insight into the training model, how is anyone to know? That’s what makes this case interesting.

“…Misappropriating personal data on a vast scale to win an “AI arms race,” OpenAI illegally accesses private information from individuals’ interactions with its products and from applications that have integrated ChatGPT, the plaintiffs claim. Such integrations allow the company to gather image and location data from Snapchat, music preferences on Spotify, financial information from Stripe and private conversations on Slack and Microsoft Teams, according to the suit.

Chasing profits, OpenAI abandoned its original principle of advancing artificial intelligence “in the way that is most likely to benefit humanity as a whole,” the plaintiffs allege. The suit puts ChatGPT’s expected revenue for 2023 at $200 million…”

ibid (same article quoted above).

This would run contrary to what Sam Altman, CEO of OpenAI, put in writing before US Congress.

“…Our models are trained on a broad range of data that includes publicly available content, licensed content, and content generated by human reviewers.[3] Creating these models requires not just advanced algorithmic design and significant amounts of training data, but also substantial computing infrastructure to train models and then operate them for millions of users…”

[3] “Our Approach to AI Safety.” OpenAI, 5 Apr. 2023, https://openai.com/blog/our-approach-to-ai-safety.

“Written Testimony of Sam Altman, Chief Executive Officer, OpenAI, Before the U.S. Senate Committee on the Judiciary, Subcommittee on Privacy, Technology, & the Law”, Senate.gov, Sam Altman, CEO of OpenAI, May 16, 2023.

I would love to know who the anonymous plaintiffs are, and would love to know how they got enough information to make the allegations. I suppose we’ll find out more as this progresses.

I, for one, am curious where they got this training data from.

Whose Safe Space Is It Anyway?

Corporations have been creating “safe spaces” for themselves for some time, and while that can be read as either good or bad depending on how you feel about things, let’s just accept it as an objective truth.

Disney took things from the public domain and copyrighted their versions, making them as ubiquitous as their marketing – and then worked hard to close the door for others to do the same with Disney’s works, which should have passed into the public domain.

The Sonny Bono Act, or Mickey Mouse Protection Act, extended copyright terms to keep works from going into the public domain:

“…Following the Copyright Act of 1976, copyright would last for the life of the author plus 50 years (or the last surviving author), or 75 years from publication or 100 years from creation, whichever is shorter for a work of corporate authorship (works made for hire) and anonymous and pseudonymous works. The 1976 Act also increased the renewal term for works copyrighted before 1978 that had not already entered the public domain from 28 years to 47 years, giving a total term of 75 years.[3]

The 1998 Act extended these terms to life of the author plus 70 years and for works of corporate authorship to 120 years after creation or 95 years after publication, whichever end is earlier.[4] For works published before January 1, 1978, the 1998 act extended the renewal term from 47 years to 67 years, granting a total of 95 years.

This law effectively froze the advancement date of the public domain in the United States for works covered by the older fixed term copyright rules…”

Copyright Term Extension Act, Wikipedia, accessed on 16 May 2023.
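To make the term arithmetic in that quote concrete, here is a minimal sketch – the function names are my own, purely illustrative – of the corporate-authorship terms before and after the 1998 Act, applied to 1928’s Steamboat Willie:

```python
def expiry_1976(published, created):
    """Corporate-authorship term under the 1976 Act: 75 years from
    publication or 100 years from creation, whichever is shorter."""
    return min(published + 75, created + 100)

def expiry_1998(published, created):
    """The same term after the 1998 extension: 95 years from publication
    or 120 years from creation, whichever ends earlier."""
    return min(published + 95, created + 120)

# Steamboat Willie, created and published in 1928:
print(expiry_1976(1928, 1928))  # 2003 -- two decades earlier
print(expiry_1998(1928, 1928))  # 2023 -- public domain on January 1, 2024
```

That 20-year gap is exactly the freeze the quote describes, and why the 1928 Mickey only just jailbroke.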

Corporations acted in their own self-interest. Lawrence Lessig’s Free Culture was the first place I read of it, though I don’t know that he was the first to note it. They created a safe space for their copyrights, even though those works had their roots in the public domain.

The world is full of other examples.

Bill Gates would dumpster-dive and study code printouts, among other things. Lots of people don’t seem to know that Microsoft’s famous founder didn’t start without understanding – and borrowing, if not buying – code from others. There’s nothing particularly shameful about it.

“The best way to prepare is to write programs, and to study great programs that other people have written. In my case, I went to the garbage cans at the Computer Science Center and I fished out listings of their operating systems.”

Bill Gates, interview with Susan Lammers, 1986.

I don’t think any programmer would disagree with the sentiment. Yet the same Bill Gates who did that also wrote an open letter to hobbyists in 1976 that did not reflect it:

“…The feedback we have gotten from the hundreds of people who say they are using BASIC has all been positive. Two surprising things are apparent, however, 1) Most of these “users” never bought BASIC (less than 10% of all Altair owners have bought BASIC), and 2) The amount of royalties we have received from sales to hobbyists makes the time spent on Altair BASIC worth less than $2 an hour.

Why is this? As the majority of hobbyists must be aware, most of you steal your software. Hardware must be paid for, but software is something to share. Who cares if the people who worked on it get paid?…”

“An Open Letter To Hobbyists”, Bill Gates, cited in the New York Times archives.

Most people would say, “Well, he has a point.” And he did – in protecting a business model he was creating, one which kept people from being able to view the source code to learn from it. Was it a bad thing? A good thing? It doesn’t matter; it was a thing.

At the time, it was a bunch of scattered hobbyists before the Internet against a corporation that could afford lawyers and marketing. It was the wild, wild west of personal computing.

The above examples are two of many ‘negotiations‘ between the public and corporations, though with the increased political influence corporations have through lobbying – and with money now being free speech – it’s hard to consider it a legitimate negotiation.

If you have 45 minutes, Pirates of Silicon Valley is worth watching.

The point is that corporations always do things like this, for better or worse – and sometimes for better and worse. With the emergence of artificial-intelligence-like technologies, the safe space of creators is being abstracted away into statistics. By extension, this also applies to the privacy of everyone’s data.

My thought is, the greater the wings, the more confident the bird should be where it roosts. If corporations are indeed made of individuals working toward common goals and are creating things, that’s great! But it should not come at the cost of competition, which is one of the founding principles of capitalism… which corporations tend to cite only when convenient.

“Do as we say. Not as we do.”

Is Output of ChatGPT Text a Derived Work?

One of the things that has bothered me most about ChatGPT is that its data was scraped from the Internet, where a fair amount of writing I have done resides. It would be hubris to think that what I wrote is so awesome that it could be ‘stealing’ from me, but it would also be idiotic to think that content ChatGPT produces isn’t derivative in a legal sense. In a world almost critically defined by self-preservation, I think we all should know where the line is. We don’t, really, but we should.

I’m no lawyer, but I’ve had my own ‘fun’ with copyright.

In fact, New Tech Observations from the UK (ntouk) seems to have caught ChatGPT lifting the plot of Alice in Wonderland without any attribution. There are legal issues here that seem to have been ignored in most of the hype, where even reusing content from ChatGPT could be seen as contributing to the infringement.

That hasn’t really stopped anyone, since most people don’t seem to take copyright seriously unless they work for an organization that does – and even then, only within specific contexts. This is why I point out wherever I have used a large language model such as ChatGPT, since I’m citing it citing nobody – and even then, I don’t use it for generating content other than some interesting images.

Entities with deep pockets are protected by their deep pockets, but the average person writing on the Internet has shallower pockets – and there are more of us. I’ve had content ‘borrowed’ without attribution. It can evoke anything from mild amusement to outrage, particularly when some schmuck borrows it to create a popular post without citation so that they can ‘produce’ content they didn’t actually produce. And copyright is implicit.

Privacy is a partner to copyright as well. I’m wondering when the question will be raised about text scraped for these training models from publishers that deal mainly in text rather than images – because the image lawsuits are already happening.

For now, I suppose, don’t put anything online that you wouldn’t want anyone regurgitating without attribution.

3D Printers Disrupting Copyrights and Patents (2013)

3D-Printed Slide-Together

It wasn’t too long ago that DIY 3D printers started making the news in geek circles. It captured the imagination of some and was largely ignored by others. I was somewhere in between until a few days ago. I’ve since slid somewhere closer to imagination, perhaps because someone printed a rotor from a Wankel engine and posted it on Facebook.

This got me thinking that in a few years, I could quite literally have a custom Wankel engine printed. Factor in last month’s Scientific American article, “Information Is Driving a New Revolution in Manufacturing”, and things get even more interesting. It enters a new dimension when, in the future, having a 3D printer is as normal as having a paper printer now. And we all know those ink cartridges are ripoffs.

It’s going to get really interesting. Already, a gun has been 3D-printed and test-fired, creating a bit of panic among some, and the government is acting predictably. This is hopefully not the highlight of 3D printing’s contribution to society; I’d have preferred not to mention it, but there’s no getting around it. We’ll get past that speed bump, or we’ll all die in a crossfire of 3D-printed guns.

Back to printing an engine. Or parts for your car. Or parts for your computer. Then we get into the cost of raw materials and an economic upheaval as people who fabricate things become less and less needed. What the printing press was to scribes, 3D printing is to the majority of fabricators.

But wait. There’s more.

You know all those copyright and patent issues we have with technology and, yes, even agriculture (Monsanto vs. Farmer)? What happens when 3D designs escape into the wild, as the 3D-printed gun already has? How are companies going to control that?

They won’t. They’ll try, but in the long run they won’t.

It’s a strange new world coming. Some people are going to become very upset.

Image at top left courtesy Flickr user fdecomite, made available under this Creative Commons License.