Beyond A Widowed Voice.

By now, the news of Scarlett Johansson’s dispute with OpenAI over a voice that sounds like hers has made the rounds. She’s well known, and regardless of one’s interests, she’s likely to pop up in various contexts. However, she’s not the first.

While different in some ways, voice actors Paul Skye Lehrman and Linnea Sage are suing Lovo for similar reasons. They were hired for what they thought were one-off voice-overs, then later heard their voices saying things they had never said. More to the point, they heard their voices doing work they didn’t get paid for.

The way they found out was oddly poetic.

Last summer, as they drove to a doctor’s appointment near their home in Manhattan, Paul Skye Lehrman and Linnea Sage listened to a podcast about the rise of artificial intelligence and the threat it posed to the livelihoods of writers, actors and other entertainment professionals.

The topic was particularly important to the young married couple. They made their living as voice actors, and A.I. technologies were beginning to generate voices that sounded like the real thing.

But the podcast had an unexpected twist. To underline the threat from A.I., the host conducted a lengthy interview with a talking chatbot named Poe. It sounded just like Mr. Lehrman.

“He was interviewing my voice about the dangers of A.I. and the harms it might have on the entertainment industry,” Mr. Lehrman said. “We pulled the car over and sat there in absolute disbelief, trying to figure out what just happened and what we should do.”

"What Do You Do When A.I. Takes Your Voice?", Cade Metz, New York Times, May 16th, 2024.

They aren’t sex symbols like Scarlett Johansson. Neither was the world’s highest-paid actress in 2018 and 2019, as she was. They aren’t seen in major films. Their problem is just as real, just as audible, but not quite as visible. Forbes covered the problems voice actors face back in October of 2023.

…Clark, who has voiced more than 100 video game characters and dozens of commercials, said she interpreted the video as a joke, but was concerned her client might see it and think she had participated in it — which could be a violation of her contract, she said.

“Not only can this get us into a lot of trouble if people think we said [these things], but it’s also, frankly, very violating to hear yourself speak when it isn’t really you,” she wrote in an email to ElevenLabs that was reviewed by Forbes. She asked the startup to take down the uploaded audio clip and prevent future cloning of her voice, but the company said it hadn’t determined that the clip was made with its technology. It said it would only take immediate action if the clip was “hate speech or defamatory,” and stated it wasn’t responsible for any violation of copyright. The company never followed up or took any action.

“It sucks that we have no personal ownership of our voices. All we can do is kind of wag our finger at the situation,” Clark told Forbes

"'Keep Your Paws Off My Voice': Voice Actors Worry Generative AI Will Steal Their Livelihoods", Rashi Shrivastava, Forbes.com, October 9th, 2023.

As you can see – the whole issue is not new. It just became more famous because of a more famous face, and because it involves OpenAI, a company with more questions about its training data than ChatGPT can answer, so the story has been sung from the rooftops.

Meanwhile, some are trying to license the voices of dead actors.

Sony recently warned AI companies about unauthorized use of the content it owns, but when one’s content is necessarily public, how do you protect it?

How much of what you post, from writing to pictures to voices in podcasts and family videos, can you control? Posting costs nothing, but losing control of it costs individuals their futures. And when it comes to training models, these AI companies are eroding the very trust they need from the people they want to sell their product to – unless they’re just enabling talentless and incapable hacks to take over jobs that talented and capable people already do.

We have more questions than answers, and the trust erodes as more and more people are impacted.

When The Internet Eats Itself

The recent news of Stack Overflow selling its content to OpenAI was something I expected. It was a matter of time. Users of Stack Overflow were surprised, which I am surprised by, and upset, which I’m not surprised by.

That seems to me a reasonable response. Who wouldn’t be? Yet when we contribute for free to websites on the Internet that aren’t our own, it’s always a terrible bargain. You give of yourself for whatever reason – fame, prestige, or just sincerely enjoying helping – and it gets traded into cash by someone else.

But companies don’t want you to get wise. They want you to give them your content for free so that they can tie a bow around it and sell it. You might get a nice “Thank you!” email, or little awards of no value.

No Good Code Goes Unpunished.

The fallout has been disappointing. People have tried logging in and sabotaging their top answers. I spoke to one guy on Mastodon a few days ago and he got banned for it. It seems pretty obvious to me that Stack Overflow had already backed up the database of contributions, and that they would be keeping an eye out for sabotage. Software developers should know that. There was also some confusion about the Creative Commons licensing the site uses versus the rights granted to the owners of the website, which are two separate things.

Is it slimy? You bet. It’s not new, and the companies training generative AI have been pretty slimy. The problem isn’t generative AI, it’s the way the companies decide to do business by eroding trust with the very market for their product while poisoning wells that they can no longer drink from. If you’re contributing answers for free that will be used to train AI to give the same answers for a subscription, you’re a silly person1.

These days, generative AI companies need to put filters on the front of their learning models to keep small children from getting sucked in.

Remember Huffington Post?

Huffington Post had this neat little algorithm for swapping around headlines until it found one that people liked, it gamed SEO, and it built itself into a powerhouse that almost no one remembers now. It was social, it was quirky, and it was fun. Volunteers put up lots of great content.

When The Huffington Post sold for $315 million, the volunteers who had provided the content for free and built the site up before the sale sued – and got nothing. Why? Because they had volunteered their work.

I knew a professional journalist who was building up her portfolio and added some real value – I met her at a conference in Chicago probably a few months before the sale, and I asked her why she was contributing to HuffPost for free. She said it was a good outlet to get some things out – and she was right. When it sold, she was angry. She felt betrayed, and rightfully so I think.

It seems people weren’t paying attention to that. I did2.

You live, you learn, and you don’t do it again. With firsthand and secondhand experience, these days if I write on a website and I don’t get paid, it’s my own website. Don’t trust anyone who says, “Contribute and good things will happen!” Yeah, they might, but it’s unlikely they will happen for you.

If your content is good enough for a popular site, it’s good enough to get paid to be there. You in the LinkedIn section – pay attention.

Back To AI’s Intake Manifold.

I’ve written about companies with generative AI models scraping around looking for content, with works contributed to websites becoming part of the training data. It’s their oil; it’s what keeps them burning through cash as they try to… replace the people whose content they use. In return, the Internet gets slop generated all over it, and you’ll know the slop when you read it – it lacks soul and human connection, though it fakes it from time to time, like the pornographic videos that make the inexperienced think that’s what sex is really like. Nope.

The question we should be asking is whether it’s worth putting anything on the Internet at this point, just to have it folded into a statistical algorithm that chews up our work and spits out something like it. Sure, there are copyright lawsuits happening. The transformative-works argument doesn’t really hold up in a sane mind given the exponentially larger amount of content used to create a generative AI at this point.

So what happens when fewer people contribute their own work? One thing is certain: the social aspect of the Internet will not thrive as well.

Social.

The Stack Overflow website was mainly an annoyance for me over the years, but I understand that many people had a thriving society of a sort there. It was largely a meritocracy, like open source, at least at its core. You’ll note that I’m writing of it in the past tense – I don’t think anyone with any bit of self-worth will contribute there anymore.

The annoyance aspect for me came from (1) not finding solutions to the quirky problems that people hired me to solve3, and (2) finding code fragments I tracked down to Stack Overflow poorly (if at all) adapted to the employer’s or client’s needs. I had also learned not to give away valuable things for free, so I didn’t get involved. Most, if not all, of the work I did required my silence on how things worked, and if you get on a site like Stack Overflow, your keyboard might just get you in trouble. Yet the problem wasn’t the site itself, but those who borrowed code like it was a cup of sugar instead of a recipe.

Beyond us software engineers, developers, or whatever we call ourselves these days, there are a lot of websites with social interaction that are likely getting their content shoved into an AI learning model at some point. LinkedIn, owned by Microsoft and annoyingly in the top search results, is ripe for being used that way.

LinkedIn doesn’t pay for content, yet if you manage to get popular, you can make money off of sponsored posts. “Hey, say something nice about our company, here’s $x.” That’s not really social, but it’s how ‘influencers’ make money these days. When you get paid to write posts in that way, you might be selling your soul unless you keep a good moral compass, but when bills need to get paid, that moral compass sometimes goes out the window. I won’t say everyone is like that, but I will say it’s a danger, and it’s why I don’t care much about ‘influencers’.

In my mind, anyone who is an influencer is trying to sell me something, or has an ego so large that Zaphod Beeblebrox would be insanely jealous.

Regardless, to get popular, you have to contribute content. Who owns LinkedIn? Microsoft. Who is Microsoft partnered with? OpenAI. The dots are there. Maybe they’re not connected. Maybe they are.

Other websites are out there that are building on user content. The odds are good that they have more money for lawyers than you do, that their content licensing and user agreement work for them and not you, and if someone wants to buy that content for any reason… you’ll find out what users on Stack Overflow found out.

All relationships are built on trust. All networks are built on trust. The Internet is built on trust.

The Internet is eating itself.

  1. I am being kind. ↩︎
  2. I volunteered some stuff to WorldChanging.com way back when, with the understanding it would be Creative Commons licensed. I went back and forth with Alex and Jamais, as did a few other contributors, and because of that and some nastiness related to the Alert Retrieval Cache, I walked away from the site – only to find out later, from an editor who contacted me about their book, that they wanted to use some of my work. Nope. I don’t trust futurists, and maybe you shouldn’t either. ↩︎
  3. I always seemed to be the software engineer who could make sense out of gobbledygook code, rein it in, take it to water, and convince it to drink. ↩︎

Copyright, AI, And Doing It Ethically.

It’s no secret that the generative, sequacious artificial intelligences out there have copyright issues. I’ve written about it myself quite a bit.

It’s almost become cliche to mention copyright and AI in the same sentence, with Sam Altman having said that there would be no way to do generative AI without all that material – toward the end of this post, you’ll see that someone proved that wrong.

"Copyright Wars pt. 2: AI vs the Public", written by Toni Aittoniemi in January of 2023, is a really good read on the problem of large AI companies sucking in content without permission. If an individual did it, the large companies doing it would call it ‘piracy’, but now, it’s… not? That’s crazy.

The timing of finding Toni on Mastodon was perfect. Yesterday, I found a story on Wired that demonstrates some of what Toni wrote last year, where he posed a potential way to handle the legal dilemmas surrounding creators’ rights – we call it ‘copyright’ because someone was pretty unimaginative and pushed two words together for only one meaning.

In 2023, OpenAI told the UK parliament that it was “impossible” to train leading AI models without using copyrighted materials. It’s a popular stance in the AI world, where OpenAI and other leading players have used materials slurped up online to train the models powering chatbots and image generators, triggering a wave of lawsuits alleging copyright infringement.

Two announcements Wednesday offer evidence that large language models can in fact be trained without the permissionless use of copyrighted materials.

A group of researchers backed by the French government have released what is thought to be the largest AI training dataset composed entirely of text that is in the public domain. And the nonprofit Fairly Trained announced that it has awarded its first certification for a large language model built without copyright infringement, showing that technology like that behind ChatGPT can be built in a different way to the AI industry’s contentious norm.

“There’s no fundamental reason why someone couldn’t train an LLM fairly,” says Ed Newton-Rex, CEO of Fairly Trained. He founded the nonprofit in January 2024 after quitting his executive role at image-generation startup Stability AI because he disagreed with its policy of scraping content without permission….

"Here’s Proof You Can Train an AI Model Without Slurping Copyrighted Content", Kate Knibbs, Wired.com, March 20th, 2024.

It struck me yesterday that a lot of us writing and communicating about the copyright issue haven’t addressed how it could be handled. It’s not that we didn’t know it could be handled; we just haven’t addressed it as much as we should. I went to sleep considering that and in the morning found that Toni had done much of the legwork.

What Toni wrote extends on that idea:

…Any training database used to create any commercial AI model should be legally bound to contain an identity that can be linked to a real-world person if so required. This should extend to databases already used to train existing AI’s that do not yet have capabilities to report their sources. This works in two ways to better allow us to integrate AI in our legal frameworks: Firstly, we allow the judicial system to work it’s way with handling the human side of the equation instead of concentrating on mere technological tidbits. Secondly, a requirement of openness will guarantee researches to identify and question the providers of these technologies on issues of equality, fairness or bias in the training data. Creation of new judicial experts at this field will certainly be required from the public sector…

“Copyright Wars pt. 2: AI vs the Public”, Toni Aittoniemi, Gimulnautti, January 13th, 2023.

This is sort of like – and it’s my interpretation – a tokenized citation system built into the model itself. It would expand on what, as an example, Perplexity AI does, by allowing style and ideas to have provenance.
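
To make my interpretation a bit more concrete, here’s a minimal sketch of what a provenance record attached to a single item of training data might look like. This is purely illustrative – the field names, the hashing, and the idea of surfacing the record as a citation are my own assumptions, not anything Toni or any AI company has actually specified.

```python
from dataclasses import dataclass, field
from datetime import date
from hashlib import sha256

@dataclass
class ProvenanceRecord:
    """Hypothetical provenance entry for one item in a training dataset."""
    content: str          # the text actually used for training
    source_url: str       # where it was obtained
    rights_holder: str    # a real-world person or entity, per Toni's proposal
    license: str          # e.g. "CC-BY-4.0", "public domain", "licensed"
    date_collected: date = field(default_factory=date.today)

    @property
    def content_hash(self) -> str:
        """A stable identifier so an output could later be traced back to its sources."""
        return sha256(self.content.encode("utf-8")).hexdigest()

# Example: a record that could be surfaced as a citation when the model
# leans on this material, similar in spirit to how Perplexity cites pages.
record = ProvenanceRecord(
    content="Text of an article used in training...",
    source_url="https://example.com/article",
    rights_holder="Jane Doe",
    license="CC-BY-4.0",
)
print(record.content_hash[:12], record.rights_holder, record.license)
```

Retrofitting something like this onto databases already used to train existing models – which Toni argues should also be required – is obviously the harder part.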

This is some great food for thought for the weekend.

From Inputs to The Big Picture: An AI Roundup

This started off as a baseline post regarding generative artificial intelligence and its aspects, and it grew fairly long because even as I was writing it, information was coming out. It’s my intention to do a ’roundup’ like this highlighting different focuses as needed. Every bit of it is connected, but in social media postings things tend to be written about in silos. I’m attempting to integrate them, since the larger implications are hidden in the details, and will try to stay on top of it as things progress.

It’s long enough that it could have been several posts, but I wanted it all together at least once.

No AI was used in the writing, though some images have been generated by AI.

The two versions of artificial intelligence on the table right now – the marketed and the reality – have various problems that make it seem like we’re wrestling a mating orgy of cephalopods.

The marketing aspect is a constant distraction, feeding us what helps with stock prices and good will toward those implementing the generative AIs, while the real aspect of these generative AIs is not really being addressed in a cohesive way.

To simplify, this post breaks it down into the Input, the Output, and the impacts on the ecosystem the generative AIs work in.

The Input.

There’s a lot that goes into these systems other than money and water. There’s the information used for the learning models, the hardware needed, and the algorithms used.

The Training Data.

The focus so far has been on what goes into the training data, and that has been an issue involving lawsuits and, less obviously, trust in the companies involved.

…The race to lead A.I. has become a desperate hunt for the digital data needed to advance the technology. To obtain that data, tech companies including OpenAI, Google and Meta have cut corners, ignored corporate policies and debated bending the law, according to an examination by The New York Times…

"How Tech Giants Cut Corners to Harvest Data for A.I.", Cade Metz, Cecilia Kang, Sheera Frenkel, Stuart A. Thompson and Nico Grant, New York Times, April 6, 2024 1

Of note, too, is that Google has been indexing AI-generated books – what is called ‘synthetic data’, something that has been warned against – yet it’s something companies are planning for, or even doing already, consciously or unconsciously.

While some of these actions are of questionable legality, to some the ethics are far less questionable – hence the revolt mentioned last year against AI companies using content without permission. It’s of questionable effect, because no one seems to have insight into what the training data consists of, and no one seems to be auditing it.

There’s a need for that audit, if only to allow for trust.

…Industry and audit leaders must break from the pack and embrace the emerging skills needed for AI oversight. Those that fail to address AI’s cascading advancements, flaws, and complexities of design will likely find their organizations facing legal, regulatory, and investor scrutiny for a failure to anticipate and address advanced data-driven controls and guidelines.

"Auditing AI: The emerging battlefield of transparency and assessment", Mark Dangelo, Thomson Reuters, 25 Oct 2023.

While everyone is hunting down data, no one seems to be seriously working on oversight and audits, at least in a public way, though the United States is pushing for global regulations on artificial intelligence at the UN. The status of that doesn’t seem to have been updated, even as artificial intelligence is being used to select targets in at least two wars right now (Ukraine and Gaza).

There’s an imbalance here that needs to be addressed. It would be sensible to have external auditing of the learning data models and their sources, as well as the algorithms involved – and, to get a little ahead, of the output too. Of course, these sorts of things should be done with trading on stock markets as well, though that doesn’t seem to have made much headway in all the time that has been happening either.

Some websites are trying to block AI crawlers, and it is an ongoing process. Blocking them requires knowing who they are, and it doesn’t guarantee that bad actors won’t stop by anyway.
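
For what it’s worth, the blocking most sites attempt is just a robots.txt file that asks crawlers, by their published user-agent names, to stay away – a polite request rather than an enforcement mechanism. A minimal sketch might look like the following; the user-agent strings shown are ones the respective companies have documented (OpenAI’s GPTBot, Google’s Google-Extended token, Common Crawl’s CCBot), and anything not on the list, or any crawler that simply ignores robots.txt, walks right past it.

```
# robots.txt - a request, not a lock
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else may crawl normally
User-agent: *
Allow: /
```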

There is a new bill being pressed in the United States, the Generative AI Copyright Disclosure Act, that is worth keeping an eye on:

“…The California Democratic congressman Adam Schiff introduced the bill, the Generative AI Copyright Disclosure Act, which would require that AI companies submit any copyrighted works in their training datasets to the Register of Copyrights before releasing new generative AI systems, which create text, images, music or video in response to users’ prompts. The bill would need companies to file such documents at least 30 days before publicly debuting their AI tools, or face a financial penalty. Such datasets encompass billions of lines of text and images or millions of hours of music and movies…”

"New bill would force AI companies to reveal use of copyrighted art", Nick Robins-Early, TheGuardian.com, April 9th, 2024.

Given how much information is used by these companies already from Web 2.0 forward, through social media websites such as Facebook and Instagram (Meta), Twitter, and even search engines and advertising tracking, it’s pretty obvious that this would be in the training data as well.

The Algorithms.

The algorithms for generative AI are pretty much trade secrets at this point, but one has to wonder why so much data is needed to feed the training models when better algorithms could require less. Consider that a well-read person can answer some questions, even as a layperson, with a smaller carbon footprint. We have no insight into the algorithms either, which makes it seem as though these companies are simply throwing more hardware and data at the problem rather than being more efficient with the data and hardware they have already taken.

There’s not much news about that, and it’s unlikely that we’ll see any. It does seem like fuzzy logic is playing a role, but it’s difficult to say to what extent, and given the nature of fuzzy logic, it’s hard to say whether its implementation is as good as it should be.

The Hardware.

Generative AI has brought about an AI chip race between Microsoft, Meta, Google, and Nvidia, which leaves smaller companies that can’t afford to compete in that arena at a disadvantage so great that competing could be seen as impossible, at least at present.

The future holds quantum computing, which could make all of the present efforts obsolete, but no one seems interested in waiting around for that to happen. Instead, it’s full speed ahead with NVIDIA presently dominating the market for hardware for these AI companies.

The Output.

One of the larger topics that seems to have faded is what some called ‘hallucinations’ by generative AI. Strategic deception was also very prominent for a short period.

There is criticism that the algorithms are making the spread of false information faster, and the US Department of Justice is stepping up efforts to go after the misuse of generative AI. This is dangerous ground, since algorithms are being sent out to hunt the products of other algorithms, and the crossfire between them doesn’t care too much about civilians.2

As for the impact on education: as students use generative AI, education itself has been disrupted. It is being portrayed as an overall good, which may simply be an acceptance that it’s not going away. It’s interesting to consider that the AI companies have taken in more content than students could possibly access or afford through the educational system, which is something worth exploring.

Given that ChatGPT is presently 82% more persuasive than humans – likely because it has been trained on persuasive works (Input; Training Data) – and since most content on the Internet is marketing products, services, or ideas, that was predictable. While it’s hard to say how much of the content being put into training data feeds our confirmation biases, it’s fair to say that at least some of it does. Then there are the other biases that the training data inherits through omission or the selective writing of history.

There are a lot of problems, clearly, and much of it can be traced back to the training data, which even on a good day is as imperfect as we are; it can magnify, distort, or even be consciously influenced by good or bad actors.

And that’s what leads us to the Big Picture.

The Big Picture

…For the past year, a political fight has been raging around the world, mostly in the shadows, over how — and whether — to control AI. This new digital Great Game is a long way from over. Whoever wins will cement their dominance over Western rules for an era-defining technology. Once these rules are set, they will be almost impossible to rewrite…

"Inside the shadowy global battle to tame the world’s most dangerous technology", Mark Scott, Gian Volpicelli, Mohar Chatterjee, Vincent Manancourt, Clothilde Goujard and Brendan Bordelon, Politico.com, March 26th, 2024.

What most people don’t realize is that the ‘game’ includes social media and the information it provides for training models, such as what is happening with TikTok in the United States now. There is a deeper battle, and just perusing content on social networks gives data to those building training models. Even WordPress.com, where this site is presently hosted, is selling data, though there is a way to unvolunteer oneself.

Even the Fediverse is open to data being pulled for training models.

All of this, combined with the persuasiveness of generative AI that has given psychology pause, has democracies concerned about the influence. In a recent example, Grok, Twitter X’s AI for paid subscribers, fell victim to what was clearly satire and caused a panic – which should also have us wondering about how we view intelligence.

…The headline available to Grok subscribers on Monday read, “Sun’s Odd Behavior: Experts Baffled.” And it went on to explain that the sun had been, “behaving unusually, sparking widespread concern and confusion among the general public.”…

"Elon Musk’s Grok Creates Bizarre Fake News About the Solar Eclipse Thanks to Jokes on X", Matt Novak, Gizmodo, 8 April 2024.

Of course, some levity is involved in that one, whereas Grok posting that Iran had struck Tel Aviv (Israel) with missiles seems dangerous, particularly when posted to the front page of Twitter X. It shows the dangers of AI-driven fake news, deepens concerns related to social media and AI, and should make us ask why billionaires involved in artificial intelligence wield the influence that they do. How much of that influence is generated? We have an idea of how much is lobbied for.

Meanwhile, Facebook has been spamming users and has been restricting accounts without demonstrating a cause. If there were a video tape in a Blockbuster on this, it would be titled, “Algorithms Gone Wild!”.

Journalism is also impacted by AI, though real journalists tend to be rigorous with their sources. Real newsrooms have rules, and while we don’t have that much insight into how AI is being used in newsrooms, it stands to reason that if a newsroom is to be a trusted source, it will go out of its way to make sure that it is: it has a vested interest in getting things right. This has not stopped some websites parading as trusted sources from disseminating untrustworthy information, because even in Web 2.0, when the world had an opportunity to discuss such things at the World Summit on the Information Society, the country with the largest web presence did not participate much, if at all, at a government level.

Then we have the thing that concerns the most people: their lives. Jon Stewart even did a Daily Show segment on it, which is worth watching, because people are worried about generative AI taking their jobs – with good reason. Even as the Davids of AI3 square off for your market share, layoffs have been happening in tech as companies reposition for AI.

Meanwhile, AI is also apparently being used as a cover for some outsourcing:

Your automated cashier isn’t an AI, just someone in India. Amazon made headlines this week for rolling back its “Just Walk Out” checkout system, where customers could simply grab their in-store purchases and leave while a “generative AI” tallied up their receipt. As reported by The Information, however, the system wasn’t as automated as it seemed. Amazon merely relied on Indian workers reviewing store surveillance camera footage to produce an itemized list of purchases. Instead of saving money on cashiers or training better systems, costs escalated and the promise of a fully technical solution was even further away…

"Don’t Be Fooled: Much ‘AI’ is Just Outsourcing, Redux", Janet Vertesi, TechPolicy.com, Apr 4, 2024.

Maybe AI is creating jobs in India by proxy. It’s easy to blame problems on AI, too, which is a larger problem because the world often looks for something to blame and having an automated scapegoat certainly muddies the waters.

And the waters of The Big Picture of AI are muddied indeed – perhaps partly by design. After all, those involved are making money, and they now have even better tools to influence markets, populations, and you.

In a world that seems to be running a deficit when it comes to trust, the tools we’re creating seem to be increasing rather than decreasing that deficit at an exponential pace.

  1. The full article at the New York Times is worth expending one of your free articles, if you’re not a subscriber. It gets into a lot of specifics, and is really a treasure chest of a snapshot of what companies such as Google, Meta and OpenAI have been up to and have released as plans so far. ↩︎
  2. That’s not just a metaphor, as the Israeli use of Lavender (AI) has been outed recently. ↩︎
  3. Not the Goliaths. David was the one with newer technology: The sling. ↩︎

WordPress.com, Tumblr to Sell Information For AI Training: What You Can Do.

I accidentally posted this on RealityFragments.com, but I think it’s important enough to leave it there. The audiences vary, but both have other bloggers on them.

While I was figuring out how to be human in 2024, I missed that Tumblr and WordPress posts will reportedly be used for OpenAI and Midjourney training.

This could be a big deal for people who take the trouble to write their own content rather than filling the web with Generative AI text to just spam out posts.

If you’re involved with WordPress.org, it doesn’t apply to you.

WordPress.com has an option to use Tumblr as well, so when you post to WordPress.com it automagically posts to Tumblr. Therefore you might have to visit both of the posts below and adjust your settings if you don’t want your content to be used in training models.

This doesn’t mean that they haven’t already sent information to Midjourney and OpenAI. We don’t really know, but from the moment you change your settings…

  • WordPress.com: How to opt out of the AI training is available here.

    It boils down to this part in your blog settings on WordPress.com:


  • With Tumblr.com, you should check out this post. Tumblr is trickier, and the link text is pretty small around the images – what you need to remember is that after you select your blog in the left sidebar, you need to use the ‘Blog Settings’ link in the right sidebar.

Hot Take.

When I was looking into all of this, it turns out that Automattic, the owner of WordPress.com and Tumblr.com, is doing the sale.

If you look at your settings, if you haven’t changed them yet, you’ll see that the default was set to allow the use of content for training models. The average person who uses these sites to post their content is likely unaware, and in my opinion, if they wanted to do this the right way, the default setting would have been opted out.

It’s unclear whether they have already sent posts. I’m sure there’s an army of lawyers who will point out that they did announce it in places and that the onus was on users to stay informed. It’s rare for me to use the word ‘shitty’ on KnowProSE.com, but I think it’s probably the best way to describe how this happened.

It was shitty of them to set it up like this. See? It works.

Now some people may not care. They may not be paying users, or they just don’t care, and that’s fine. Personal data? Well, let’s hope that got scrubbed.

Some of us do. I don’t know how many, so I can’t say a lot or a few. Yet if Automattic, the parent company of both Tumblr and WordPress.com, will post that it cares about user choices, it hardly seems appropriate that the default choice was to share rather than to opt out.

As a paying user of WordPress.com, I think it’s shitty to assume I would allow what I write, using my own brain, to be used for a training model that the company gets paid for. I don’t see any of that money. To add injury to that insult to my intelligence, Midjourney and OpenAI also sell subscriptions to the trained AI, one of which (ChatGPT) I also pay for.

To make matters worse, we sort of have to take the training models on the word of those that use them. They don’t tell us what’s in them or where the content came from.

This is my opinion. It may not suit your needs, and if it doesn’t, have a pleasant day. But if you agree with it, go ahead and make sure your blog is not allowing third-party data sharing.

Personally, I’m unsurprised at how poorly this has been handled. Just follow some of the links early on in the post and revel in dismay.

Copyright, Innovation, and the Jailbreak of the Mouse.

Not produced by Disney, generated by deepai.

On one hand, we have the jailbreak of Steamboat Willie into the public domain despite the best efforts of Disney. I’m not worried about it either way; I generated the image using Deepai. If Disney is upset about it, I have no problem taking it down.

There’s a great write-up on the 1928 version of Mickey over at the Duke Center for the Study of the Public Domain, and through some of the links there you can see what you can and can’t do with the character.

So we have that aspect, where the Mickey Mouse Protection Act in 1998 extended copyright protection even further. As Lessig pointed out in Free Culture, much of the Disney franchise was built on the public domain, with Disney copyrighting its own versions of works already in the public domain.

Personally, it doesn’t matter too much to me. I’ve never been a fan of Mickey Mouse, I’m not a big fan of Disney, and I have read many of the original works that Disney built off of – I like them better. You can find most of them at Gutenberg.org.

In other news, OpenAI has admitted that it can’t train its AIs without copyrighted works.

Arguably, if there were more content in the public domain, OpenAI could train its AIs on material that is in the public domain. Then there’s the Creative Commons licensed content that could also be used, but… well, that’s inconvenient.

So on one hand, we have a corporation making sure people don’t overstep with using Mickey of the Public Domain, which has happened, and on the other hand we have a corporation complaining that copyright is too restrictive.

On one hand, we have a corporation defending what it has under copyright (which people think went into the public domain but didn’t – just that one version of Mickey), and on the other hand we have a corporation defending its wanton misuse of copyrighted materials.

Clearly something is not right with how we view copyright or innovation. Navigating that with lawyers seems like a disservice to everyone, but here we are.

The Ongoing Copyright Issue with Generative AI.

It’s a strange time. OpenAI (and Microsoft) are being sued by the New York Times and they’re claiming ‘Fair Use’ as if they’re having some coffee and discussing what they read in the New York Times, or are about to write a blog post about the entire published NYT archives, on demand.

It’s not just the New York Times, either. More and more authors are doing the same, or started before NYT.

IEEE’s article, “Generative AI has a Visual Plagiarism problem” demonstrates issues that back up the copyright claims. This is not regurgitation, this is not fair use – there is line by line text from the New York Times, amongst other things.

As I noted yesterday, OpenAI is making deals for content now, and I only caught this morning that, ‘The New York Times, too, had been in conversations with OpenAI to establish a “high-value” partnership involving “real-time display” of its brand in ChatGPT, OpenAI’s AI-powered chatbot.‘

Clearly, discussions didn’t work out. I was going to link the New York Times article on it, but it seems I’ve used up my freebies so I can’t actually read it right now unless I subscribe.1 At this end of things, as a simple human being, I’m subject to paywalls for content, but OpenAI hasn’t been. If I can’t read and cite an article from the New York Times for free, why should they be able to?

On the other hand, when I get content that originated from news sites like the New York Times, there is fair use happening. People transform what they have read and regurgitate it, some more intelligibly than others, much like an artificial intelligence, but there is at least etiquette – linking the source, at the very least. This is not something OpenAI does. It doesn’t give credit. It just inhales large amounts of text, and the algorithms decide on the best ways to spit it back out to answer prompts. Like blogging, only faster, and like blogging, sometimes it just makes stuff up.

This is not unlike a well read person doing the same. Ideas, thoughts, even memes are experiences we draw upon. What makes these generative artificial intelligences different? Speed. They also consume a lot more water, apparently.

The line has to be drawn somewhere, and since OpenAI isn’t living up to the first part of its name and is not being transparent, people are left poking a black box to see if their copyrighted work has been sucked in without permission, mention, or recompense.

That does seem a bit like unfair use. This is not to say that the copyright system couldn’t use an overhaul, but apparently so could how generative AIs get their content.

What does that ‘Open’ in OpenAI mean, anyway?

  1. They do seem to have a good deal right now, I did try to subscribe but it failed for some obscure reason. I’ll try again later. $10 for a year of the New York Times is a definite deal, if only they could process my payment this morning. ↩︎

How Much AI In Journalism?

Recently, I’ve been active in a group on Facebook that is supposed to be a polite space to debate things. News articles fly around, and the news articles we see these days from different sources carry their own biases because, rather than just presenting facts, they present narratives – and narratives require framing.

I wondered how much of these articles were generated by what we’re calling artificial intelligence these days. In researching, I can tell you I’m still wondering – but I have found some things that are of interest.

The New York Times Lawsuit.

It’s only fair to get this out of the way since it’s short and sweet.

Of course, in the news now is the lawsuit that the New York Times has brought against Microsoft and OpenAI, where speculation runs rampant either way. To their credit, through that link, the New York Times presented things in as unbiased a way as possible. Everyone’s talking about that, but speculation on it only really impacts investors and share prices. It doesn’t help people like me as much, who write their own content as individuals.

In an odd twist though, and not too long after the announcement, OpenAI is offering to pay for licensing news articles (paywalled), which you can read more about here if you’re not paying TheInformation1.

Either way, that lawsuit is likely not going to help my content stay out of a learning model because I just don’t have the lawyers. Speculating on it doesn’t get me anywhere.

How Much AI is being used?

Statista cites only one statistic on the amount of artificial intelligence used in media and entertainment: 78 percent of U.S. adults think news articles created by AI are not a good thing.

The articles there go on and tell us about the present challenges, etc, but one word should stand out from that: foggy.

So how would it be used, if it is used? With nearly 50 ‘news’ websites reportedly generated by AI as of May last year, almost a year ago, and with one news outlet even going so far as to have an AI news anchor as of late last year, we should have questions.

Well, we don’t really know how many news agencies are using artificial intelligence, or how. One would think disclosure would be the answer, then.

The arguments against disclosure are pretty much summed up below (an extract from a larger, well-balanced article).

Against disclosure

One concern is that it could stifle innovation. If news organisations are required to disclose every time they use AI, they may be less likely to experiment with the technology.

Another is that disclosure could be confusing for consumers. Not everyone understands how AI works. Some people may be suspicious of AI-generated content. Requiring disclosure could make it more difficult for consumers to get the information they need.

"Should the media tell you when they use AI to report the news? What consumers should know", Jo Adetunji, Editor, TheConversationUK, TheConversation.com, November 14, 2023.

On the surface, the arguments make sense.

In my opinion, placing innovation over trust – which is what that argument actually does – is abhorrent. To innovate, one needs trust, and if you want trust, it seems to me that it has to be earned. Given the present state of news outlets, in their many shades of truth and bias, that might seem completely alien to some.

I do encourage people to read that entire article because the framing of it here doesn’t do the article justice and I’ve simply expressed an opinion on one side of the arguments presented.

How Is AI presently used?

Again, we really don’t know because of the disclosure issue, but in October of last year Twipe published 10 ways that journalists are using artificial intelligence. It points from the outset to Klara Indernach, a tag used by Express.de to note when an article is created with artificial intelligence tools.

Arist von Harpe is cited in the article as saying, “We do not highlight AI-aided articles. We’re only using [AI] as a tool. As with any tool, it’s always the person using it who is responsible for what comes out.” This seems a reasonable position, and it puts the accountability on the humans involved. I have yet to see an artificial intelligence be thrown under the bus for an incorrect article, so we have that landmark to look for.

The rest of that article is pretty interesting and mentions fact-checking – which is peculiar given the prevalence of hallucinations and even strategic deception – as well as image generation, etc.

We’ll never really know.

In the end, I imagine the use of artificial intelligence in newsrooms is evolving even as I write this and will keep evolving well beyond when you read this. In a few years, it may not be as much of a big deal, but right now we’re finding failures of artificial intelligence that go all the way to court, in a matter that is simply fraught with political consequences. They were quick to throw Google Bard under the bus on that one.

It is still disturbing we don’t have much insight into the learning models being used, which is a consistent problem. The lawsuit of the New York Times seems to be somewhat helpful there.

I honestly tried to find out what I could here and in doing so came up with my own conclusion that wasn’t what I would have expected it to be.

In the end, it is as Arist von Harpe said. We have to judge based on the stories we get, because every newsroom will do things differently. It would have helped if we’d had less room to speculate about biases even before the creation of these artificial intelligence tools, and whoever screws up should lose some trust. In this day and age, though, feeding cognitive biases seems to trump trust.

That’s probably the discussion we should have had some time ago.

  1. These paywalls are super-annoying for us mere mortals who do not have the deep pockets of corporate America. How many subscriptions is a well-informed person supposed to have? It’s gotten ridiculous. We’ve known that business models for news have been in such trouble that a ‘news story’ has a more literal definition these days, but… surely we can do better than this? ↩︎

The Quiet Misery of Content Moderators: Sama.

When I first read, two weeks ago, about content moderators speaking of psychological trauma from moderating Big Tech’s content for training models, I waited for the other shoe to drop. Instead, aside from a BBC mention related to Facebook, the whole thing seems to have dropped off the media’s radar.

The images pop up in Mophat Okinyi’s mind when he’s alone, or when he’s about to sleep.

Okinyi, a former content moderator for Open AI’s ChatGPT in Nairobi, Kenya, is one of four people in that role who have filed a petition to the Kenyan government calling for an investigation into what they describe as exploitative conditions for contractors reviewing the content that powers artificial intelligence programs.

“It has really damaged my mental health,” said Okinyi.

The 27-year-old said he would view up to 700 text passages a day, many depicting graphic sexual violence. He recalls he started avoiding people after having read texts about rapists and found himself projecting paranoid narratives on to people around him. Then last year, his wife told him he was a changed man, and left. She was pregnant at the time. “I lost my family,” he said.

"‘It’s destroyed me completely’: Kenyan moderators decry toll of training of AI models", Niamh Rowe, The Guardian, August 2nd, 2023.

I expected more on this because it’s… well, it’s terrible to consider, especially at between $1.46 and $3.74 an hour through Sama. Sama is a data annotation services company headquartered in California that employs content moderators around the world. As their homepage says, “25% of Fortune 50 companies trust Sama to help them deliver industry-leading ML models”.

Thus, this should be a bigger story, I think, but since it’s happening outside of the United States and Europe, it probably doesn’t score big with the larger media houses. The BBC differs a little in that regard.

A firm which was contracted to moderate Facebook posts in East Africa has said with hindsight it should not have taken on the job.

Former Kenya-based employees of Sama – an outsourcing company – have said they were traumatised by exposure to graphic posts.

Some are now taking legal cases against the firm through the Kenyan courts.

Chief executive Wendy Gonzalez said Sama would no longer take work involving moderating harmful content.

"Firm regrets taking Facebook moderation work", Chris Vallance, BBC News, August 15th, 2023.

The CEO of Sama says that they won’t be taking further work related to harmful content. The question then becomes whether something is harmful content or not, so there’s no doubt in my mind that Sama is in a difficult position itself. She points out that Sama has ‘lifted 65,000 people out of poverty’.

Of course, global poverty is decreasing while economic disparity is increasing – something that keeps being forgotten and says much about how the measurement of global poverty is paralyzed while the rest of the world moves on.

The BBC article also touches on the OpenAI issue covered in The Guardian article quoted above.

We have global poverty, economic disparity, big tech and the dirty underbelly of AI training models and social media moderation…

This is something we should all be following up on, I think. It seems like ‘lifting people out of global poverty’ is big business, in its own way, too, and that is just a little bit disturbing.

Lawsuit Regarding ChatGPT

Anonymous individuals are claiming that ChatGPT stole ‘vast amounts of data’ in what they hope will become a class action lawsuit. It’s a nebulous claim about the nebulous data that OpenAI has used to train ChatGPT.

…“Despite established protocols for the purchase and use of personal information, Defendants took a different approach: theft,” they allege. The company’s popular chatbot program ChatGPT and other products are trained on private information taken from what the plaintiffs described as hundreds of millions of internet users, including children, without their permission.

Microsoft Corp., which plans to invest a reported $13 billion in OpenAI, was also named as a defendant…”

"Creator of buzzy ChatGPT is sued for vacuuming up ‘vast amounts’ of private data to win the ‘A.I. arms race’", Fortune.com, Teresa Xie, Isaiah Poritz and Bloomberg, June 28th, 2023.

I’ve had suspicions myself about where their training data came from, but with no insight into the training model, how is anyone to know? That’s what makes this case interesting.

“…Misappropriating personal data on a vast scale to win an “AI arms race,” OpenAI illegally accesses private information from individuals’ interactions with its products and from applications that have integrated ChatGPT, the plaintiffs claim. Such integrations allow the company to gather image and location data from Snapchat, music preferences on Spotify, financial information from Stripe and private conversations on Slack and Microsoft Teams, according to the suit.

Chasing profits, OpenAI abandoned its original principle of advancing artificial intelligence “in the way that is most likely to benefit humanity as a whole,” the plaintiffs allege. The suit puts ChatGPT’s expected revenue for 2023 at $200 million…”

ibid (same article quoted above).

This would run contrary to what Sam Altman, CEO of OpenAI, put in writing before US Congress.

“…Our models are trained on a broad range of data that includes publicly available content, licensed content, and content generated by human reviewers.3 Creating these models requires not just advanced algorithmic design and significant amounts of training data, but also substantial computing infrastructure to train models and then operate them for millions of users…”

[Reference: 3 “Our Approach to AI Safety.” OpenAI, 5 Apr. 2023, https://openai.com/blog/our-approach-to-ai-safety.]

"Written Testimony of Sam Altman, Chief Executive Officer, OpenAI, Before the U.S. Senate Committee on the Judiciary Subcommittee on Privacy, Technology, & the Law", Senate.Gov, Sam Altman, CEO of OpenAI, 5-16-2023.

I would love to know who the anonymous plaintiffs are, and would love to know how they got enough information to make the allegations. I suppose we’ll find out more as this progresses.

I, for one, am curious where they got this training data from.