Is Output of ChatGPT Text a Derived Work?

One of the things that has bothered me most about ChatGPT is that its data was scraped from the Internet, where a fair amount of writing I have done resides. It would be hubris to think that what I wrote is so awesome that ChatGPT could be ‘stealing’ from me, but it would also be idiotic to think that content ChatGPT produces isn’t derivative in a legal sense. In a world almost critically defined by self-preservation, I think we all should know where the line is. We don’t, really, but we should.

I’m no lawyer, but I’ve had my own ‘fun’ with copyright.

In fact, New Tech Observations from the UK (ntouk) seems to have caught ChatGPT lifting the plot of Alice in Wonderland without any attribution.  There are legal issues here that seem to have been ignored in most of the hype, where even reusing content from ChatGPT could be seen as contributing to the infringement.
That hasn’t really stopped anyone since most people don’t seem to take copyright seriously unless they work for an organization that takes copyright seriously, and even when they do take copyright seriously, it’s only within specific contexts. This is why I point out where I have used a large language model such as ChatGPT for anything, since I’m citing it citing nobody – and even then, I don’t use it for generating content other than some interesting images.

Entities with deep pockets are protected by their deep pockets, but the average person writing on the Internet has shallower pockets – and there are more of us. I’ve had content ‘borrowed’ without attribution. It can range from mildly amusing to outrageous, particularly when some schmuck just borrowed to create a popular post without citation so that they could ‘produce’ content that they didn’t actually produce. And Copyright is implicit.

Privacy is a partner to Copyright as well. I’m wondering when the question will be raised about text scraped for these training models by some publishers that deal mainly with text rather than images – because the image lawsuits are happening.

For now, I suppose, don’t put anything online that you wouldn’t want anyone regurgitating without attribution.

False Confidence

When I’m out and about, I make connections with people I see regularly. Recently, I encountered a security guard I know who works regularly at a local mall, and he and I paused to chat. He generally has a ‘new and improved way’ to get out of his job, where he makes about the equivalent of $2 US an hour.

His new plan is cryptocurrency.

Internally, I boggled. Why would he be interested in cryptocurrency? I realize that he likely missed John Oliver’s recent factual mocking of cryptocurrencies, which is worth watching if you missed it.

I haven’t written about cryptocurrencies before, largely because I tend to like writing about things with intrinsic value and cryptocurrencies don’t have any. Yet people believe in them fervently at times. The technology is beyond the comprehension of the average person, yet the hype surrounding it, like the hype around NFTs (non-fungible tokens), is really a bubble of artificial value.

I’m no financial expert. I’m not going to tell you how to get rich, much less get rich fast, though selling books on the topic seems like a niche market for people looking for a way ‘upward’. The financial value of something, regardless of what a ‘market price’ is, is really what someone else is willing to pay for it. Market values go up and down, but not everyone wants what people have. Ask anyone trying to sell a house, a car, or trying to get their adult children married off so they’ll get out of the house.

Why do things like cryptocurrencies make such inroads instead of being strangled at birth? I have a few ideas.

First, the present banking system treats you like a customer when in fact the bank is the customer most of the time. We stick money in, and they have the audacity to give us interest rates that are lower than the cost of living increases while charging us for holding the money we have effectively loaned them. They in turn loan out our money to other people for sometimes horrific interest rates while making lots of money for themselves and sticking you with less than 4% interest. These are not our friends.

Next, people generally want to claw their way up from where they are for various reasons: social status, living more comfortably, or having that shiny red car that they think they can’t live without. Given that the financial sector is less than pleasant in these regards, and that employment raises are generally lower than the cost of living increases, people look for quick and dirty ways ahead, which include cryptocurrencies but also crime and corruption.

Of course, many banks have said ‘cryptocurrency bad’ in various ways, but since people don’t trust banks as much as banks like to think they do, that has little effect. Governments? Same thing.

So we have people who basically throw money into casinos and lotteries in the hope that they will magically get ahead when no one else does. It is, as John Oliver pointed out, a confidence game.

Cryptocurrencies have demonstrated time and again that they’re not good gambles, much less investments. People sometimes argue that there needs to be regulation, but the banking industry’s regulation certainly hasn’t helped anyone either.

In a world that costs, where our individual efforts seem to have less and less value while corporations make more and more, people will always look at these things as ways to get ahead. A few lucky people might do well at the cost of many more that will lose.

Cryptocurrencies should be highlighting changes we need to deal with globally rather than get security guards speaking confidently about how they know how to move ahead using crypto…

The Complicated Publishing Issue.

Most of the people around me are completely unaware of the Internet Archive being successfully sued over sharing electronic books by some publishers:

…In July 2020, immediately after the Covid lockdown, four publishers – Hachette, HarperCollins, Wiley and Penguin Random House – decided to bring a major lawsuit against the Internet Archive, claiming it had ‘infringed their copyright’, potentially cost their companies millions of dollars and was a threat to their businesses. Last month the New York court found – predictably – in the publishers’ favour, rejecting the IA’s defence of ‘fair use’, and ruling that ‘although IA has the right to lend print books it lawfully acquired, it does not have the right to scan those books and lend the digital copies en masse.’…
The article goes on quite a bit exploring it in what seems to me to be a fairly comprehensive and balanced way. Even so, I looked around through the news about it and found a few other things.

There’s the Authors Guild’s celebration of the success. That seems a bit more damning because the authors aren’t the publishers, and they raise some valid points.

The Internet Archive’s own post on the matter brought up the public good:

Today’s lower court decision in Hachette v. Internet Archive is a blow to all libraries and the communities we serve. This decision impacts libraries across the US who rely on controlled digital lending to connect their patrons with books online. It hurts authors by saying that unfair licensing models are the only way their books can be read online. And it holds back access to information in the digital age, harming all readers, everywhere…

Having read all of this, I find that there are good points on either side. As far as the legalities of the specifics of the case, I am not a lawyer and do not pretend to be one on the Internet, so I can’t comment on that. I can say that as someone who reads a lot, even though I have gone back to paper books for the most part, these publishing models seem antiquated and have not allowed much room for the rights of people to access information, be it a romance novel or scientific papers. The big wheels have turned too slowly on this.

I think the best article I read on the topic, the lawsuit regarding fair use, was by Marketplace:

…“The publishers believe that digital lending should essentially be a right that they license to libraries and that every time a library wants to loan something to a reader, the publishers should get paid a licensing fee,” Sinnreich told Marketplace.  

But licensing models can be burdensome for institutions that are largely underfunded. 

Public libraries use different licensing models, but the most common is the two-year license, explained Alan Inouye, leader of the American Library Association’s public policy and advocacy office…

…Librarians have chronicled journal price changes over the years, finding that some titles could cost between about $50 and $220 in the 1980s. Now, those same titles range between about $18,900 and $40,300. 

Inouye said he thinks both libraries and individuals have fewer rights in our digital environment…

There was a time, which present generations may not remember, when we lent friends books that we had. Given it was one physical copy, we could only share it with one person at a time, and the same was true of libraries. If a book you wanted to read was checked out, you couldn’t get to it until it was physically returned. If a library had paid for more than one copy, it could lend more than one copy, but physical limitations meant it could lend no more than it owned.

Now, with electronic books, it’s possible to share things a lot more easily, but the intent of publishers is not for the books to be shared. The intent of public libraries is to share information for the public good. The intent of readers varies, but in the broad strokes it’s access to information, sometimes permanently (buying the book) and sometimes temporarily (borrowing the book from a library). The balance of all of this is at issue and has been for some time, and let’s be honest: Publishers have been making their own rules and lobbying their own legislation for some time. You can read about this in Lawrence Lessig’s “Free Culture”, which you can legally download as a PDF from the Library of Congress.

All of this is a centuries-long negotiation between people and those that publish. Oddly, it has little to do with the content creators themselves other than the fact that they are beholden to publishers to publish their works… in an era when self-publishing is possible. In return, they get help producing, marketing and protecting those works.

And now, things are actually becoming more complicated with large language models.

That Other Linguistic Bias…

 It seemed a bit strange to me to write about the bias in English when I have also been aware of the linguistic diversity of the Internet for some time. I didn’t shove that in because I was not up to date on the latest data regarding language and those connecting to the Internet. As luck would have it, I just found it here in the form of a spreadsheet, updated as of this month of this year. 
 
It shows promise. We went from 64% of humans connected to 67% in one year. More languages from the continent of Africa are represented. Information like this reveals an implicit bias that most people are not aware of – the invisible 33%. 
Our framing on the Internet tends to neglect them. We have a tendency to believe that everyone is connected. We’re not.

What’s more, that simple bit of information also demonstrates that training a large language model or an AI that leaves 33% of humanity out should give us pause. It won’t, but it should. 33% of humanity can’t access the Internet. Cultures and languages aren’t represented.
But technology waits for no one because tech companies wait for no one because they need us to keep buying technology.

Silent Bias

Once upon a time, when I was a Navy Corpsman at the former Naval Hospital in Orlando, we lost a patient for a period – we simply couldn’t find her. There was a search of the entire hospital. We eventually did find her but it wasn’t by brute force. It was by recognizing what she had come in for and guessing that she was on LSD. She was in a ladies room, staring into the mirror, studying herself through a sensory filter that she found mesmerizing. What she saw was something only she knows, but it’s safe to say it was a version of herself, distorted in a way only she would be able to explain.

I bring this up because as a species, many of us connected to our artificial nervous system are fiddling with ChatGPT, and what we are seeing are versions of our society in a mirror.

As readers, what we get out of it has a lot to do with what we bring to it. As we query it, we also get out of it what we ask of it through the filters of how it was trained and its algorithms, the reflexes we give it. Is it sentient? Of course not; these are just large language models and are not artificial general intelligences.

With social media companies, we have seen the effect of the social media echo chambers as groups become more and more isolated despite being more and more connected, aggregating to make it easier to sell advertising to. This is not to demonize them, many bloggers were doing it before them, and before bloggers there was the media, and before then as well. It might be amusing if we found out that cave paintings were actually advertising for someone’s spears or some hunting consulting service, or it might be depressing.

All of this cycled through my mind yesterday as I began considering the role of language itself with its inherent bias, based on an article that stretched it to large language models and artificial intelligence. The actual study was just about English and showed a bias toward addition, but with ChatGPT and other large language models being the current advertising tropism, it’s easy to understand the intention of linking the two in an article.

Regardless of intention, there is a point as we stare into the societal mirror of large language models. The training data will vary, languages and cultures vary, and it’s not hard to imagine that every language, and every dialect, has some form of bias. It might be a good guess that where you see a lot of bureaucracy, there is linguistic bias and that can get into a chicken and egg conversation: Did the bias exist before the language, or did the language create the bias? Regardless, it can reinforce it.

Then I came across this humorous meme. It ends up being a legitimate thing that happened. The dog was rewarded with a steak for saving the life of a child from drowning and quickly came to the conclusion that pulling children out of water got it steak.

Apparently not enough children were falling into water for it to get steaks, so it helped things along. It happened in 1908, and Dr. Pavlov was still alive during this. His famous derived work with dogs was published in 1897, about 11 years prior, but given how slowly news traveled then it wasn’t as common knowledge as we who have internet access would expect. It’s possible the New York Times article mentioned him, but I didn’t feel like unlocking their paywall.

If we take this back to society, we have seen the tyranny of fake news propagation. That’s nothing new either. What is interesting is the paywall aspect, where credible news is hidden behind paywalls leaving the majority of the planet to read what is available for free. This is a product of publishing adaptation to the Internet age, which I lived through and which to an extent I gained some insight into when I worked for Linux Journal’s parent company, SSC. The path from print to internet remains a very questionable area because of how advertising differs between the two media.

Are large language models being trained on paywalled information as well? Do they have access to academic papers that are paywalled? What do they have access to?

What parts of ourselves are we seeing through these mirrors? Then we have to ask whether the large language models have access to things that most humans don’t, and based on who is involved, it’s not hard to come to the conclusion that the data being fed to them by these companies isn’t available to the average person. Whether that is true or not is up for debate.

All of this is important to consider as we deal with these large language models, yet the average person plays with them as a novelty, unaware of the biases. How much should we trust what comes out of them?

As far as disruptive technologies go, this is probably the largest we have seen since the Internet itself. As long as it gives people what they want, and it supports cognitive biases, it’s less likely to be questioned. Completely false articles propagate on the Internet still, there are groups of people who seriously believe that the Earth is flat, and we have people asking ChatGPT things that they believe are important. I even saw someone in a Facebook reel quoting a GPT-4 answer.

We should at the least be concerned, but overall we aren’t. We’re too busy dealing with other things, chasing red dots.

Media Tourists

Every morning I set aside my morning coffee to travel the world through my mobile phone as if it were a spaceship, hermetically sealed. I peer through the window as I travel from point to point I find of interest that morning. One moment I’m checking on my friends in Ukraine unobtrusively, keenly aware that there’s more spin than the media frenzy of a hurricane hitting the United States.

The next moment, I’ll visit friends and family throughout the world as I read their Facebook posts. Then I’ll look at what the talking heads of tech think is important enough to hype, then I’ll deep dive into a few of them to discreetly consider most of it nonsense. I am a weathered traveler of space and time, and the Internet my generation extended around the world has not been wasted on me. I am a tourist of tourists, as are we all. It’s what we humans do.

When people take those all inclusive vacations to resorts, they get to see a sanitized version of the country they are visiting that doesn’t reflect the reality of that country. You’ll hear people talking about when they visited this place or that, and how wonderful this or that was – you’ll rarely hear what was wrong with the country because… well, that would be bad for tourism, and tourism is about selling a dream to people who want to dream, much like politics, but with a more direct impact on revenue that politicians can waste.

Media, and by extension, social media, are much the same. We see what’s ‘in the frame’ of the snapshot we are given, and that framing makes us draw conclusions about a place. A culture. A religion, or not. An event. A person.
Some of us believe that we’re seeing everything clearly, as in the image at the top of this post. You can look at any point in the picture and see detail, but that’s not how we really see it, and therefore, in our mind, it’s not the way it is. What we see is subject to the ‘red dots’ I wrote of, things looking for our attention directed consciously by someone else (marketing/advertising) and subconsciously by our own biases.

The reality of our experiences is usually more like something to the right. Our focus is drawn by red dots and biases, and in the periphery other things are there, poorly defined. This example is purely visual. And because we generally like what we see, there’s generally a positive emotion with what we see that reinforces wanting to see it again.

This is not new, and it can be good and bad. These days an argument could be made that the red dots of society have run amok.

A group of really smart people with really good intentions created a system that connects human experiences across the planet in a way that is significantly faster than before. Some of our ancestors could not send a message around the world within their lifetime, and here we are presently discussing milliseconds to nanoseconds as if we even would notice a millisecond passing ourselves. Our framing was simpler before; we didn’t have nearly as significant a communicating global network back then. Technologies that spread things faster range from the wheel to sailing to flight to the Internet, in broad strokes. As Douglas Adams would write, “Startling advances in twig technology…”

However we got here, here we are.

If one group has a blue focus, another purple, another yellow, we get overlaps in framing and the immediate effect has been for everyone to go off in their corners and discuss all that is blue, purple and yellow respectively.

An optimist might say that eventually, the groups will recognize the overlaps in the framing and maybe do a bit better at communicating, but it doesn’t seem like we’re very near that yet. A pessimist will say that it will never happen, and the realist will have to decide the way forward.

I’m of the opinion that it’s our duty as humans to work toward increasing the size of our frames as much as possible so that we have a better understanding of what’s going on within our frame. I don’t know that I’m right, and I don’t know that I’m wrong. If I cited history, the victories would be few that way – there’s always some domination that seems to happen. Personally, I don’t see any really dominant perspective, just a bunch of polarized frames throwing rocks at each other from a distance.

We’ll get so wrapped up in things that we forget sometimes that there’s room for more than one perspective, as difficult as it may be for people to understand. We’ll forget our small knowledge of someone else’s frame does not define their frame, but defines our frame. We forget that we’re just tourists of frames, we visit as long as we wish but do not actually live in a different frame.

Sounds messy? You bet. And all of that mess is being used to train large language models.  Could it homogenize things? Should it? I am fairly certain we’re not ready for that conversation, but like talking about puberty and sex with a teenager… we do seem a bit late on the subject.

I’m just a cybertourist visiting your prison, as you visit mine. Please don’t look under the carpet.

Captcha THAT.

When I first started programming, I did a lot of walking. A few months ago I checked the distance I walked every day just back and forth to school and it was about 3.5 km, not counting being sent to the store, or running errands. At the same time, we had this IBM System/36 and a PC Network at school where space was limited, time was limited, and you didn’t have much time to be productive on the computer so you’d better have it locked down.

At that point, the language was BASIC. The popularity of object oriented programming had not blessed (and cursed) us yet, so we had line numbers on each line, handy for debugging because the most basic errors would tell you where you had a typo. There was an hour every few days to type assignments in so that you could get a grade, or maybe even do something of worth and understand what you were doing.

During that period, can you guess where I did most of my programming? When I was walking around seemingly aimlessly in parking lots, or staring at trees, or anything but staring at a computer monitor. Computers were not plentiful, the time on them was limited, you didn’t have time to screw around on a keyboard.

I have survived decades of programming since then. I still fiddle now and then, but after being beaten to market by Google on getting stuff out (“Set your sights high!”, they tell you…) I’m a bit tired of chasing those particular red dots. My absence from my desk was almost never found tolerable by at least someone who thought what they thought mattered more than results, but I got results. If you saw me typing frantically away at a keyboard, it wasn’t a spur of the moment thing. There was thought that went into crafting that code, there was planning and bullet proofing, to the point where as I became more senior I spent less time at the keyboard than many people in departments I worked in.

I mention all of this because software engineering has changed over the years. In my day, when we were learning, we were not given answers from websites like Stack Overflow; we didn’t even have websites. If we were lucky we had the manual for the language, we had plausible typing skills and we had limited time on the machines.

This isn’t ‘walk uphill both ways’, this is, “We did this without all these cool toys you have now”. It’s not an issue of we had it harder, it’s a matter of we did it differently. We didn’t have editors that were forgiving, much less helpful. Within such a short window technology for programming has come a very long way, and it’s kind of cool – except all the silly Python editors and tools apparently written by the children of people who thought that “The Lord of the Rings” book trilogy was evil.

From the 1980s to now, it’s been a real whirlwind with way too much hype on way too many things that nobody recalls immediately. Then the captcha came along, to make sure ‘bots’ weren’t trying to do things, to check if a real human being was involved.

So humanity doubled down on that with large language models like ChatGPT. I guess kids stopped walking to school, they got more computers, and now they don’t even have to do their own homework.

I’m not sure where this is heading, but I’ll be making popcorn.

The Societal Mirror.

The article, “Introducing the AI Mirror Test, which very smart people keep failing”, hits some pretty good points on these large language models that we are lazily calling Artificial Intelligence. One of my favorites is this:

…What is important to remember is that chatbots are autocomplete tools. They’re systems trained on huge datasets of human text scraped from the web: on personal blogs, sci-fi short stories, forum discussions, movie reviews, social media diatribes, forgotten poems, antiquated textbooks, endless song lyrics, manifestos, journals, and more besides. These machines analyze this inventive, entertaining, motley aggregate and then try to recreate it. They are undeniably good at it and getting better, but mimicking speech does not make a computer sentient…

As I pointed out in a post on ChatGPT and large language models, such as ‘A Chat With GPT on AI’, I recognized that it was meeting my cognitive bias. In that regard, I recognized some of myself in what I was getting back, not too different from when I was playing with Eliza in the 1980s, with the only difference being that the bot has gotten better because it has access to more information than what the user types in. We were young, we dreamed, but tech wasn’t ready yet.

Of course it’s a mirror of ourselves in that regard – but the author didn’t take it to the next step. As individuals we should be seeing ourselves in the output, but we should also understand that it’s global society’s mirror as well, and all the relative good and relative bad that comes with it. We have biases in content based on language, on culture, on religion, and on much more. I imagine the Amish don’t care, but still they are part of humanity and we have a blind spot there, I’m certain, never mind all the history that our society has erased and continues to erase, or has simply ignored.

Personally, I find it a great way to poll the known stores of humanity on what its biases believe, no matter how disturbing the results can be. And yet, we’re already likely diluting our own thoughts reflected back at us as marketers and bloggers (not mutually exclusive) churn content out of Large Language Models that they will eventually train on. That’s not something I’m comfortable with, and as usual, my problem isn’t so much technology as society, a rare thing for me to say when so much technology is poorly designed. Am I ‘victim shaming’?

When the victim is the one abusing themself, can it be victim shaming?

Our own echo chambers are rather shameless.