Why I Installed AIs (LLMs) On My Local Systems.

The last few days I’ve been doing some actual experimentation, initially begun because of Daniel Miessler’s Fabric, an open source framework for using artificial intelligence to augment us lowly humans instead of the self-lauding tech bros whose business model boils down to “move fast and break things”.

It’s hard to trust people with that sort of business model when you understand your life is potentially one of those things, and you like that particular thing.

I have generative AIs on all of my machines at home now, which was not as difficult as people might think. To impress upon someone how easy it was, I walked them through doing it in minutes, over the phone, on a Windows machine. I’ll write that up as my next post, since it apparently seems difficult to people.

For myself, the vision Daniel Miessler brought with his implementation, Fabric, is inspiring in its own way, though I’m not convinced that AI can make anyone a better human. I think the idea of augmenting is good, and with all the infoglut I contend with, leaning on an LLM makes sense in a world where everyone else is being sold on the idea of using one, and on how to use it.

People who wax poetic about how an AI has changed their lives in good ways are simply waxy poets, as far as I can tell.

For me, with writing and other things I do, there can be value here and there – but I want control. I also don’t want to risk my own ideas and thoughts by uploading even a hint of them to someone else’s system. As a software engineer, I have seen the loads of data users hand over to companies, I know what can be done with it, and I have seen how flexible ethics can become where share prices are concerned.

Why Installing Your Own LLM is a Good Idea. (Pros)

There are various reasons why, if you’re going to use an LLM, it’s a good idea to run it locally.

(1) Data Privacy and Security: If you’re an individual or a business, you should look after your data and security because nobody else really does, and some profit from your data and lack of security.

(2) Control and Customization: You can fine-tune an LLM on your own data (without compromising your privacy and security). As an example, I can feed an LLM various things I’ve written and have it summarize where the ideas connect – and even tell me whether I’ve published something where my opinion has since changed – without worrying about handing all of that information to someone else. I can tailor it myself, and that isn’t as hard as you think (see the sketch after this list).

(3) Independence from subscription fees; lowered costs: The large companies will sell you as much as you can buy, and before you know it you’re stuck with subscriptions you don’t use. And since the technology market is full of companies that get bought out and license agreements that change, running locally also avoids vendor lock-in.

(4) Operating offline; possibly improved performance: With the LLM I’m working with, losing internet access during an outage does not stop me from using it. What’s more, my prompts aren’t queued or prioritized behind someone who pays more.

(5) Quick changes are quick changes: You can iterate faster – try something with your model, and if it doesn’t work, you find out immediately. That’s both convenience and cost-cutting.

(6) Integrate with other tools and systems: You can integrate your LLM with other stuff – as I intend to with Fabric.

(7) You’re not tied to one model. You can use different models with the same installation – and yes, there are lots of models.
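To make point (2) concrete, here’s a minimal sketch. It stuffs your writing into the prompt context rather than actually fine-tuning weights – the lightweight way to start – and it assumes the Ollama runner serving a model on localhost; the folder name and model name are illustrative, not prescriptive.

```python
# A minimal sketch for point (2): ask a local model where the ideas
# in your own writing connect. Assumes Ollama (ollama.com) is running
# locally with a pulled model; "my_posts" and "llama3" are examples.
from pathlib import Path
import requests

posts = "\n\n---\n\n".join(
    p.read_text(encoding="utf-8") for p in Path("my_posts").glob("*.txt")
)

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={
        "model": "llama3",
        "prompt": (
            "Here are several things I've written:\n\n" + posts +
            "\n\nSummarize where the ideas connect, and note anywhere "
            "my opinion appears to have changed."
        ),
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["response"])  # nothing here ever leaves the machine
```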

The Cons of Using an LLM Locally.

(1) You don’t get to hear someone who sounds like Scarlett Johansson tell you about the picture you uploaded.¹

(2) You’re responsible for the processing, memory and storage requirements of your LLM. This is surprisingly not as bad as you would think, but remember – backup, backup, backup.

(3) If you plan to deploy an LLM as part of a business model, it can get very complicated very quickly. I don’t know all the details, but that’s nowhere in my long-term plans.

Deciding.

In my next post, I’ll write up how to easily install an LLM. I have one on my M1 Mac Mini, my Linux desktop and my Windows laptop. It’s amazingly easy, though going in it can seem very complicated – the sketch below gives a taste of how little is involved once a runner is installed.
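Here’s that taste – a minimal sketch, assuming the Ollama runner (one popular option, not necessarily the one my next post will cover) is installed and a model has been pulled with something like `ollama pull llama3`. Any local runner with an HTTP API works much the same way.

```python
# A minimal sketch: one HTTP call to a locally running model.
# Assumes Ollama is serving on its default port; "llama3" is
# whichever model you pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={
        "model": "llama3",
        "prompt": "In one paragraph: why run an LLM locally?",
        "stream": False,  # return one JSON blob instead of a token stream
    },
    timeout=300,
)
print(resp.json()["response"])
```

That’s the whole loop: no account, no API key, and the prompt never leaves your machine.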

What I would suggest about deciding is simply to try it and see how it works for you – or simply to know that it’s possible, and that it will only get easier.

Oh, that quote by Diogenes at the top? No one seems to have a source. Nice thought, though a possible human hallucination.

  1. OK, that was a cheap shot, but I had to get it out of my system. ↩︎

Study Claims Human Writers and Artists Pollute More Than AI.

The second I came across the study, “The carbon emissions of writing and illustrating are lower for AI than for humans“, I knew that there had to be flaws in it.

The premise of the study seemed weird from the start: What would be the point of it? Why would someone think to compare the carbon footprints of humans and AI for generating images and text? What burning question were they trying to answer?

Is the argument to be that there should be fewer humans? The way things are going on the planet, that almost seems plausible – people warring and killing people could say, “We’re reducing the carbon footprint of humanity!”, get some carbon credits for it, and feel good about their contributions – except that if protests around the world are any indicator, that may not sell well.

The answer is likely that since people have been pointing out that the carbon footprint of generative AI is high, they want to be able to have a rebuttal. But there are some questions.

To calculate the carbon footprint of a person writing, we consider the per capita emissions of individuals in different countries. For instance, the emission footprint of a US resident is approximately 15 metric tons CO2e per year [22], which translates to roughly 1.7 kg CO2e per hour. Assuming that a person’s emissions while writing are consistent with their overall annual impact, we estimate that the carbon footprint for a US resident producing a page of text (250 words) is approximately 1400 g CO2e. In contrast, a resident of India has an annual impact of 1.9 metric tons [22], equating to around 180 g CO2e per page. In this analysis, we use the US and India as examples of countries with the highest and lowest per capita impact among large countries (over 300 M population).

“The carbon emissions of writing and illustrating are lower for AI than for humans“, Bill Tomlinson, Rebecca W. Black, Donald J. Patterson & Andrew W. Torrance, Scientific Reports, 14 Feb 2024
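Before getting to the real objection, it’s worth unpacking the arithmetic in that paragraph – my own back-of-envelope reconstruction, not anything published in the paper. The quoted numbers only line up if you assume a writing pace of roughly 300 words per hour:

```python
# Back-of-envelope check of the study's quoted figures - my own
# reconstruction, not code or assumptions from the paper itself.
HOURS_PER_YEAR = 365 * 24  # 8760

us_per_hour = 15_000 / HOURS_PER_YEAR   # 15 t CO2e/yr -> ~1.71 kg/h ("1.7 kg")
hours_per_page = 1.4 / us_per_hour      # 1400 g per page -> ~0.82 h per page
print(f"Implied pace: {250 / hours_per_page:.0f} words/hour")  # ~306

# Cross-check India at that same pace:
india_per_hour = 1_900 / HOURS_PER_YEAR  # 1.9 t CO2e/yr
print(f"India, per page: {india_per_hour * hours_per_page * 1000:.0f} g CO2e")
# ~177 g, i.e. the ~180 g the study quotes
```

So the whole human ‘cost’ is just per-capita emissions multiplied by time at the keyboard – which is exactly where it falls apart.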

What they don’t take into account – to the detriment of us lowly human writers – is that the physical act of typing so many words an hour is not all of writing. All of writing – real writing – draws on a lifetime of sensory input and thought up to that point. Words don’t just fall out of humans.

This point is important because it’s also true of generative AI. Generative AI is certainly trained on large datasets, but those datasets have come from… where? The models therefore inherit the human writers’ carbon footprint – and then some, since they used (some would say stole) materials that humans created to feed the training model. Further, every human involved in that process, as well as in the maintenance of the system, adds to the carbon footprint. Then there are the materials in the GPUs, the integration, and so on.

NVIDIA even has a page on the materials that go into GPUs.

So sure, in generating a few thousand words – we presently call that ‘slop’ – to do someone’s homework or help write a monotonous study (they did use ChatGPT-3), the carbon footprint per page might seem lower, but I’d say the true footprint is actually higher than the average human’s.

Because we humans, in having our average carbon footprint, do other things that raise it: we drive to work, we use electricity to power devices pitched at us to increase our productivity, we cook meals, and so on. All of that – all of that – gets added into the writer’s footprint as if it had no value of its own.

Before generative AI came around, nobody pointed at writers and said, “Those people just have this carbon footprint and they don’t do anything. We should create a generative AI that does it.” In fact, nobody actually asked for any of this. Then work written by writers gets sucked into a learning model and used to generate more slop – of questionable quality and dubious value, spammed across the Internet. And I apologize to real Spam, which has more nutritional value and taste.

AI art is much the same, I imagine, but I can’t really draw to save my life and have had the good fortune not to have to. I wrote something about using AI art in blogs that explains my usage, but I would never tell my visual art friends that AI has a lower carbon footprint.

The whole study reads as if it were funded by some company that wanted a rebuttal on carbon footprints. It is, at best, very limited in how it views the carbon footprints of both us lowly humans and our esteemed ‘colleagues’, the generative AIs. At worst, it’s marketing propaganda meant to counter the people who point out that, on top of the human carbon footprint, generative AI adds significantly more.

Unless, of course, this is a study to demonstrate that we need fewer people and we should do something about it – which some governments are doing right now, unfortunately.

Copyright, AI, And Doing It Ethically.

It’s no secret that the generative, sequacious artificial intelligences out there have copyright issues. I’ve written about it myself quite a bit.

It’s almost become cliché to mention copyright and AI in the same sentence, with Sam Altman having said there would be no way to do generative AI without all that material – yet toward the end of this post, you’ll see that someone proved that wrong.

“Copyright Wars pt. 2: AI vs the Public“, by Toni Aittoniemi in January of 2023, is a really good read on the problem, as the large AI companies have sucked in content without permission. If an individual did it, the large companies would call it ‘piracy’ – but when they do it, it’s… not? That’s crazy.

The timing of finding Toni on Mastodon was perfect. Yesterday, I found a story on Wired that demonstrates some of what Toni wrote last year, where he posed a potential way to handle the legal dilemmas surrounding creators’ rights – we call it ‘copyright’ because someone was pretty unimaginative and pushed two words together for only one meaning.

In 2023, OpenAI told the UK parliament that it was “impossible” to train leading AI models without using copyrighted materials. It’s a popular stance in the AI world, where OpenAI and other leading players have used materials slurped up online to train the models powering chatbots and image generators, triggering a wave of lawsuits alleging copyright infringement.

Two announcements Wednesday offer evidence that large language models can in fact be trained without the permissionless use of copyrighted materials.

A group of researchers backed by the French government have released what is thought to be the largest AI training dataset composed entirely of text that is in the public domain. And the nonprofit Fairly Trained announced that it has awarded its first certification for a large language model built without copyright infringement, showing that technology like that behind ChatGPT can be built in a different way to the AI industry’s contentious norm.

“There’s no fundamental reason why someone couldn’t train an LLM fairly,” says Ed Newton-Rex, CEO of Fairly Trained. He founded the nonprofit in January 2024 after quitting his executive role at image-generation startup Stability AI because he disagreed with its policy of scraping content without permission….

“Here’s Proof You Can Train an AI Model Without Slurping Copyrighted Content“, Kate Knibbs, Wired.com, March 20th, 2024

It struck me yesterday that a lot of us writing and communicating about the copyright issue haven’t addressed how it could be handled. It’s not that we don’t know it could be handled; we just haven’t addressed it as much as we should. I went to sleep considering that, and in the morning found that Toni had already done much of the legwork.

What Toni wrote extends on such a system:

…Any training database used to create any commercial AI model should be legally bound to contain an identity that can be linked to a real-world person if so required. This should extend to databases already used to train existing AI’s that do not yet have capabilities to report their sources. This works in two ways to better allow us to integrate AI in our legal frameworks: Firstly, we allow the judicial system to work it’s way with handling the human side of the equation instead of concentrating on mere technological tidbits. Secondly, a requirement of openness will guarantee researches to identify and question the providers of these technologies on issues of equality, fairness or bias in the training data. Creation of new judicial experts at this field will certainly be required from the public sector…

“Copyright Wars pt. 2: AI vs the Public”, Toni Aittoniemi, Gimulnautti, January 13th, 2023.

This is sort of like – and this is my interpretation – a tokenized citation system built into the model itself. It would expand on what, as an example, Perplexity AI does, by allowing style and ideas to have provenance.
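A minimal sketch of what such a provenance record might look like, per my reading of Toni’s proposal – every name and field here is a hypothetical illustration, not something from his post or any real system:

```python
# Hypothetical provenance records for a training dataset - an
# illustration of the proposal's "identity linked to a real-world
# person" and "report their sources" requirements, nothing more.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SourceRecord:
    source_id: str       # stable identifier for the ingested work
    rights_holder: str   # real-world person or entity, as the proposal requires
    license: str         # terms under which the work entered the dataset
    content_hash: str    # fingerprint of the ingested text

@dataclass
class TrainingManifest:
    records: list[SourceRecord] = field(default_factory=list)

    def attribution(self, content_hash: str) -> Optional[SourceRecord]:
        # Given a fingerprint of some text, report the human-linked
        # source - the openness requirement, made queryable.
        return next(
            (r for r in self.records if r.content_hash == content_hash),
            None,
        )
```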

This is some great food for thought for the weekend.

A Quick Look At Perplexity.AI

The Perplexity.AI logo, from their press pack.

In fiddling about on Mastodon, I came across a post linking to an article on Forbes: “‘Like Wikipedia And ChatGPT Had A Kid’: Inside The Buzzy AI Startup Coming For Google’s Lunch“.

Well, that deserved a look if only because search engine results across the board give spammy first pages, Google inclusive, and Wikipedia is a resource that I like because of one main thing: citations.

ChatGPT is… well, it’s interesting, but it’s… limited because you have to double check everything from your prompt to the results. So the idea of mixing the two is definitely attractive.

Thus, I ended up at Perplexity.ai and did some searches – some tricky ones related to me that I know other search engines often get wrong. Perplexity stumbled in the results, but it cited the sources that pushed it the wrong way as well as the sources that pushed it the right way.

That’s perfect for me right now. It gives the citations above the response, so you know where stuff is coming from. You can then omit citations that are wrong while drilling down into what it is you’re supposed to be looking into. For me, with the amount of research I do, this saves me a whole lot of tabs in my web browser and therefore allows me a mental health bonus in keeping track of what I’m writing about.

Of course, when I find something useful like this, I put it under bright lights and interrogate it, because on impulse I almost subscribed immediately. I’ve held off, at least for now, but so far it has me pondering my ChatGPT-4 subscription, since Perplexity is much more of what I need and much less of what I don’t. When I’m researching things to write, I need to be able to drill down without being subject to hallucinations. I need the sources. ChatGPT can do that, and it gives me access to DALL-E, but how many images do I need? How often do I use ChatGPT? Not that much, really.

I’m also displeased with present behemoth search engines, particularly since they collect information. Does Perplexity.ai collect information on users? According to the Perplexity.AI privacy policy, they do not. That’s at least hopeful. In the shifting landscape of user data, it’s hard to say what the future holds with any company. A buyout, change in management or a shift in the wind of a public toilet could cause policies to change, so we constantly have to keep an eye on that, but in the immediate, it is promising.

My other main query was about the Fediverse, which is notoriously not indexed. This is largely because of the nature of the Fediverse. It didn’t have much on that, as I expected.

I’ll be using it anonymously for a while to see how it works for me. If you’re looking for options for researching topics, Perplexity.ai may be worth a look – otherwise I would not have written about it.

WordPress.com, Tumblr to Sell Information For AI Training: What You Can Do.

I accidentally posted this on RealityFragments.com, but I think it’s important enough to leave it there. The audiences vary, but both have other bloggers on them.

While I was figuring out how to be human in 2024, I missed that Tumblr and WordPress posts will reportedly be used for OpenAI and Midjourney training.

This could be a big deal for people who take the trouble to write their own content rather than filling the web with generative AI text just to spam out posts.

If you’re involved with WordPress.org, it doesn’t apply to you.

WordPress.com has an option to use Tumblr as well, so when you post to WordPress.com it automagically posts to Tumblr. You might therefore have to visit both of the posts below and adjust your settings on both services if you don’t want your content used in training models.

This doesn’t mean that they haven’t already sent information to Midjourney and OpenAI. We don’t really know – but from the moment you change your settings, at least you’ve opted out going forward.

  • WordPress.com: Instructions on how to opt out of the AI training are available here.

    It boils down to the setting that prevents third-party data sharing in your blog settings on WordPress.com.

  • With Tumblr.com, you should check out this post. Tumblr is trickier, and the link text is pretty small around the images – what you need to remember is that after you select your blog in the left sidebar, you need to use the ‘Blog Settings’ link in the right sidebar.

Hot Take.

When I was looking into all of this, it turned out that Automattic, the owner of WordPress.com and Tumblr.com, is doing the sale.

If you look at your settings and haven’t changed them yet, you’ll see that the default allows the use of your content for training models. The average person who uses these sites to post their content is likely unaware, and in my opinion, if they wanted to do this the right way, the default would have been opted out.

It’s unclear whether they have already sent posts. I’m sure there’s an army of lawyers who will point out that they did post about it in places and that the onus was on users to stay informed. It’s rare for me to use the word ‘shitty’ on KnowProSE.com, but I think it’s probably the best way to describe how this happened.

It was shitty of them to set it up like this. See? It works.

Now some people may not care. They may not be paying users, or they just don’t care, and that’s fine. Personal data? Well, let’s hope that got scrubbed.

Some of us do care. I don’t know how many, so I can’t say whether it’s a lot or a few. Yet when Automattic, the parent company of both Tumblr and WordPress.com, posts that it cares about user choices, it hardly seems appropriate that the default choice was to opt everyone in.

As a paying user of WordPress.com, I think it’s shitty to assume I would allow what I write, using my own brain, to be used for a training model that the company gets paid for. I don’t see any of that money. To add injury to that insult of my intelligence, Midjourney and OpenAI then sell subscriptions to the trained AI – one of which (ChatGPT) I also pay for.

To make matters worse, we sort of have to take the training models on the word of those who run them. They don’t tell us what’s in them or where the content came from.

This is my opinion. It may not suit your needs, and if it doesn’t, have a pleasant day. But if you agree with it, go ahead and make sure your blog is not allowing third-party data sharing.

Personally, I’m unsurprised at how poorly this has been handled. Just follow some of the links early on in the post and revel in dismay.

The Importance of What We Include and Omit.

We choose what we take with us into the future. It’s not always conscious, it’s not always right, but it’s what we do out of practicality – we have generated so much knowledge as a species that it’s impossible for any one person to know everything. What we have put together over the thousands of years of our existence is staggering to consider.

Societies push toward specialization in this regard, enough so that if you’re a polymath, people simply don’t believe you can deal with multiple specialties. It’s not even that a polymath is ‘smarter’. It’s largely a matter of how time is spent.

Large language models are polymaths, but since they get the ‘AI’ marketing, they get to sidestep that. You can make up ludicrous things, tell some people that ChatGPT said so, and they’ll accept it. ChatGPT and large language models have become the new ‘experts’, which I suppose is to be expected when there is a Cult of [Insert Tech Billionaire Here], where suspended disbelief seems to be as important as in any other cult.

Image published with permission from The Big Insane Happy, all of their rights are reserved.

The difference is the value signaling. People who want to be like tech billionaires will go out of their way to defend even the most profoundly idiotic things, and that’s a bit of a problem.

The Big Insane Happy made the point quite well.

It’s not just tech billionaires. It’s everything, from religion to ideology to brand of cereal.

Jon Stewart recently made the point very well regarding politics. People have a tendency to support their candidates and completely ignore the problems that the candidates have.

We conveniently omit things.

That’s why this quote hit home this morning:

The first step in liquidating a people is to erase its memory. Destroy its books, its culture, its history. Then have somebody write new books, manufacture a new culture, invent a new history. Before long that nation will begin to forget what it is and what it was… The struggle of man against power is the struggle of memory against forgetting.

Milan Kundera, The Book of Laughter and Forgetting (1979)¹

Omission is erasing from memory. Not all of that is bad, but not all of it is great either. This was hotly debated before artificial intelligences came along – most publicly, from what I have seen, in the United States, where statues venerating Confederate generals in public places were taken to task because… well, because slavery isn’t something that should be venerated. While there’s debate about whether the Civil War in the United States was fought over slavery or other things, one of the good things that came of it was getting rid of slavery in the United States.

Books are getting banned in schools. “Huckleberry Finn” and “To Kill A Mockingbird” have been removed from schools and libraries in some parts for similar things – for reminding people of how things used to be². That becomes omission, and it even deepens divides between generations.

There is a lot of room for debate, but the debate needs to be sincere, not people shouting talking-point monologues at each other. The victors always write history, but no victor became one by writing their own history alone. It’s also about omitting someone else’s.

That’s a big part of lines and walls. It’s not just what we include, it’s also about what we collectively omit.

And this is why the learning models of these things – marketed as artificial intelligence, though they’re more of a collective intelligence – are so important.

  1. I had to go look up who Milan Kundera was – a very interesting person who started off writing Communist-related material because he was surrounded by Communism, much as someone born within the lines of a theocracy would be influenced to be theocratic, or within the lines of a democracy, democratic. His later works, though – the ones he’s best known for – ‘escaped ideological definition’. ↩︎
  2. Personally, I disagree with that because I think it’s important to understand how things used to be so that we can understand why things are the way that they are now, and why they still need to improve in ways that we’re still figuring out. ↩︎

Forests and Big Pictures.

In working on something I’m writing, I started digging in on the idiom, “Cannot see the forest for the trees”.

The first recorded use of it used the old noun, wood, instead of trees:

“from him who sees no wood for trees
And yet is busie as the bees
From him that’s settled on his lees
And speaketh not without his fees”.

John Heywood, “The Proverbs of John Heywood” (1546), allegedly criticizing the Pope during the reign of Henry VIII in the first known use of the idiom, “cannot see the forest for the trees”.

I was bending it to a particular use, and thought I’d throw it into what I was writing – but it just looks pedantic there, as in the phrase, ‘unnecessarily pedantic’.

Thus, I looked into ‘the big picture’, whose meaning I believe people of my generation understand pretty well, though it wasn’t used much prior to the 1990s.

However, the first record of it in writing was in 1862!

There was nothing strange in it; it was but a panel from the big picture of life, such a one as you yourself might have traced out during those months spent at the sea-side – a very quiet panel, and I saw it principally through my window.

“A Romance of the Sea-side”, Chapter I, Chambers Journal of Popular Literature, Science and Arts, Conducted by William and Robert Chambers, Saturday, July 19th, 1862.

Both encapsulate concepts that probably pre-date these recorded uses. The common thread could be seen as framing, or focusing at different levels – things I consider to be the same idea applied differently.

Sadly, I can’t really use this in the project, though I am using the idioms, so I thought I’d toss it up here.

The Ongoing Copyright Issue with Generative AI.

It’s a strange time. OpenAI (and Microsoft) are being sued by the New York Times and they’re claiming ‘Fair Use’ as if they’re having some coffee and discussing what they read in the New York Times, or are about to write a blog post about the entire published NYT archives, on demand.

It’s not just the New York Times, either. More and more authors are doing the same, or started before NYT.

IEEE’s article, “Generative AI Has a Visual Plagiarism Problem”, demonstrates issues that back up the copyright claims. This is not mere regurgitation, and it is not fair use – there is line-by-line text from the New York Times, among other things.

As I noted yesterday, OpenAI is now making deals for content, and I only caught this morning that ‘The New York Times, too, had been in conversations with OpenAI to establish a “high-value” partnership involving “real-time display” of its brand in ChatGPT, OpenAI’s AI-powered chatbot.‘

Clearly, those discussions didn’t work out. I was going to link the New York Times article on it, but it seems I’ve used up my freebies, so I can’t actually read it right now unless I subscribe.¹ At this end of things, as a simple human being, I’m subject to paywalls for content, but OpenAI hasn’t been. If I can’t read and cite an article from the New York Times for free, why should they be able to?

On the other hand, when I pass along content that originated from news sites like the New York Times, there is fair use happening. People transform what they have read and regurgitate it, some more intelligibly than others, much like an artificial intelligence – but there is at least etiquette: linking the source, at the least. This is not something OpenAI does. It doesn’t give credit. It just inhales large amounts of text, and the algorithms decide on the best ways to spit it out to answer prompts. Like blogging, only faster – and like blogging, sometimes it just makes stuff up.

This is not unlike a well-read person doing the same. Ideas, thoughts, even memes are experiences we draw upon. What makes these generative artificial intelligences different? Speed. They also consume a lot more water, apparently.

The line has to be drawn somewhere, and since OpenAI isn’t living up to the first part of its name and being transparent, people are left poking a black box to see if their copyright has been sucked in without permission, mention, or recompense.
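Poking that black box is crude, but it can be done. Here’s a minimal sketch of one naive probe – my illustration, not an established test, and written against a local model’s API since OpenAI’s box is the one we can’t open: feed the model the opening of something you wrote and measure how closely its continuation matches your original. A near-verbatim continuation is suggestive, not proof.

```python
# A naive "poke the black box" probe - my illustration, not an
# established membership test. Assumes a local model served by
# Ollama on its default port; any completion API would work.
import difflib
import requests

def continuation_overlap(opening: str, original_rest: str,
                         model: str = "llama3") -> float:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": opening, "stream": False},
        timeout=120,
    )
    continuation = resp.json()["response"]
    # Similarity near 1.0 means a near-verbatim reproduction of
    # text the model was never shown in the prompt.
    return difflib.SequenceMatcher(
        None, continuation[: len(original_rest)], original_rest
    ).ratio()
```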

That does seem a bit like unfair use. This is not to say that the copyright system couldn’t use an overhaul, but apparently so could how generative AIs get their content.

What does the ‘Open’ in OpenAI mean, anyway?

  1. They do seem to have a good deal right now; I did try to subscribe, but it failed for some obscure reason. I’ll try again later. $10 for a year of the New York Times is a definite deal – if only they could process my payment this morning. ↩︎

Perfect Space for Reading and Writing?

Daily writing prompt
You get to build your perfect space for reading and writing. What’s it like?

I’ve tried evolving things over the years, and what I have found is that it’s not where I write that matters. It’s how I feel that matters.

Sometimes it means sitting at the big white dining table in the living room, as I am now, even ignoring the mess off to the right since I’m mid-reorganization.

Sometimes I do it outside on my balcony, with the raw cedar – freshly polished today.

The only place I don’t write is in the bedroom, really. Well, the bathrooms too.

I used to have romantic ideas of writing on the beach. That’s a bad idea. Sand – corrosive stuff – gets all over everything. I will write in notebooks, but then the sun is never quite right, the wind is never quite right, the sand is everywhere… and on every beach I’ve been to in every country, invariably there’s some idiot with a big speaker system in his car who really wants to play me the song of his people.

The thing I need for writing is an idea that has congealed. Once I have that, writing is a simple task.

Today I did not have one, so I finally used one of the writing prompts.