Does Artificial Intelligence Training Data Matter?

How Important Is Artificial Intelligence Training Data? And Why Should You Care?

A Crucial Facet of AI Training

In this guest post, L. N. Hunter discusses the training data devoured by artificial intelligence, and how that data affects the veracity of its responses.

L.N. hails from the UK; therefore, his grammar and spelling are based on UK English.

What Good Is AI Trained on Fiction?

Recently there’s been a lot of publicity about Meta training its LLM (Large Language Model, a type of AI — Artificial Intelligence) on books pirated by the LibGen folk (who should share as much of the blame as Meta, if not more, for stealing books in the first place [1]). This got me wondering: why train on fiction? And that led to: what is AI trained on fiction good for?

I write mainly SFF and horror — what on earth would an LLM ‘learn’ from my scribbles? How some dragons just want to be left alone to bake cakes (Dragons, Damsels and Danish Pastries), several ways to die horribly (Itch, Escape and Birth of a Serial Killer, to name but a few), how zombie private investigators navigate the world (the Saintsville series), or that machines will take over the world (Hiding in Plain Sight and I Was There When the Singularity Happened)? Actually, the last few might give AI ideas, so best keep those out of the training data!

And Your Reaction?

Oh, come on, you protest, there are so many stories out there that those specific examples won’t surface … But still, evidence seems to suggest that LLMs are regurgitating some texts verbatim, or near enough [2]. Putting that specific detail to one side, look at the wider domain of fiction — what sort of language is AI seeing? Most books aren’t ‘realistic’ — for example, fictional dialogue is far from real dialogue (it contains few of the ‘ums’ and ‘ers’ that punctuate real-life speech; and it tends to get to the point quickly in order to support the plot rather than wandering like a genuine conversation).

And then there’s LibGen’s non-fiction segment: the language of scientific papers, etc., is even less representative of natural language. Besides LibGen, LLMs are also getting material from Twitter (er, I mean X), Reddit, and general web crawls (much to the annoyance of web managers who find that their robots.txt files are often ignored). How on earth can anyone call that natural language!?
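As an aside, for anyone unfamiliar with robots.txt: it's a plain-text file at a site's root that asks crawlers to keep out, and compliance is entirely voluntary. Here's a minimal sketch of the sort of file a web manager might use to wave off a couple of well-known AI-related crawlers (GPTBot and CCBot are, as far as I know, the user-agent names of OpenAI's and Common Crawl's crawlers):

    # robots.txt: a polite request, not an enforcement mechanism
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

A crawler that chooses to ignore this file faces no technical barrier at all, which is rather the point of the complaint above.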

It all feels like only ever having seen car crashes in movies, and concluding that every car will burst into flames at the slightest provocation.

A Detour

Let me take a slight detour via fall detection … Various systems exist to detect when an elderly or frail person topples over, so that assistance will be called automatically. A laudable goal, but how do you make such a system? Putting aside the ethical concerns of stealing Gramps’ walking stick to watch him fall over, there aren’t that many examples of real elders getting into trouble like this — definitely not enough to train a system based on cameras or data from wearable devices. Instead, researchers rely on simulation — stunt actors deliberately falling over [3].

Most of these participants are young and fit, somewhat different from the population of concern; they have been trained to fall in ways which don't hurt them (much), and their falls are probably intended to look good on screen. It's not at all clear how representative those falls are of reality. While these fall detection systems do indeed work to some degree, their performance in real life consistently falls (sorry) short of that on simulated data [4].

In other words, and this should come as no surprise, a system trained on a particular data set performs best on data from that set, and worse on data beyond it. Mapping that to LLMs, they’ll perform better on tasks involving language that closely matches the training set than on tasks beyond it.
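For the technically inclined, here's a toy illustration of that point (entirely synthetic data, nothing to do with any real fall-detection system): train a simple classifier on one distribution of data, then test it on both the same distribution and a shifted one standing in for 'the real world'.

    # Toy demonstration of distribution shift: a model trained on one
    # data distribution scores worse on data drawn from a shifted one.
    # Synthetic throughout; the numbers are illustrative only.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(42)

    def make_data(n, shift=0.0):
        # Two classes ('no fall' vs 'fall'), each a 2-D Gaussian blob;
        # 'shift' nudges the blobs to mimic a change in population.
        X = np.vstack([
            rng.normal(loc=0.0 + shift, scale=1.0, size=(n, 2)),
            rng.normal(loc=2.0 + shift, scale=1.5, size=(n, 2)),
        ])
        y = np.array([0] * n + [1] * n)
        return X, y

    # Train on the 'simulated' data (think: stunt actors) ...
    X_train, y_train = make_data(500)
    model = LogisticRegression().fit(X_train, y_train)

    # ... then test on the same distribution and on a shifted one
    # (think: frail elders, whom the training data never saw).
    X_same, y_same = make_data(500)
    X_real, y_real = make_data(500, shift=1.5)

    print("accuracy on matching data:", model.score(X_same, y_same))
    print("accuracy on shifted data: ", model.score(X_real, y_real))

Run it and the second accuracy figure comes out well below the first: the model hasn't changed, only the data has moved out from under it.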

However …

We are using LLMs trained on fiction and other weird examples of language, and expecting them to work on real-world problems. As with fall detection systems trained on the 'wrong' data, they can clearly perform some neat tricks, but they're going to struggle with the wide range of problems in the real world.

Is it a case of making the best of the data available (well, available if you’re willing to dip your hand in the till a little)? And what else can you do?

  (a) Have machines monitor every single aspect of our lives, so that they can build a totally realistic model of what we want AI to do (I wonder how many AI companies would like to do just that, and damn the ethical considerations); or
  (b) Restrict AI to domains where it does have enough ‘knowledge’ to behave correctly much closer to 100% of the time.

As an example of (b), generative AI seems to be performing very well in the field of drug discovery [5], where the query space closely matches that of the training data. Maybe LLMs are actually only good for querying stolen literature and random web data.

Finally, even if (a) were possible, I don’t think it’s sufficient: humans don’t learn language only from books or chat groups, they learn from dynamic interactions with others; and humans make up new language too.

Then again, maybe I’m just being naïve and pessimistic, and current LLMs (despite being a dead end [6]) will be enough to bootstrap cheerful android buddies we can truly communicate with, and then AI will flourish.

I’ll leave it to you to decide if that’s a good thing.

References

[1] https://horrortree.com/a-stolen-livelihood-libgen-meta-and-the-atlantic/

[2] https://garymarcus.substack.com/p/partial-regurgitation-and-how-llms

[3] https://agingmo.com/wp-content/uploads/2017/10/JNCQ-Falls-Tech-Stunt-Actors-233-2008.pdf

[4] https://www.researchgate.net/publication/5845301_Fall_detection_-_Principles_and_Methods

[5] https://www.wsj.com/articles/ai-powered-databases-boost-the-alzheimers-drug-discovery-process-b9b75180

[6] https://medium.com/@jankammerath/the-ai-dilemma-when-large-language-model-training-reaches-a-dead-end-e2cf1de4a2ad

© L.N. Hunter

About L.N. Hunter

The Feather and the Lamp

L.N. Hunter’s comic fantasy novel, The Feather and the Lamp (Three Ravens Publishing), sits alongside works in anthologies such as the BSFA award-winning Best of British Science Fiction 2022, as well as several issues of Short Édition’s Short Circuit and the Horrifying Tales of Wonder podcast. There have also been papers in the IEEE Transactions on Neural Networks, which are probably somewhat less relevant and definitely less entertaining. When not writing, L.N. occasionally masquerades as a software developer or can be found unwinding in a disorganised home in Carlisle, UK, along with two cats and a soulmate.

Get in touch via https://linktr.ee/l.n.hunter or https://www.facebook.com/L.N.Hunter.writer




5 thoughts on “Does Artificial Intelligence Training Data Matter?”

    • Thanks. It’d be interesting for someone to take a proper look at how training data affects LLMs. I recall (but can’t find now) some work with pre-LLM bots where training on ‘uncensored’ data led to a very, um, unfriendly bot; and there was Microsoft’s Tay ‘learning’ online biases from users. It seems clear that the data matters, but not how much when looking at huge datasets.

  1. Very helpful. Thank you. I think your point “humans don’t learn language only from books or chat groups, they learn from dynamic interactions with others; and humans make up new language too” is paramount to the issue. No AI will replace the human mind, emotions, or spirit that is unique to each of us. Writers and readers need to demand human-authored literature.

    • Thanks. There’s possibly an interesting chicken-and-egg situation here. If we have ‘interactive androids’, we can converse with them to improve their language skills, but they need to have good language skills to be properly interactive. It took eons of evolution for us fleshy animals to reach that point – I wonder how long it’ll take machines 😮 And, more seriously, whether the current generation of AI will be a stepping stone along the way.