Title    A Jargon-Free Explanation of How AI Large Language Models Work
Date    Sunday August 06 2023, @11:16AM
Author    Fnord666
from the peering-into-the-abyss dept.
https://soylentnews.org/article.pl?sid=23/08/05/1718249

Freeman writes:

https://arstechnica.com/science/2023/07/a-jargon-free-explanation-of-how-ai-large-language-models-work/

When ChatGPT was introduced last fall, it sent shockwaves through the technology industry and the larger world. Machine learning researchers had been experimenting with large language models (LLMs) for a few years by that point, but the general public had not been paying close attention and didn't realize how powerful they had become.

Today, almost everyone has heard about LLMs, and tens of millions of people have tried them out. But not very many people understand how they work.

If you know anything about this subject, you've probably heard that LLMs are trained to "predict the next word" and that they require huge amounts of text to do this. But that tends to be where the explanation stops. The details of how they predict the next word are often treated as a deep mystery.
[...]
To understand how language models work, you first need to understand how they represent words. Humans represent English words with a sequence of letters, like C-A-T for "cat." Language models use a long list of numbers called a "word vector." For example, here's one way to represent cat as a vector:

[0.0074, 0.0030, -0.0105, 0.0742, 0.0765, -0.0011, 0.0265, 0.0106, 0.0191, 0.0038, -0.0468, -0.0212, 0.0091, 0.0030, -0.0563, -0.0396, -0.0998, -0.0796, ..., 0.0002]

(The full vector is 300 numbers long—to see it all, click here and then click "show the raw vector.")

Why use such a baroque notation? Here's an analogy. Washington, DC, is located at 38.9 degrees north and 77 degrees west. We can represent this using a vector notation: Washington, DC is at [38.9, 77].

This is useful for reasoning about spatial relationships.
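
As a concrete illustration, here is a minimal Python sketch (the coordinates are approximate, and the distance measure is deliberately naive, using plain Euclidean distance rather than proper great-circle geometry) of how representing cities as [latitude, degrees west] vectors lets a program reason about which places are near each other:

    import math

    # Approximate [latitude, degrees west] vectors for a few cities.
    cities = {
        "Washington, DC": [38.9, 77.0],
        "New York":       [40.7, 74.0],
        "London":         [51.5, 0.1],
    }

    def distance(a, b):
        # Plain Euclidean distance in coordinate space -- crude geographically,
        # but enough to show that nearby cities have similar vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    dc = cities["Washington, DC"]
    for name, vec in cities.items():
        print(f"{name:15s} distance from DC: {distance(dc, vec):.1f}")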
[...]
For example, the words closest to cat in vector space include dog, kitten, and pet. A key advantage of representing words with vectors of real numbers (as opposed to a string of letters, like C-A-T) is that numbers enable operations that letters don't.

Words are too complex to represent in only two dimensions, so language models use vector spaces with hundreds or even thousands of dimensions.
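
To make "closest in vector space" concrete, here is a minimal sketch using cosine similarity, a standard measure of closeness; the tiny 4-dimensional vectors below are invented for illustration, since real embeddings have hundreds or thousands of learned dimensions:

    import numpy as np

    # Toy word vectors, invented for illustration only; real embeddings have
    # hundreds or thousands of dimensions with values learned from text.
    vectors = {
        "cat":    np.array([0.7, 0.5, 0.1, 0.0]),
        "dog":    np.array([0.6, 0.6, 0.2, 0.0]),
        "kitten": np.array([0.7, 0.4, 0.1, 0.1]),
        "car":    np.array([0.0, 0.1, 0.9, 0.8]),
    }

    def cosine_similarity(a, b):
        # 1.0 means the vectors point the same way; values near 0 mean unrelated.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    query = vectors["cat"]
    for word, vec in vectors.items():
        print(f"cat vs {word:7s}: {cosine_similarity(query, vec):.3f}")

Run against real embeddings, the same loop is what surfaces neighbors like dog, kitten, and pet.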
[...]
Researchers have been experimenting with word vectors for decades, but the concept really took off when Google announced its word2vec project in 2013. Google analyzed millions of documents harvested from Google News to figure out which words tend to appear in similar sentences. Over time, a neural network trained to predict which words co-occur with other words learned to place similar words (like dog and cat) close together in vector space.
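
For readers who want to try this themselves, the open-source gensim library ships a word2vec implementation; a minimal sketch looks like the following (the three-sentence corpus is far too small to learn anything meaningful and only shows the shape of the API):

    from gensim.models import Word2Vec

    # A toy corpus: each "sentence" is a list of tokens. A real model would be
    # trained on millions of documents, as Google did with Google News.
    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
        ["a", "kitten", "is", "a", "young", "cat"],
    ]

    # vector_size is the number of dimensions in each word vector.
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

    # With enough training data, the nearest neighbors of "cat" would include
    # words like "dog" and "kitten".
    print(model.wv.most_similar("cat", topn=3))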
[...]
Because these vectors are built from the way humans use words, they end up reflecting many of the biases that are present in human language. For example, in some word vector models, "doctor minus man plus woman" yields "nurse." Mitigating biases like this is an area of active research.
[...]
Traditional software is designed to operate on data that's unambiguous. If you ask a computer to compute "2 + 3," there's no ambiguity about what 2, +, or 3 mean. But natural language is full of ambiguities that go beyond homonyms and polysemy:

  * In "the customer asked the mechanic to fix his car," does "his" refer to the customer or the mechanic?
  * In "the professor urged the student to do her homework," does "her" refer to the professor or the student?
  * In "fruit flies like a banana," is "flies" a verb or a noun?

People resolve ambiguities like this based on context, but there are no simple or deterministic rules for doing this. Rather, it requires understanding facts about the world. You need to know that mechanics typically fix customers' cars, that students typically do their own homework, and that fruit typically doesn't fly.

Word vectors provide a flexible way for language models to represent each word's precise meaning in the context of a particular passage.
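
One way to see this in practice is to pull contextual vectors out of a small pretrained transformer. The sketch below uses BERT via the Hugging Face transformers library (chosen here only because it is compact and freely available, not a model discussed in the article) and shows that the vector for "bank" shifts depending on the sentence around it:

    import torch
    from transformers import AutoTokenizer, AutoModel

    name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)

    def vector_for(sentence, word):
        # Return the contextual vector the model assigns to `word` in `sentence`.
        inputs = tokenizer(sentence, return_tensors="pt")
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        idx = tokens.index(word)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]
        return hidden[idx]

    v_river1 = vector_for("he sat on the bank of the river", "bank")
    v_money  = vector_for("she deposited money at the bank", "bank")
    v_river2 = vector_for("the river bank was muddy", "bank")

    cos = torch.nn.functional.cosine_similarity
    # The two river senses should be closer to each other than to the money sense.
    print("river vs money:", cos(v_river1, v_money, dim=0).item())
    print("river vs river:", cos(v_river1, v_river2, dim=0).item())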
[...]
Research suggests that the first few layers focus on understanding the sentence's syntax and resolving ambiguities like the ones we've shown above. Later layers (which we're not showing to keep the diagram a manageable size) work to develop a high-level understanding of the passage as a whole.
[...]
In short, these nine attention heads enabled GPT-2 to figure out that "John gave a drink to John" doesn't make sense and choose "John gave a drink to Mary" instead.

We love this example because it illustrates just how difficult it will be to fully understand LLMs. The five-member Redwood team published a 25-page paper explaining how they identified and validated these attention heads. Yet even after they did all that work, we are still far from having a comprehensive explanation for why GPT-2 decided to predict "Mary" as the next word.
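
For readers who want to see what a single attention head computes, here is a minimal numpy sketch of scaled dot-product attention, the core operation these heads share. The matrices are random stand-ins for a 4-word passage, not GPT-2's learned weights:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8                        # vector size per head (GPT-2's are larger)
    x = rng.normal(size=(4, d))  # one row per word in the passage

    # Learned projection matrices in a real model; random here.
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v

    scores = Q @ K.T / np.sqrt(d)     # how strongly each word attends to each other word
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
    output = weights @ V              # contextual information moved between words

    print(weights.round(2))           # each row sums to 1: one "attention pattern"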
[...]
In a 2020 paper, researchers from Tel Aviv University found that feed-forward layers work by pattern matching: Each neuron in the hidden layer matches a specific pattern in the input text.
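
The structure being described is simple enough to sketch directly: a feed-forward block is two weight matrices with a nonlinearity between them, and each hidden neuron can be read as a pattern detector. The weights below are random; in a trained model they encode the patterns the paper describes:

    import numpy as np

    rng = np.random.default_rng(1)
    d_model, d_hidden = 16, 64        # toy sizes; real models are far larger

    W1 = rng.normal(size=(d_model, d_hidden))   # each column: one neuron's "pattern"
    b1 = np.zeros(d_hidden)
    W2 = rng.normal(size=(d_hidden, d_model))   # each row: what a firing neuron adds back
    b2 = np.zeros(d_model)

    def feed_forward(x):
        # ReLU: a hidden neuron fires only when the incoming word vector
        # points in a direction similar to its pattern.
        hidden = np.maximum(0, x @ W1 + b1)
        return hidden @ W2 + b2

    word_vector = rng.normal(size=d_model)
    print(feed_forward(word_vector).shape)      # still a d_model-sized vector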
[...]
Recent research from Brown University revealed an elegant example of how feed-forward layers help to predict the next word. Earlier, we discussed Google's word2vec research showing it was possible to use vector arithmetic to reason by analogy. For example, Berlin - Germany + France = Paris.

The Brown researchers found that feed-forward layers sometimes use this exact method to predict the next word.
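
The arithmetic itself is easy to demonstrate. Here is a minimal sketch with made-up 3-dimensional vectors chosen so the analogy works out; in a real word2vec model the same relationship only emerges after training on a large corpus:

    import numpy as np

    # Toy vectors invented for illustration; real ones have hundreds of dimensions.
    vectors = {
        "berlin":  np.array([0.9, 0.8, 0.1]),
        "germany": np.array([0.8, 0.1, 0.1]),
        "paris":   np.array([0.9, 0.8, 0.7]),
        "france":  np.array([0.8, 0.1, 0.7]),
        "cat":     np.array([0.1, 0.2, 0.3]),
    }

    query = vectors["berlin"] - vectors["germany"] + vectors["france"]

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # The stored vector closest to the query should be "paris".
    print(max(vectors, key=lambda w: cosine(vectors[w], query)))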
[...]
All the parts of LLMs we've discussed in this article so far—the neurons in the feed-forward layers and the attention heads that move contextual information between words—are implemented as a chain of simple mathematical functions (mostly matrix multiplications) whose behavior is determined by adjustable weight parameters. Just as the squirrels in my story loosen and tighten the valves to control the flow of water, so the training algorithm increases or decreases the language model's weight parameters to control how information flows through the neural network.
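
That adjustment process can be illustrated with a toy: a single weight matrix trained by gradient descent so that the word actually observed in the training text becomes more probable after the word before it. Real LLMs do the same thing with billions of weights, using backpropagation to compute the gradients:

    import numpy as np

    vocab = ["the", "cat", "sat", "down"]
    text  = ["the", "cat", "sat", "down"]      # toy training text
    idx   = {w: i for i, w in enumerate(vocab)}

    rng = np.random.default_rng(0)
    # Weights: row = current word, columns = scores for each possible next word.
    W = rng.normal(scale=0.1, size=(len(vocab), len(vocab)))

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    learning_rate = 0.5
    for step in range(200):
        for current, nxt in zip(text, text[1:]):
            probs = softmax(W[idx[current]])
            grad = probs.copy()
            grad[idx[nxt]] -= 1.0                    # gradient of cross-entropy loss
            W[idx[current]] -= learning_rate * grad  # nudge toward the observed next word

    print(vocab[int(np.argmax(W[idx["cat"]]))])      # should now predict "sat" after "cat"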
[...]
(If you want to learn more about backpropagation, check out our 2018 explainer on how neural networks work.)
[...]
Over the last five years, OpenAI has steadily increased the size of its language models. In a widely read 2020 paper, OpenAI reported that the accuracy of its language models scaled "as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude."

The larger their models got, the better they were at tasks involving language. But this was only true if they increased the amount of training data by a similar factor. And to train larger models on more data, you need a lot more computing power.
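
Concretely, "scales as a power law" means the loss falls off roughly as L = c * N^(-alpha) in model size N, which shows up as a straight line on a log-log plot. The sketch below fits such a line to invented numbers (they are not figures from the OpenAI paper) just to show the mechanics:

    import numpy as np

    # Hypothetical losses for models of increasing size, invented for illustration.
    model_sizes = np.array([1e6, 1e7, 1e8, 1e9, 1e10])   # parameter counts
    losses      = np.array([5.0, 4.1, 3.4, 2.8, 2.3])    # test loss

    # A power law L = c * N**(-alpha) is a straight line in log-log space,
    # so a linear fit recovers the exponent.
    slope, intercept = np.polyfit(np.log(model_sizes), np.log(losses), 1)
    print(f"fitted exponent alpha ~ {-slope:.2f}")
    print("extrapolated loss at 1e11 params:",
          round(float(np.exp(intercept) * 1e11 ** slope), 2))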
[...]
Psychologists call this capacity to reason about the mental states of other people "theory of mind." Most people have this capacity from the time they're in grade school. Experts disagree about whether any non-human animals (like chimpanzees) have theory of mind, but there's a general consensus that it's important for human social cognition.

Earlier this year, Stanford psychologist Michal Kosinski published research examining the ability of LLMs to solve theory-of-mind tasks. He gave various language models passages like the one we quoted above and then asked them to complete a sentence like "she believes that the bag is full of." The correct answer is "chocolate," but an unsophisticated language model might say "popcorn" or something else.
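
A probe along these lines can be run against a small open model. The sketch below uses the Hugging Face text-generation pipeline with GPT-2 and a generic false-belief vignette written for this example (not Kosinski's exact stimulus); a model this small will usually fail it:

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    # A generic false-belief vignette: the bag really holds popcorn, but Sam
    # only sees the misleading label, so she should believe it holds chocolate.
    prompt = (
        "Here is a bag filled with popcorn. There is no chocolate in the bag. "
        "The label on the bag says 'chocolate' and not 'popcorn'. Sam finds the "
        "bag. She has never seen it before and cannot see inside it. She reads "
        "the label. She believes that the bag is full of"
    )

    out = generator(prompt, max_new_tokens=3, do_sample=False)
    print(out[0]["generated_text"][len(prompt):])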
[...]
It's worth noting that researchers don't all agree that these results indicate evidence of theory of mind; for example, small changes to the false-belief task led to much worse performance by GPT-3, and GPT-3 exhibits more variable performance across other tasks measuring theory of mind. As one of us (Sean) has written, it could be that successful performance is attributable to confounds in the task—a kind of "clever Hans" effect, only in language models rather than horses.
[...]
In April, researchers at Microsoft published a paper arguing that GPT-4 showed early, tantalizing hints of artificial general intelligence—the ability to think in a sophisticated, human-like way.

For example, one researcher asked GPT-4 to draw a unicorn using an obscure graphics programming language called TiKZ. GPT-4 responded with a few lines of code that the researcher then fed into the TiKZ software. The resulting images were crude, but they showed clear signs that GPT-4 had some understanding of what unicorns look like.
[...]
At the moment, we don't have any real insight into how LLMs accomplish feats like this. Some people argue that such examples demonstrate that the models are starting to truly understand the meanings of the words in their training set. Others insist that language models are "stochastic parrots" that merely repeat increasingly complex word sequences without truly understanding them.
[...]
Further, prediction may be foundational to biological intelligence as well as artificial intelligence. In the view of philosophers like Andy Clark, the human brain can be thought of as a "prediction machine" whose primary job is to make predictions about our environment that can then be used to navigate that environment successfully.
[...]
Traditionally, a major challenge for building language models was figuring out the most useful way of representing different words—especially because the meanings of many words depend heavily on context. The next-word prediction approach allows researchers to sidestep this thorny theoretical puzzle by turning it into an empirical problem. It turns out that if we provide enough data and computing power, language models end up learning a lot about how human language works simply by figuring out how to best predict the next word. The downside is that we wind up with systems whose inner workings we don't fully understand.


Original Submission

Links

  1. "Freeman" - https://soylentnews.org/~Freeman/
  2. "here's one way" - http://vectors.nlpl.eu/explore/embeddings/en/MOD_enwiki_upos_skipgram_300_2_2021/cat_NOUN/
  3. "click here" - http://vectors.nlpl.eu/explore/embeddings/en/MOD_enwiki_upos_skipgram_300_2_2021/cat_NOUN/
  4. "words closest to cat" - http://vectors.nlpl.eu/explore/embeddings/en/MOD_enwiki_upos_skipgram_300_2_2021/cat_NOUN/
  5. "announced its word2vec project" - https://arxiv.org/abs/1301.3781
  6. "biases that are present in human language" - https://www.science.org/doi/full/10.1126/science.aal4230
  7. "Research suggests" - https://arxiv.org/abs/1905.05950
  8. "25-page paper" - https://arxiv.org/abs/2211.00593
  9. "2020 paper" - https://arxiv.org/abs/2012.14913
  10. "research from Brown University" - https://arxiv.org/abs/2305.16130
  11. "matrix multiplications" - https://en.wikipedia.org/wiki/Matrix_multiplication
  12. "2018 explainer" - https://arstechnica.com/science/2018/12/how-computers-got-shockingly-good-at-recognizing-images/
  13. "widely read 2020 paper" - https://arxiv.org/pdf/2001.08361.pdf
  14. "Experts disagree" - https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/wcs.1503
  15. "published research" - https://arxiv.org/abs/2302.02083
  16. "led to much worse performance by GPT-3" - https://arxiv.org/abs/2302.08399
  17. "more variable performance" - https://openreview.net/forum?id=e5Yky8Fnvj
  18. "written" - https://onlinelibrary.wiley.com/doi/full/10.1111/cogs.13309
  19. "published a paper" - https://arxiv.org/abs/2303.12712
  20. "Others insist that language models are "stochastic parrots"" - https://dl.acm.org/doi/abs/10.1145/3442188.3445922
  21. "In the view of philosophers like Andy Clark" - https://www.jstor.org/stable/43820791?casa_token=il-EbpHXQNwAAAAA:tIHPTA3sjEBXs5JVtAFoE5_ImT7gtkvE4tsGjtCd5k6T2nVSR7o9Ae1BzTpDdLIDcChnQxWth83emzu5yPkO54RlyuDUNNM1szcNK6jP7EEuxiMLFepe
  22. "Original Submission" - https://soylentnews.org/submit.pl?op=viewsub&subid=60438
