https://arstechnica.com/science/2023/07/a-jargon-free-explanation-of-how-ai-large-language-models-work/ [arstechnica.com]
When ChatGPT was introduced last fall, it sent shockwaves through the technology industry and the larger world. Machine learning researchers had been experimenting with large language models (LLMs) for a few years by that point, but the general public had not been paying close attention and didn’t realize how powerful they had become.
Today, almost everyone has heard about LLMs, and tens of millions of people have tried them out. But not very many people understand how they work.
If you know anything about this subject, you’ve probably heard that LLMs are trained to “predict the next word” and that they require huge amounts of text to do this. But that tends to be where the explanation stops. The details of how they predict the next word is often treated as a deep mystery.
[...]
To understand how language models work, you first need to understand how they represent words. Humans represent English words with a sequence of letters, like C-A-T for "cat." Language models use a long list of numbers called a "word vector." For example, here’s one way [vectors.nlpl.eu] to represent cat as a vector:[0.0074, 0.0030, -0.0105, 0.0742, 0.0765, -0.0011, 0.0265, 0.0106, 0.0191, 0.0038, -0.0468, -0.0212, 0.0091, 0.0030, -0.0563, -0.0396, -0.0998, -0.0796, …, 0.0002]
(The full vector is 300 numbers long—to see it all, click here [vectors.nlpl.eu] and then click “show the raw vector.”)
Why use such a baroque notation? Here’s an analogy. Washington, DC, is located at 38.9 degrees north and 77 degrees west. We can represent this using a vector notation:
- Washington, DC, is at [38.9, 77]
- New York is at [40.7, 74]
- London is at [51.5, 0.1]
- Paris is at [48.9, -2.4]
This is useful for reasoning about spatial relationships.
[...]
For example, the words closest to cat [vectors.nlpl.eu] in vector space include dog, kitten, and pet. A key advantage of representing words with vectors of real numbers (as opposed to a string of letters, like C-A-T) is that numbers enable operations that letters don’t.Words are too complex to represent in only two dimensions, so language models use vector spaces with hundreds or even thousands of dimensions.
[...]
Researchers have been experimenting with word vectors for decades, but the concept really took off when Google announced its word2vec project [arxiv.org] in 2013. Google analyzed millions of documents harvested from Google News to figure out which words tend to appear in similar sentences. Over time, a neural network trained to predict which words co-occur with other words learned to place similar words (like dog and cat) close together in vector space.
[...]
Because these vectors are built from the way humans use words, they end up reflecting many of the biases that are present in human language [science.org]. For example, in some word vector models, "doctor minus man plus woman" yields "nurse." Mitigating biases like this is an area of active research.
[...]
Traditional software is designed to operate on data that’s unambiguous. If you ask a computer to compute “2 + 3,” there’s no ambiguity about what 2, +, or 3 mean. But natural language is full of ambiguities that go beyond homonyms and polysemy:
- In “the customer asked the mechanic to fix his car,” does "his" refer to the customer or the mechanic?
- In “the professor urged the student to do her homework” does "her" refer to the professor or the student?
- In “fruit flies like a banana” is "flies" a verb (referring to fruit soaring across the sky) or a noun (referring to banana-loving insects)?
People resolve ambiguities like this based on context, but there are no simple or deterministic rules for doing this. Rather, it requires understanding facts about the world. You need to know that mechanics typically fix customers’ cars, that students typically do their own homework, and that fruit typically doesn’t fly.
Word vectors provide a flexible way for language models to represent each word’s precise meaning in the context of a particular passage.
[...]
Research suggests [arxiv.org] that the first few layers focus on understanding the sentence's syntax and resolving ambiguities like we’ve shown above. Later layers (which we’re not showing to keep the diagram a manageable size) work to develop a high-level understanding of the passage as a whole.
[...]
In short, these nine attention heads enabled GPT-2 to figure out that “John gave a drink to John” doesn’t make sense and choose “John gave a drink to Mary” instead.We love this example because it illustrates just how difficult it will be to fully understand LLMs. The five-member Redwood team published a 25-page paper [arxiv.org] explaining how they identified and validated these attention heads. Yet even after they did all that work, we are still far from having a comprehensive explanation for why GPT-2 decided to predict "Mary" as the next word.
[...]
In a 2020 paper [arxiv.org], researchers from Tel Aviv University found that feed-forward layers work by pattern matching: Each neuron in the hidden layer matches a specific pattern in the input text.
[...]
Recent research from Brown University [arxiv.org] revealed an elegant example of how feed-forward layers help to predict the next word. Earlier, we discussed Google’s word2vec research showing it was possible to use vector arithmetic to reason by analogy. For example, Berlin - Germany + France = Paris.The Brown researchers found that feed-forward layers sometimes use this exact method to predict the next word.
[...]
All the parts of LLMs we’ve discussed in this article so far—the neurons in the feed-forward layers and the attention heads that move contextual information between words—are implemented as a chain of simple mathematical functions (mostly matrix multiplications [wikipedia.org]) whose behavior is determined by adjustable weight parameters. Just as the squirrels in my story loosen and tighten the valves to control the flow of water, so the training algorithm increases or decreases the language model’s weight parameters to control how information flows through the neural network.
[....]
(If you want to learn more about backpropagation, check out our 2018 explainer [arstechnica.com] on how neural networks work.)
[...]
Over the last five years, OpenAI has steadily increased the size of its language models. In a widely read 2020 paper [arxiv.org], OpenAI reported that the accuracy of its language models scaled “as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude.”The larger their models got, the better they were at tasks involving language. But this was only true if they increased the amount of training data by a similar factor. And to train larger models on more data, you need a lot more computing power.
[...]
Psychologists call this capacity to reason about the mental states of other people “theory of mind.” Most people have this capacity from the time they’re in grade school. Experts disagree [wiley.com] about whether any non-human animals (like chimpanzees) have theory of mind, but there’s a general consensus that it's important for human social cognition.Earlier this year, Stanford psychologist Michal Kosinski published research [arxiv.org] examining the ability of LLMs to solve theory-of-mind tasks. He gave various language models passages like the one we quoted above and then asked them to complete a sentence like “she believes that the bag is full of.” The correct answer is “chocolate,” but an unsophisticated language model might say “popcorn” or something else.
[...]
It’s worth noting that researchers don’t all agree that these results indicate evidence of theory of mind; for example, small changes to the false-belief task led to much worse performance by GPT-3 [arxiv.org], and GPT-3 exhibits more variable performance [openreview.net] across other tasks measuring theory of mind. As one of us (Sean) has written [wiley.com], it could be that successful performance is attributable to confounds in the task—a kind of “clever Hans” effect, only in language models rather than horses.
[...]
In April, researchers at Microsoft published a paper [arxiv.org] arguing that GPT-4 showed early, tantalizing hints of artificial general intelligence—the ability to think in a sophisticated, human-like way.For example, one researcher asked GPT-4 to draw a unicorn using an obscure graphics programming language called TiKZ. GPT-4 responded with a few lines of code that the researcher then fed into the TiKZ software. The resulting images were crude, but they showed clear signs that GPT-4 had some understanding of what unicorns look like.
[...]
At the moment, we don’t have any real insight into how LLMs accomplish feats like this. Some people argue that such examples demonstrate that the models are starting to truly understand the meanings of the words in their training set. Others insist that language models are “stochastic parrots” [acm.org] that merely repeat increasingly complex word sequences without truly understanding them.
[...]
Further, prediction may be foundational to biological intelligence as well as artificial intelligence. In the view of philosophers like Andy Clark [jstor.org], the human brain can be thought of as a “prediction machine” whose primary job is to make predictions about our environment that can then be used to navigate that environment successfully.
[...]
Traditionally, a major challenge for building language models was figuring out the most useful way of representing different words—especially because the meanings of many words depend heavily on context. The next-word prediction approach allows researchers to sidestep this thorny theoretical puzzle by turning it into an empirical problem. It turns out that if we provide enough data and computing power, language models end up learning a lot about how human language works simply by figuring out how to best predict the next word. The downside is that we wind up with systems whose inner workings we don’t fully understand.