[...] The paper's contributing authors include luminaries of the AI world, such as Yoshua Bengio, MILA's scientific director, who is a recipient of the 2018 Turing Award (announced in 2019), computing's equivalent of the Nobel Prize. Bengio is widely credited with helping to develop the attention mechanism in 2014, years before Vaswani and team adapted it for the Transformer.

Also among the authors is Stanford University computer science associate professor Christopher Ré, who has helped in recent years to advance the notion of AI as "software 2.0".

To find a sub-quadratic alternative to attention, Poli and team set about studying how the attention mechanism does what it does, to see whether the same work could be done more efficiently.

A recent practice in AI science, known as mechanistic interpretability, is yielding insights about what is going on deep inside a neural network, inside the computational "circuits" of attention. You can think of it as taking apart software the way you would take apart a clock or a PC to see its parts and figure out how it operates.

One work cited by Poli and team is a set of experiments by researcher Nelson Elhage of AI startup Anthropic. Those experiments take apart the Transformer programs to see what attention is doing.

In essence, what Elhage and team found is that attention functions at its most basic level by very simple computer operations, such as copying a word from recent input and pasting it into the output.

For example, suppose you start typing a sentence from Harry Potter and the Sorcerer's Stone, such as "Mr. Dursley was the director of a firm called Grunnings...", into a large language model program such as ChatGPT. Just typing "D-u-r-s", the start of the name, might be enough to prompt the program to complete the name "Dursley", because it has seen the name in a prior sentence of Sorcerer's Stone. The system is able to copy from memory the record of the characters "l-e-y" to autocomplete the name.
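The copy-and-paste behavior can be pictured with a deliberately simple sketch. This is not how a Transformer is implemented, and the function name `complete_from_context` is made up for illustration. It only shows the idea of looking back through earlier text for a word that starts with what has been typed so far.

```python
def complete_from_context(context: str, prefix: str) -> str:
    """Return the most recent earlier word that starts with `prefix`.

    Falls back to the bare prefix when nothing in the context matches.
    A toy stand-in for the copying behavior described above, not a real model.
    """
    # Split on whitespace and drop surrounding punctuation from each word.
    words = [word.strip('.,"') for word in context.split()]
    # Scan from the most recent word back to the oldest.
    for word in reversed(words):
        if word.startswith(prefix) and len(word) > len(prefix):
            return word
    return prefix


context = "Mr. Dursley was the director of a firm called Grunnings"
print(complete_from_context(context, "Durs"))  # Dursley
```

A real language model learns this behavior from data rather than following a hand-written rule, but the effect on the output is similar: the missing "l-e-y" is recovered from text the program has already seen.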

However, the attention operation runs into the quadratic complexity problem as the number of words grows and grows, because each word is compared against every other word. More words require more of what are known as "weights" or parameters to run the attention operation.

As the authors write: "The Transformer block is a powerful tool for sequence modeling, but it is not without its limitations. One of the most notable is the computational cost, which grows rapidly as the length of the input sequence increases."
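To see what "grows rapidly" means in practice, here is a minimal sketch that builds the table of scores attention needs, one score for every pair of words. The vectors are placeholder values and the helper names `attention_scores` and `score_count` are assumptions for illustration. Real models add scaling, a softmax step and learned projections that are left out here.

```python
def attention_scores(queries, keys):
    """Dot product of every query with every key: an n-by-n table of scores."""
    return [
        [sum(q * k for q, k in zip(query, key)) for key in keys]
        for query in queries
    ]


def score_count(num_tokens, dim=4):
    """Count how many scores are computed for a sequence of `num_tokens` words."""
    # Placeholder vectors; only the shape of the computation matters here.
    vectors = [[1.0] * dim for _ in range(num_tokens)]
    scores = attention_scores(vectors, vectors)
    return sum(len(row) for row in scores)


for n in (10, 20, 40):
    print(n, score_count(n))
# 10 100
# 20 400
# 40 1600
```

Each doubling of the input length quadruples the number of scores, which is the quadratic growth the authors describe.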

While the technical details of ChatGPT and GPT-4 haven't been disclosed by OpenAI, it is believed they may have a trillion or more such parameters. Running these parameters requires more GPU chips from Nvidia, thus driving up the compute cost.

To reduce that quadratic compute cost, Poli and team replace the attention operation with what's called a "convolution", which is one of the oldest operations in AI programs, refined back in the 1980s. A convolution is just a filter that can pick out items in data, be it the pixels in a digital photo or the words in a sentence.
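As a rough illustration of a convolution acting as a filter, the sketch below slides a two-number filter across a short list of values. The function name `convolve_1d` and the sample data are assumptions for illustration. It follows the deep learning convention of not flipping the filter, which is strictly speaking a cross-correlation.

```python
def convolve_1d(signal, kernel):
    """Slide `kernel` across `signal` and sum the products at each position.

    Only positions where the kernel fits entirely inside the signal are kept.
    """
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


signal = [0, 0, 1, 1, 1, 0, 0]
edge_filter = [-1, 1]  # responds where neighboring values differ
print(convolve_1d(signal, edge_filter))  # [0, 1, 0, 0, -1, 0]
```

The filter outputs zero wherever the values are flat and a nonzero number where they rise or fall, which is the sense in which a convolution "picks out" items in the data.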

Poli and team do a kind of mash-up: they take work done by Stanford researcher Daniel Y. Fu and team to apply convolutional filters to sequences of words, and they combine that with work by scholar David Romero and colleagues at the Vrije Universiteit Amsterdam that lets the program change filter size on the fly. That ability to flexibly adapt cuts down on the number of costly parameters, or weights, the program needs to have.

The result of the mash-up is that a convolution can be applied to an unlimited amount of text without requiring more and more parameters in order to copy more and more data. It's an "attention-free" approach, as the authors put it.
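One way to picture a filter that stretches to any length without extra parameters is to generate its values from a small formula. The sketch below is an assumption-laden toy, not the authors' method: the decaying-cosine formula, the two settings `decay` and `frequency`, and the function names are all invented for illustration.

```python
import math


def make_filter(length, decay=0.1, frequency=0.5):
    """Generate `length` filter values from just two numbers.

    Hypothetical formula chosen for illustration; the paper's filters are
    produced differently. The point is that the count of settings stays fixed.
    """
    return [math.exp(-decay * t) * math.cos(frequency * t) for t in range(length)]


def causal_convolve(sequence, kernel):
    """Each output mixes the current input with earlier inputs only."""
    return [
        sum(kernel[j] * sequence[t - j] for j in range(t + 1))
        for t in range(len(sequence))
    ]


for n in (8, 64, 512):
    kernel = make_filter(n)  # same two settings, whatever the length
    output = causal_convolve([1.0] * n, kernel)
    print(n, len(kernel), len(output))
```

The filter grows with the text while the number of settings behind it stays at two. Note that the plain loop used here is itself quadratic in the sequence length; fast convolution algorithms such as the fast Fourier transform are the usual way to bring that cost down.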