DeepMind unveils first AI to discover faster matrix multiplication algorithms:
Can artificial intelligence (AI) create its own algorithms to speed up matrix multiplication, one of machine learning’s most fundamental tasks? Today, in a paper published in Nature, DeepMind unveiled AlphaTensor, the “first artificial intelligence system for discovering novel, efficient and provably correct algorithms.” The Google-owned lab said the research “sheds light” on a 50-year-old open question in mathematics about finding the fastest way to multiply two matrices.
Ever since the Strassen algorithm was published in 1969, computer science has been on a quest to surpass the speed at which it multiplies two matrices. While matrix multiplication is one of algebra's simplest operations, taught in high school math, it is also one of the most fundamental computational tasks and, as it turns out, one of the core mathematical operations in today's neural networks.
[...] This research delves into how AI could be used to improve computer science itself, said Pushmeet Kohli, head of AI for science at DeepMind, at a press briefing.
“If we’re able to use AI to find new algorithms for fundamental computational tasks, this has enormous potential because we might be able to go beyond the algorithms that are currently used, which could lead to improved efficiency,” he said.
This is a particularly challenging task, he explained, because the process of discovering new algorithms is so difficult, and automating algorithmic discovery using AI requires a long and difficult reasoning process — from forming intuition about the algorithmic problem to actually writing a novel algorithm and proving that the algorithm is correct on specific instances.
“This is a difficult set of steps and AI has not been very good at that so far,” he said.
[...] According to DeepMind, AlphaTensor discovered algorithms that are more efficient than the state of the art for many matrix sizes and outperform human-designed ones.
AlphaTensor begins without any knowledge about the problem, Kohli explained, and then gradually learns what is happening and improves over time. “It first finds this classroom algorithm that we were taught, and then it finds historical algorithms such as Strassen’s and then at some point, it surpasses them and discovers completely new algorithms that are faster than previously.”
Kohli said he hopes that this paper inspires others to use AI to guide algorithmic discovery for other fundamental computational tasks. "We think this is a major step in our path towards really using AI for algorithmic discovery," he said.
What other new algorithms will be discovered? I wonder when they will attempt to apply this to factorization?
(Score: 5, Informative) by khallow on Wednesday October 12 2022, @12:56AM (2 children)
The result is that the N^3 multiplications are reduced to roughly N^(log2 7) ~ N^2.807. Mathematicians have reduced this exponent further with some weird group-theoretic approaches (such as here [siam.org]), but presently those work only on ludicrously large matrices.
The theoretical limit for matrix multiplication, using these tricks, is thought to be N^2, but at present, the closer you get to that exponent, the larger your matrices need to be before you see those computation savings. For most purposes, it hasn't been worthwhile to use a matrix multiplication algorithm more sophisticated than Strassen's.
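For the curious, here is a back-of-the-envelope sketch (mine, not from TFA or the paper) of where that exponent comes from: Strassen replaces the 8 half-size block products of the classroom method with 7, so the multiplication count obeys M(N) = 7·M(N/2), which works out to N^(log2 7) ~ N^2.807.

import math

def naive_mults(n):      # scalar multiplications in the classroom algorithm
    return n ** 3

def strassen_mults(n):   # n assumed to be a power of two
    return 1 if n == 1 else 7 * strassen_mults(n // 2)

for n in (2, 16, 128, 1024):
    s = strassen_mults(n)
    print(n, naive_mults(n), s, round(math.log(s, n), 3))   # last column -> 2.807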
(Score: 5, Interesting) by nostyle on Wednesday October 12 2022, @04:21AM (1 child)
Toeplitz matrices [wikipedia.org] need only O(N^2) operations to invert.
Long ago (late eighties) I had to code a matrix solver to compute the inductance of various configurations of conductors. I discovered that, excepting the first two rows and columns, the equations fell naturally into a Toeplitz configuration. By solving the bulk of the matrix (typically 300 rows on a PC of that era) with a Levinson algorithm, the total computation time scaled very nearly as N^2.
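As a minimal illustration of the point (not the original eighties code; it assumes scipy and a well-conditioned system), a Levinson-type Toeplitz solve against a general dense solve:

import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz, solve

N = 300                                   # roughly the size mentioned above
rng = np.random.default_rng(1)
c = rng.standard_normal(N); c[0] += N     # first column, made diagonally dominant
r = rng.standard_normal(N); r[0] = c[0]   # first row
b = rng.standard_normal(N)

x_fast = solve_toeplitz((c, r), b)        # Levinson recursion, O(N^2)
x_ref  = solve(toeplitz(c, r), b)         # general dense solve, O(N^3)
assert np.allclose(x_fast, x_ref)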
--
"Just machines to make big decisions programmed by fellas who can cash in in vision", -Donald Fagen, I.G.Y. // what I hear there
(Score: 1) by khallow on Wednesday October 12 2022, @04:49AM
Probably easy to multiply too - though my first lazy attempts ended up being order N^3.
Nice approach. It is interesting how so much of these things is a nice problem plus some edge weirdness.
(Score: 5, Insightful) by stormwyrm on Wednesday October 12 2022, @12:56AM (9 children)
Here is the paper [nature.com] (doi:10.1038/s41586-022-05172-4), and the blog post from the researchers [deepmind.com]. TFA sounds more like a PR puff piece.
I am a bit sceptical that this actually represents some kind of practical performance improvement. Strassen's algorithm for 2×2 matrices requires seven multiplications as opposed to the eight required by the standard O(n³) algorithm, but the standard algorithm requires four additions while Strassen's uses 18. I don't think that a single multiplication takes 14 more cycles than an addition on modern silicon, and I don't think this was true even when Strassen discovered his algorithm. Take the 4×5 by 5×5 matrix multiplication: the standard algorithm does 100 multiplications and Strassen-style approaches get that down to 80, but the new technique DeepMind used has discovered one with 76 multiplications. Now how many additions will it take, and how much more extra storage? Will saving 24 multiplications be worth all the extra additions and the memory and cache overhead of the new technique?
Just like Strassen's algorithm, it seems like no more than a theoretical curiosity. It seems doubtful it can deliver any actual performance gains. Still, it's intriguing that they were able to extend the domain into mathematics.
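A quick sanity check of the 2×2 counts being discussed (my own sketch, not DeepMind's code): the seven Strassen products plus the recombination use 18 additions/subtractions, versus 4 additions for the classroom method's 8 products.

import numpy as np

def strassen_2x2(A, B):
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)   # 7 multiplications in total,
    m2 = (a21 + a22) * b11           # with 10 additions/subtractions here...
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    c11 = m1 + m4 - m5 + m7          # ...and 8 more here: 18 in all
    c12 = m3 + m5
    c21 = m2 + m4
    c22 = m1 - m2 + m3 + m6
    return np.array([[c11, c12], [c21, c22]])

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])
assert np.allclose(strassen_2x2(A, B), A @ B)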
Numquam ponenda est pluralitas sine necessitate.
(Score: 2, Interesting) by khallow on Wednesday October 12 2022, @01:17AM
Strassen's algorithm is a genuine speed improvement. Block matrix multiplication is more costly than block matrix addition, and this gap grows as the matrix increases in size. The matrices need to be large enough - IIRC, somewhere around 32 or more rows/columns in each dimension to justify the overhead of the extra additions.
I don't know about the present Google approach, but my understanding has been that as one gets the theoretical number of block matrix multiplications down, the rest of the overhead goes up. Some of these algorithms see performance advantage at absurd sizes, maybe like millions of rows/columns. They probably also require considerable massaging to get the right number of rows/columns (like Strassen which operates on even rows/columns).
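To make that crossover concrete, here is a crude flop-count model (my own, ignoring memory traffic and vectorisation): one level of Strassen on an N×N matrix with half-size blocks of h = N/2 costs about 7 block products plus 18 block additions, versus 8 products plus 4 additions for the naive split. The break-even point lands close to the ~32 rows/columns mentioned above.

def naive_cost(n):
    h = n // 2
    return 8 * h**3 + 4 * h**2      # 8 block products (~h^3 each), 4 block additions (~h^2 each)

def strassen_cost(n):
    h = n // 2
    return 7 * h**3 + 18 * h**2     # 7 block products, 18 block additions

for n in (16, 28, 32, 64, 256):
    print(n, naive_cost(n), strassen_cost(n), strassen_cost(n) < naive_cost(n))
# in this model one Strassen level wins once h > 14, i.e. N > 28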
(Score: 2) by bzipitidoo on Wednesday October 12 2022, @02:55AM (4 children)
Yes, it will be worth any finite number of extra addition operations to avoid just 1 multiplication. You might have to scale up to very large numbers and large matrices to realize the savings. But it's there.
The issue is that for m-digit numbers, multiplication takes on the order of m log m operations if m is large enough for the Fast Fourier Transform to be worth using; otherwise you're stuck with grade-school multiplication, which takes m^2 operations. Addition takes m operations. Even if you have to do 100 extra additions, 100m will be smaller than m log m for sufficiently large m.
(Score: 3, Informative) by stormwyrm on Wednesday October 12 2022, @03:55AM (1 child)
Absolute hogwash when we're talking about the ordinary, everyday systems that most often arise in practice, especially those using the integer and floating-point data types provided by modern hardware. Even moving data around is not free, and these other algorithms do that far more than the basic one. Strassen's algorithm, for instance, is actually counterproductive for matrices smaller than around 1000×1000 because of its substantial overhead. Even for larger matrices the performance gains are marginal, around 10% at best over the basic matrix multiplication algorithm. It also incurs increased memory usage (meaning that in addition to the eight megabytes each of your 1000×1000 double-precision floating-point matrices already consumes, you'll need several times more to multiply them) and, for floating-point implementations, decreased numerical stability. The Coppersmith–Winograd algorithm and this new one developed at DeepMind are likely even worse in these respects. There is a sizeable constant factor in the performance of these "better" algorithms, hidden by big-O notation, that makes them not worth using except in highly exceptional circumstances.
Numquam ponenda est pluralitas sine necessitate.
(Score: 1) by khallow on Thursday October 13 2022, @03:22PM
You're also assuming the existence of the fast vector operations that a lot of CPUs have, which greatly speed up normal matrix multiplication. On CPUs that don't have them, Strassen achieves its advantage at a much smaller matrix size.
(Score: 2) by FatPhil on Wednesday October 12 2022, @07:21PM (1 child)
So if lg(m) is 100, then you're talking 10^30 digit numbers. We don't have the capacity to even store such numbers, let alone manipulate them, yet.
So, nah, 100m is a loss compared to m.lg(m).
Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
(Score: 2) by bzipitidoo on Wednesday October 12 2022, @08:56PM
True enough, I went overboard, 100 is too big a constant. But 20 is likely worthwhile.
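Roughly where those crossovers sit, assuming an FFT-style m·lg(m) multiplication cost and unit-cost digit additions (my arithmetic, not anyone's benchmark): trading one m-digit multiplication for c extra additions only pays once lg(m) > c.

import math

for c in (100, 20):
    m = 2 ** c    # smallest digit count where m*lg(m) exceeds c*m
    print(f"c = {c}: crossover at about 10^{round(math.log10(m))}-digit numbers")
# c = 100 -> ~10^30-digit numbers (hopeless); c = 20 -> ~10^6 digits (feasible)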
(Score: 2) by FatPhil on Wednesday October 12 2022, @07:34PM (1 child)
Which has 220 lines of data, but only these three lines of relevance:
(3, 4, 5): 47, # Discovered by AlphaTensor.
(4, 4, 5): 63, # Discovered by AlphaTensor.
(4, 5, 5): 76, # Discovered by AlphaTensor.
So it's a minor improvement, even theoretically, to a tiny subset of the problem space. However, still an interesting curiosity.
I'm surprised that these haven't been solved by exhaustive search. There are only 2^12 and 2^20 sums, so only 2^32 products. Surely that's a smaller linear relation problem than some NFS factorisations? Or is it a finite cover problem? Or a lattice reduction problem (cf. Noam Elkies' sets anagram solution to what is a finite cover problem)? Or some other problem type - there's always more than one way to approach these things?
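On the "what kind of problem is it" question: as I read the paper, each of those algorithms is a rank-R decomposition of the matrix-multiplication tensor, i.e. factor matrices U, V, W such that R bilinear products reconstruct A·B exactly. A minimal sketch of that formulation (mine; the helper names are made up), checked with the trivial rank-60 decomposition for the (3,4,5) case where AlphaTensor reports 47:

import numpy as np

def apply_bilinear(U, V, W, A, B):
    # multiply A (n x m) by B (m x p) using R = U.shape[1] scalar products
    a, b = A.ravel(), B.ravel()
    products = (U.T @ a) * (V.T @ b)          # R scalar multiplications
    return (W @ products).reshape(A.shape[0], B.shape[1])

def standard_factors(n, m, p):
    # rank n*m*p decomposition corresponding to the classroom algorithm
    R = n * m * p
    U, V, W = np.zeros((n * m, R)), np.zeros((m * p, R)), np.zeros((n * p, R))
    r = 0
    for i in range(n):
        for j in range(p):
            for k in range(m):
                U[i * m + k, r] = 1
                V[k * p + j, r] = 1
                W[i * p + j, r] = 1
                r += 1
    return U, V, W

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((4, 5))
U, V, W = standard_factors(3, 4, 5)           # 60 products, the trivial baseline
assert np.allclose(apply_bilinear(U, V, W, A, B), A @ B)

Finding a valid set of factors with fewer columns (47 here, 76 for (4,5,5)) is the search being automated; verifying a discovered algorithm is cheap even if finding it isn't.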
Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
(Score: 2) by FatPhil on Wednesday October 12 2022, @07:46PM
Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
(Score: 3, Funny) by FatPhil on Wednesday October 12 2022, @07:37PM
I think at least you will be smart enough to snigger at these two lines from the SOTA file I linked to above:
a, b, c = sorted([a, b, c])
...
return int(math.ceil((3 * b * c + max(b, c)) / 2))
Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
(Score: 2) by Snotnose on Wednesday October 12 2022, @04:01AM (2 children)
Disclaimer: I've been sick for a couple days and cough syrup is my main nutritional source.
If this is what I first read about a few days ago, the AI-discovered algorithm won't work well with modern CPU architectures because it bounces between cache and main memory too much.
Which is fine; how often do you think your phone needs to do linear algebra where this makes a difference? Not very often, would be my guess.
On the other hand, there are folks out there where the efficiency gains are well worth the R&D needed to bake these results into silicon.
I just passed a drug test. My dealer has some explaining to do.
(Score: 2) by mhajicek on Wednesday October 12 2022, @04:34AM
I'm sure that this will be coming to a GPU near you in a few release generations.
The spacelike surfaces of time foliations can have a cusp at the surface of discontinuity. - P. Hajicek
(Score: 1, Informative) by Anonymous Coward on Wednesday October 12 2022, @04:40AM
(Score: 1, Touché) by Anonymous Coward on Wednesday October 12 2022, @08:25AM (1 child)
AI making itself faster? Bad sign.
(Score: 1) by khallow on Thursday October 13 2022, @03:21PM