posted by hubie on Wednesday October 12 2022, @12:24AM

DeepMind unveils first AI to discover faster matrix multiplication algorithms:

Can artificial intelligence (AI) create its own algorithms to speed up matrix multiplication, one of machine learning’s most fundamental tasks? Today, in a paper published in Nature, DeepMind unveiled AlphaTensor, the “first artificial intelligence system for discovering novel, efficient and provably correct algorithms.” The Google-owned lab said the research “sheds light” on a 50-year-old open question in mathematics about finding the fastest way to multiply two matrices.

Ever since the Strassen algorithm was published in 1969, computer science has been on a quest to surpass its speed of multiplying two matrices. While matrix multiplication is one of algebra’s simplest operations, taught in high school math, it is also one of the most fundamental computational tasks and, as it turns out, one of the core mathematical operations in today’s neural networks.

[...] This research delves into how AI could be used to improve computer science itself, said Pushmeet Kohli, head of AI for science at DeepMind, at a press briefing.

“If we’re able to use AI to find new algorithms for fundamental computational tasks, this has enormous potential because we might be able to go beyond the algorithms that are currently used, which could lead to improved efficiency,” he said.

This is a particularly challenging task, he explained, because the process of discovering new algorithms is so difficult, and automating algorithmic discovery using AI requires a long and difficult reasoning process — from forming intuition about the algorithmic problem to actually writing a novel algorithm and proving that the algorithm is correct on specific instances.

“This is a difficult set of steps and AI has not been very good at that so far,” he said.

[...] According to DeepMind, AlphaTensor discovered algorithms that are more efficient than the state of the art for many matrix sizes and outperform human-designed ones.

AlphaTensor begins without any knowledge about the problem, Kohli explained, and then gradually learns what is happening and improves over time. “It first finds this classroom algorithm that we were taught, and then it finds historical algorithms such as Strassen’s and then at some point, it surpasses them and discovers completely new algorithms that are faster than previously.”

Kohli said he hopes that this paper inspires others to use AI to guide algorithmic discovery for other fundamental computational tasks. “We think this is a major step in our path towards really using AI for algorithmic discovery,” he said.

What other new algorithms will be discovered? I wonder when they will attempt to apply this to factorization?


Original Submission

This discussion was created by hubie (1068) for logged-in users only, but now has been archived. No new comments can be posted.
  • (Score: 5, Informative) by khallow on Wednesday October 12 2022, @12:56AM (2 children)

    by khallow (3766) Subscriber Badge on Wednesday October 12 2022, @12:56AM (#1276154) Journal
    Ordinary matrix multiplication on square matrices takes N^3 multiplications, where N is the number of rows. The Strassen approach [wikipedia.org] splits each matrix in half along both rows and columns (there are ways to deal with odd numbers of rows/columns, and the pair of matrices being multiplied need not be square for this to be exploitable) and multiplies the resulting 2 by 2 block matrices in a way that requires only 7 block multiplications (plus a handful of block additions, each costing on the order of N^2 operations, which is insignificant for a large enough matrix). The blocks can be bisected recursively in the same way until you reach the size below which the trick isn't worth the bother.

    The result is that the N^3 multiplications are reduced to roughly N^(log base 2 of 7) ≈ N^2.807. Mathematicians have reduced this exponent further with some weird group-theoretic approaches (such as here [siam.org]), but presently those work only on ludicrously large matrices.

    The theoretical limit for normal matrix multiplication, using these tricks, is thought to be N^2, but at present, the closer you get to that exponent, the larger your matrices need to be before you see any computational savings. For most purposes, it hasn't been worthwhile to use anything more sophisticated than the Strassen algorithm.
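    For the curious, here is a minimal Python/NumPy sketch of the recursion described above, assuming square matrices whose side is a power of two; real implementations handle padding and tune the crossover size:

        import numpy as np

        CROSSOVER = 64  # below this size the plain algorithm is cheaper than the recursion

        def strassen(A, B):
            """Multiply square matrices whose side is a power of two (illustration only)."""
            n = A.shape[0]
            if n <= CROSSOVER:
                return A @ B
            h = n // 2
            A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
            B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
            # Strassen's seven block products
            M1 = strassen(A11 + A22, B11 + B22)
            M2 = strassen(A21 + A22, B11)
            M3 = strassen(A11, B12 - B22)
            M4 = strassen(A22, B21 - B11)
            M5 = strassen(A11 + A12, B22)
            M6 = strassen(A21 - A11, B11 + B12)
            M7 = strassen(A12 - A22, B21 + B22)
            # Recombine into the four quadrants of the result
            C = np.empty_like(A)
            C[:h, :h] = M1 + M4 - M5 + M7
            C[:h, h:] = M3 + M5
            C[h:, :h] = M2 + M4
            C[h:, h:] = M1 - M2 + M3 + M6
            return C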
    • (Score: 5, Interesting) by nostyle on Wednesday October 12 2022, @04:21AM (1 child)

      by nostyle (11497) on Wednesday October 12 2022, @04:21AM (#1276188) Journal

      Toeplitz matrices [wikipedia.org] need only O(N^2) operations to invert.

      Long ago (late eighties) I had to code a matrix solver to compute the inductance of various configurations of conductors. I discovered that, excepting the first two rows and columns, the equations fell naturally into a Toeplitz configuration. By solving the bulk of the matrix (typically 300 rows on a PC of that era) with a Levinson algorithm, I got a total computation time that scaled very nearly as N^2.
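      For a modern illustration of the same idea (not the original code, and with made-up matrix values), SciPy ships a Levinson-based Toeplitz solver:

          import numpy as np
          from scipy.linalg import solve_toeplitz, toeplitz

          n = 300                      # roughly the problem size mentioned above
          col = 0.5 ** np.arange(n)    # made-up symmetric Toeplitz: first column...
          col[0] = 3.0                 # ...with a dominant diagonal so the system is well-posed
          b = np.ones(n)

          x_fast = solve_toeplitz(col, b)                  # Levinson recursion, O(N^2) work
          x_ref = np.linalg.solve(toeplitz(col, col), b)   # dense LU solve, O(N^3) work
          assert np.allclose(x_fast, x_ref)

      The Levinson solve exploits the constant-diagonal structure, so only the defining first row/column needs to be stored rather than the full N×N matrix.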

      --
      "Just machines to make big decisions programmed by fellas who can cash in in vision", -Donald Fagen, I.G.Y. // what I hear there

      • (Score: 1) by khallow on Wednesday October 12 2022, @04:49AM

        by khallow (3766) Subscriber Badge on Wednesday October 12 2022, @04:49AM (#1276196) Journal

        Toeplitz matrices need only O(N^2) operations to invert.

        Probably easy to multiply too - though my first lazy attempts ended up being order N^3.

        Nice approach. It is interesting how so many of these things turn out to be a clean problem plus some weirdness at the edges.

  • (Score: 5, Insightful) by stormwyrm on Wednesday October 12 2022, @12:56AM (9 children)

    by stormwyrm (717) on Wednesday October 12 2022, @12:56AM (#1276155) Journal

    Here is the paper [nature.com] (doi:10.1038/s41586-022-05172-4), and the blog post from the researchers [deepmind.com]. TFA sounds more like a PR puff piece.

    I am a bit sceptical that this actually represents some kind of practical performance improvement. Strassen's algorithm for 2×2 matrices requires seven multiplications as opposed to the eight required by the standard O(n³) algorithm, but where the standard algorithm requires four additions, Strassen's uses 18. I don't think a single multiplication costs 14 more cycles than an addition on modern silicon, and I don't think that was true even when Strassen discovered his algorithm. For multiplying a 4×5 matrix by a 5×5 matrix, the standard algorithm does 100 multiplications and the previous best (Strassen-derived) scheme does 80, but the new technique DeepMind used has discovered one with 76 multiplications. Now how many additions will it take, and how much extra storage? Will saving 24 multiplications be worth all the extra additions and memory and cache overhead of the new technique?
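    For a sense of scale, the multiplication saving compounds when such a fixed-size scheme is applied recursively to block matrices; a toy count of scalar multiplications only, deliberately ignoring the additions and memory traffic that are the whole point of the objection above:

        # Scalar multiplications when a fixed (4,5,5) scheme is applied recursively k times,
        # i.e. to a (4^k x 5^k) matrix times a (5^k x 5^k) matrix.
        for k in range(1, 6):
            naive = 100 ** k         # 4*5*5 multiplications per level
            previous = 80 ** k       # previous best rank for (4,5,5)
            alphatensor = 76 ** k    # AlphaTensor's rank-76 decomposition
            print(f"k={k}: naive={naive} previous={previous} alphatensor={alphatensor} "
                  f"saving vs naive: {1 - alphatensor / naive:.0%}")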

    Just like Strassen's algorithm, it seems like no more than a theoretical curiosity, and it seems doubtful it can deliver any actual performance gains. Still, it's intriguing that they were able to extend AI's domain into mathematics like this.

    --
    Numquam ponenda est pluralitas sine necessitate.
    • (Score: 2, Interesting) by khallow on Wednesday October 12 2022, @01:17AM

      by khallow (3766) Subscriber Badge on Wednesday October 12 2022, @01:17AM (#1276160) Journal

      Just like Strassen's algorithm, it seems like no more than a theoretical curiosity.

      Strassen's algorithm is a genuine speed improvement. Block matrix multiplication is more costly than block matrix addition, and the gap grows as the matrices get larger. The matrices just need to be large enough - IIRC, somewhere around 32 or more rows/columns in each dimension - to justify the overhead of the extra additions.

      I don't know about the present Google approach, but my understanding has been that as one pushes the theoretical number of block multiplications down, the rest of the overhead goes up. Some of these algorithms only see a performance advantage at absurd sizes, maybe millions of rows/columns. They probably also require considerable massaging to get the right number of rows/columns (like Strassen, which wants even numbers of rows/columns at each level).

    • (Score: 2) by bzipitidoo on Wednesday October 12 2022, @02:55AM (4 children)

      by bzipitidoo (4388) Subscriber Badge on Wednesday October 12 2022, @02:55AM (#1276177) Journal

      Yes, it will be worth any finite number of extra addition operations to avoid just 1 multiplication. You might have to scale up to very large numbers and large matrices to realize the savings. But it's there.

      The issue is that for m-digit numbers, multiplication takes at least on the order of m log m operations if m is large enough for the Fast Fourier Transform to be worth using; otherwise you're stuck with grade-school multiplication, which takes m^2 operations. Addition takes m operations. Even if you have to do 100 extra additions, 100m will be smaller than m log m for sufficiently large m.
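      A toy cost model of that trade-off, counting digit operations only (the 1000-digit cutoff between the schoolbook and FFT regimes is arbitrary, and all the constant factors that matter in practice are ignored):

          import math

          def mul_cost(m):
              # Rough cost of multiplying two m-digit numbers: schoolbook below an
              # arbitrary cutoff, FFT-style m*log2(m) above it.
              return m * m if m < 1000 else m * math.log2(m)

          def add_cost(m):
              return m

          # Does saving one multiplication pay for 100 extra additions at m digits?
          for m in (100, 500, 10_000, 2 ** 20, 2 ** 110):
              print(f"m = {m}: worth it = {mul_cost(m) > 100 * add_cost(m)}")

      In the schoolbook regime the trade can pay off at modest sizes, but once fast multiplication kicks in it needs log2(m) > 100, which is the objection raised below.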

      • (Score: 3, Informative) by stormwyrm on Wednesday October 12 2022, @03:55AM (1 child)

        by stormwyrm (717) on Wednesday October 12 2022, @03:55AM (#1276184) Journal

        Absolute hogwash when we're talking about the ordinary everyday systems that most often arise in practice, especially those using the integer and floating-point data types provided by modern hardware. Even moving data around is not free, and these other algorithms do far more of it than the basic one. Strassen's algorithm for instance is actually counterproductive for matrices that are less than around 1000 in size because of the algorithm's substantial overhead. Even for larger matrices the performance gains are marginal, around 10% at best over the basic matrix multiplication algorithm. It also incurs a cost in increased memory usage (meaning that on top of the eight megabytes each of your 1000×1000 double-precision floating-point matrices already consumes, you'll need several times more scratch space to multiply them), and, for floating-point implementations, decreased numerical stability. The Coppersmith–Winograd algorithm and this new one developed at DeepMind are likely even worse in these respects. There is a sizeable constant factor in the performance of these "better" algorithms that is hidden by big-O notation, and it makes them not worth using except in highly exceptional circumstances.
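        One way to see the overhead for yourself is a single level of Strassen on top of NumPy's BLAS-backed matmul; where, or whether, it wins depends entirely on your BLAS, CPU, and cache, so treat this as a measurement sketch rather than a recommendation:

            import time
            import numpy as np

            def strassen_one_level(A, B):
                # One recursion level of Strassen; the seven sub-products fall back to BLAS.
                h = A.shape[0] // 2
                A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
                B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
                M1 = (A11 + A22) @ (B11 + B22)
                M2 = (A21 + A22) @ B11
                M3 = A11 @ (B12 - B22)
                M4 = A22 @ (B21 - B11)
                M5 = (A11 + A12) @ B22
                M6 = (A21 - A11) @ (B11 + B12)
                M7 = (A12 - A22) @ (B21 + B22)
                return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                                 [M2 + M4, M1 - M2 + M3 + M6]])

            n = 2048
            A, B = np.random.rand(n, n), np.random.rand(n, n)
            for name, f in (("plain @", lambda: A @ B),
                            ("one-level Strassen", lambda: strassen_one_level(A, B))):
                t0 = time.perf_counter()
                f()
                print(name, round(time.perf_counter() - t0, 3), "s")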

        --
        Numquam ponenda est pluralitas sine necessitate.
        • (Score: 1) by khallow on Thursday October 13 2022, @03:22PM

          by khallow (3766) Subscriber Badge on Thursday October 13 2022, @03:22PM (#1276447) Journal

          Strassen's algorithm for instance is actually counterproductive for matrices that are less than around 1000 in size because of the algorithm's substantial overhead.

          You're also assuming the existence of fast vector operations, which a lot of CPUs have and which greatly speed up normal matrix multiplication. Where those are absent, Strassen's approach achieves its advantage at a much smaller matrix size.

      • (Score: 2) by FatPhil on Wednesday October 12 2022, @07:21PM (1 child)

        by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Wednesday October 12 2022, @07:21PM (#1276280) Homepage
        You do realise that 'm' is the *size* of the inputs.
        So if lg(m) is 100, then you're talking 10^30 digit numbers. We don't have the capacity to even store such numbers, let alone manipulate them, yet.
        So, nah, 100m is a loss compared to m.lg(m).
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
        • (Score: 2) by bzipitidoo on Wednesday October 12 2022, @08:56PM

          by bzipitidoo (4388) Subscriber Badge on Wednesday October 12 2022, @08:56PM (#1276304) Journal

          True enough, I went overboard, 100 is too big a constant. But 20 is likely worthwhile.

    • (Score: 2) by FatPhil on Wednesday October 12 2022, @07:34PM (1 child)

      by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Wednesday October 12 2022, @07:34PM (#1276284) Homepage
      Can confirm - sniffing around their git repo, I came across https://github.com/deepmind/alphatensor/blob/main/recombination/sota.py
      which has 220 lines of data, but only these three lines of relevance:
              (3, 4, 5): 47, # Discovered by AlphaTensor.
              (4, 4, 5): 63, # Discovered by AlphaTensor.
              (4, 5, 5): 76, # Discovered by AlphaTensor.
      So it's a minor improvement, even theoretically, to a tiny subset of the problem space. However, still an interesting curiosity.

      I'm surprised that these haven't been solved by exhaustive search. There are only 2^12 and 2^20 sums, so only 2^32 products. Surely that's a smaller linear relation problem than some NFS factorisations? Or is it a finite cover problem? Or a lattice reduction problem (cf. Noam Elkies' sets anagram solution to what is a finite cover problem)? Or some other problem type - there's always more than one way to approach these things.
      --
      Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
      • (Score: 2) by FatPhil on Wednesday October 12 2022, @07:46PM

        by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Wednesday October 12 2022, @07:46PM (#1276288) Homepage
        Ah, they searched much larger matrix sizes than just the m<=n<=p<=12 in that file, and found a bunch of improvements over the prior state-of-the-art algorithms.
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
    • (Score: 3, Funny) by FatPhil on Wednesday October 12 2022, @07:37PM

      by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Wednesday October 12 2022, @07:37PM (#1276286) Homepage
      Damn, I forgot about this!

      I think at least you will be smart enough to snigger at these two lines from the SOTA file I linked to above:

            a, b, c = sorted([a, b, c])
      ...
            return int(math.ceil((3 * b * c + max(b, c)) / 2))
      --
      Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
  • (Score: 2) by Snotnose on Wednesday October 12 2022, @04:01AM (2 children)

    by Snotnose (1623) on Wednesday October 12 2022, @04:01AM (#1276185)

    Disclaimer: I've been sick for a couple days and cough syrup is my main nutritional source.

    If this is what I first read about a few days ago, the AI algorithm won't work well with modern CPU architectures as it bounces between cache and main memory too much.

    Which is fine; how often do you think your phone needs to do linear algebra where this makes a difference? Not very often, would be my guess.

    On the other hand, there are folks out there where the efficiency gains are well worth the R&D needed to bake these results into silicon.

    --
    I just passed a drug test. My dealer has some explaining to do.
    • (Score: 2) by mhajicek on Wednesday October 12 2022, @04:34AM

      by mhajicek (51) Subscriber Badge on Wednesday October 12 2022, @04:34AM (#1276193)

      I'm sure that this will be coming to a GPU near you in a few release generations.

      --
      The spacelike surfaces of time foliations can have a cusp at the surface of discontinuity. - P. Hajicek
    • (Score: 1, Informative) by Anonymous Coward on Wednesday October 12 2022, @04:40AM

      by Anonymous Coward on Wednesday October 12 2022, @04:40AM (#1276195)
      How often does your phone do linear algebra? Quite a lot I imagine. Many of the algorithms for signal processing that a modern phone uses in its software-defined radios, audio and video codecs, and so forth use linear algebraic techniques on a fundamental level. 3D graphics uses them heavily obviously. Not that the algorithm they discovered will make possible any kind of meaningful performance gain in actual practice as the overhead is even worse than Strassen or Coppersmith–Winograd, which were the best matrix multiplication algorithms until this one. They'd only make a difference for a really large matrix, and the performance gains would be marginal at best. Still, it's an interesting theoretical result and perhaps a taste of things to come.
  • (Score: 1, Touché) by Anonymous Coward on Wednesday October 12 2022, @08:25AM (1 child)

    by Anonymous Coward on Wednesday October 12 2022, @08:25AM (#1276212)

    AI making itself faster? Bad sign.

    • (Score: 1) by khallow on Thursday October 13 2022, @03:21PM

      by khallow (3766) Subscriber Badge on Thursday October 13 2022, @03:21PM (#1276445) Journal
      How about we mix this up a bit? You could be the AI that you're making "faster". Would it be a bad sign then?