
SoylentNews is people

posted by martyb on Monday January 27 2020, @02:29PM

Tool predicts how fast code will run on a chip:

[...] In [a] series of conference papers, the researchers describe a novel machine-learning pipeline that automates this process, making it easier, faster, and more accurate. In a paper presented at the International Conference on Machine Learning in June, the researchers presented Ithemal, a neural-network model that trains on labeled data in the form of “basic blocks” — fundamental snippets of computing instructions — to automatically predict how long it takes a given chip to execute previously unseen basic blocks. Results suggest Ithemal performs far more accurately than traditional hand-tuned models.

Then, at the November IEEE International Symposium on Workload Characterization, the researchers presented a benchmark suite of basic blocks from a variety of domains, including machine learning, compilers, cryptography, and graphics, that can be used to validate performance models. They pooled more than 300,000 of the profiled blocks into an open-source dataset called BHive. During their evaluations, Ithemal predicted how fast Intel chips would run code even better than a performance model built by Intel itself.

Ultimately, developers and compilers can use the tool to generate code that runs faster and more efficiently on an ever-growing number of diverse and “black box” chip designs. “Modern computer processors are opaque, horrendously complicated, and difficult to understand. It is also incredibly challenging to write computer code that executes as fast as possible for these processors,” says co-author on all three papers Michael Carbin, an assistant professor in the Department of Electrical Engineering and Computer Science (EECS) and a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL). “This tool is a big step forward toward fully modeling the performance of these chips for improved efficiency.”

Most recently, in a paper presented at the NeurIPS conference in December, the team proposed a new technique to automatically generate compiler optimizations.  Specifically, they automatically generate an algorithm, called Vemal, that converts certain code into vectors, which can be used for parallel computing. Vemal outperforms hand-crafted vectorization algorithms used in the LLVM compiler — a popular compiler used in the industry.

[...] “Intel’s documents are neither error-free nor complete, and Intel will omit certain things, because it’s proprietary,” says co-author on all three papers Charith Mendis, a graduate student in EECS and CSAIL. “However, when you use data, you don’t need to know the documentation. If there’s something hidden you can learn it directly from the data.”

[...] In training, the Ithemal model analyzes millions of automatically profiled basic blocks to learn exactly how different chip architectures will execute computation. Importantly, Ithemal takes raw text as input and does not require manually adding features to the input data. In testing, Ithemal can be fed previously unseen basic blocks and a given chip, and will generate a single number indicating how fast the chip will execute that code.
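
As a rough illustration of the "raw text as input" point: a learned model like this consumes the assembly of a basic block as a plain token sequence, with no hand-engineered features. A minimal sketch of that kind of preprocessing (the tokenizer below is hypothetical, not the one from the paper):

```python
def tokenize_block(block: str) -> list[str]:
    """Split a basic block of x86-like assembly into a flat token
    sequence (opcode and operand tokens) -- the sort of raw-text input
    a learned throughput model could consume."""
    tokens = []
    for line in block.strip().splitlines():
        line = line.split(";")[0].strip()   # drop trailing comments
        if not line:
            continue
        opcode, _, operands = line.partition(" ")
        tokens.append(opcode.lower())
        for op in operands.split(","):
            op = op.strip()
            if op:
                tokens.append(op.lower())
        tokens.append("<end>")              # mark instruction boundary
    return tokens

block = """
    mov rax, rbx
    add rax, 8     ; pointer bump
    imul rcx, rax
"""
print(tokenize_block(block))
# ['mov', 'rax', 'rbx', '<end>', 'add', 'rax', '8', '<end>', 'imul', 'rcx', 'rax', '<end>']
```

A sequence model trained on millions of (token sequence, measured cycles) pairs can then regress throughput directly, with no per-architecture feature engineering.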

The researchers found Ithemal cut error rates — meaning the difference between predicted and real-world speed — by 50 percent compared with traditional hand-crafted models. Further, in their next paper, they showed that Ithemal's error rate was 10 percent, while the Intel performance-prediction model's error rate was 20 percent, on a variety of basic blocks across multiple domains.
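
The "error rate" here is a relative gap between predicted and measured speed. A sketch of how such a metric can be computed (the exact formula the papers use may differ, and the numbers below are illustrative, not from the evaluation):

```python
def mean_error_rate(predicted, measured):
    """Average relative error between predicted and measured cycle
    counts -- the kind of metric behind the '10 percent vs 20 percent'
    comparison (formula assumed for illustration)."""
    pairs = list(zip(predicted, measured))
    return sum(abs(p - m) / m for p, m in pairs) / len(pairs)

# Hypothetical throughputs (in cycles) for four basic blocks:
measured  = [100.0, 50.0, 200.0, 80.0]
predicted = [110.0, 55.0, 220.0, 88.0]
print(f"{mean_error_rate(predicted, measured):.1%}")  # prints "10.0%"
```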

Articles:

Charith Mendis, Alex Renda, Saman Amarasinghe, Michael Carbin. Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks. http://proceedings.mlr.press/v97/mendis19a/mendis19a.pdf

Yishen Chen, Ajay Brahmakshatriya, Charith Mendis, Alex Renda, Eric Atkinson, Ondrej Sykora, Saman Amarasinghe, and Michael Carbin. BHive: A Benchmark Suite and Measurement Framework for Validating x86-64 Basic Block Performance Models. http://groups.csail.mit.edu/commit/papers/19/ithemal-measurement.pdf

Charith Mendis, Cambridge Yang, Yewen Pu, Saman Amarasinghe, Michael Carbin. Compiler Auto-Vectorization with Imitation Learning. http://papers.nips.cc/paper/9604-compiler-auto-vectorization-with-imitation-learning.pdf


Original Submission

This discussion has been archived. No new comments can be posted.
  • (Score: 3, Insightful) by ikanreed on Monday January 27 2020, @02:50PM (8 children)

    by ikanreed (3164) Subscriber Badge on Monday January 27 2020, @02:50PM (#949334) Journal

    I'm pretty sure that one Alan Turing proved this was literally impossible almost a century ago.

    There's not one mention of the halting problem in the summary or any of the papers. That seems like a bit of an oversight.

    If it's purely heuristic, that's fine, but at least make a nod to the impossibility of certainty.

    • (Score: 1) by shrewdsheep on Monday January 27 2020, @02:57PM

      by shrewdsheep (5215) on Monday January 27 2020, @02:57PM (#949336)

      You are correct about the general problem of predicting the running time of arbitrary code. They are talking about "basic blocks" though. TL;DR: this probably means code without branches (I know the terminology "straight-line code" for this), for which prediction used to be very simple in the days of the M68k (look up the cycle count per instruction in your table) but is now difficult (caches, branch prediction, microcode, etc.).
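
The table-lookup method described above can be sketched in a few lines (the cycle counts below are made up for illustration; real 68k timings also depend on addressing modes):

```python
# Illustrative per-opcode cycle table (numbers invented for the sketch;
# real 68000 timings vary with addressing mode and operand size).
CYCLES = {"move": 4, "add": 8, "muls": 70, "lsl": 10}

def predict_block(block):
    """Old-school prediction for a branch-free basic block on an
    in-order CPU: sum the table entry for each instruction's opcode."""
    total = 0
    for line in block.strip().splitlines():
        opcode = line.split()[0].lower().split(".")[0]  # "move.l" -> "move"
        total += CYCLES[opcode]
    return total

block = """
    move.l d0,d1
    add.l  d2,d1
    muls   d3,d1
"""
print(predict_block(block))  # 4 + 8 + 70 = 82 cycles
```

On a modern out-of-order core no such table exists, which is exactly why a learned model is attractive.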

    • (Score: 2) by The Mighty Buzzard on Monday January 27 2020, @03:00PM

      by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Monday January 27 2020, @03:00PM (#949338) Homepage Journal

      Sounds like they're after "better than human" not perfection. Also, there's this [arxiv.org].

      Me, I'm more interested in whether the optimizations result in the code not doing exactly what you told it to, causing bugs that are damned hard to troubleshoot.

      --
      My rights don't end where your fear begins.
    • (Score: 2) by JoeMerchant on Monday January 27 2020, @05:33PM

      by JoeMerchant (3937) on Monday January 27 2020, @05:33PM (#949414)

      There's impossible, then there's practical.

      It's impossible to optimally solve the Travelling Salesman problem in polynomial time (assuming P ≠ NP), but it is possible to come up with pretty good approximations of the optimal solution reliably.
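
For instance, the classic nearest-neighbour heuristic gives such a "pretty good" tour in polynomial time (a minimal sketch; real solvers use much stronger heuristics like Lin-Kernighan):

```python
import math

def nearest_neighbour_tour(points):
    """Greedy TSP approximation: from each city, visit the nearest
    unvisited one. Not optimal, but fast and usually decent --
    a reliable 'pretty good' approximation."""
    unvisited = set(range(1, len(points)))
    tour = [0]                                   # start at city 0
    while unvisited:
        last = points[tour[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, points[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

cities = [(0, 0), (5, 5), (1, 0), (0, 1), (5, 6)]
print(nearest_neighbour_tour(cities))
```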

      --
      🌻🌻 [google.com]
    • (Score: 2) by DeathMonkey on Monday January 27 2020, @06:13PM (4 children)

      by DeathMonkey (1380) on Monday January 27 2020, @06:13PM (#949443) Journal

      A Turing Machine has infinite memory. Modern computers don't.

      • (Score: 2) by ikanreed on Monday January 27 2020, @07:29PM (3 children)

        by ikanreed (3164) Subscriber Badge on Monday January 27 2020, @07:29PM (#949493) Journal

        That's up there with "technically all computers halt when the inevitable decay of universal entropy sets in".

        • (Score: 2) by DeathMonkey on Monday January 27 2020, @08:02PM (2 children)

          by DeathMonkey (1380) on Monday January 27 2020, @08:02PM (#949514) Journal

          The halting problem is solvable on Finite State Machines. [semanticscholar.org]

          A processor with finite memory is more like a Finite State Machine. It requires the infinite memory of the Turing Machine to make the halting problem unsolvable.
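
The reason halting is decidable for a finite-state system is the pigeonhole principle: a run longer than the number of states must revisit some state, and a revisited state means it loops forever. A toy sketch of that argument (the state-map encoding is hypothetical):

```python
def halts(transition, start):
    """Decide halting for a finite-state system. `transition` maps a
    state to its successor; a state with no entry halts. Because states
    are finite, the run either falls off the map (halts) or, by the
    pigeonhole principle, revisits a state (loops forever)."""
    seen = set()
    state = start
    while state in transition:
        if state in seen:
            return False             # revisited a state: runs forever
        seen.add(state)
        state = transition[state]
    return True                      # no rule for this state: halts

# Two toy "programs" over a handful of states:
halting = {0: 1, 1: 2}               # 0 -> 1 -> 2 -> halt
looping = {0: 1, 1: 2, 2: 0}         # 0 -> 1 -> 2 -> 0 -> ... forever
print(halts(halting, 0), halts(looping, 0))  # prints "True False"
```

The catch, as the rest of the thread notes, is that "states" for a real machine means every possible memory configuration, so `seen` is astronomically large.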

          • (Score: 2) by ikanreed on Monday January 27 2020, @08:17PM (1 child)

            by ikanreed (3164) Subscriber Badge on Monday January 27 2020, @08:17PM (#949529) Journal

             Inasmuch as a sufficiently large finite state machine can adequately model it, I suppose, but for a typical low-end modern computer with 4GB of RAM (no hard disk), that would be 2^1073741824 distinct "finite" states. That's a little on the big side. I tried to render the number with python to exaggerate my point before I realized we're talking about a 3 followed by 300 million zeroes and it would probably trip the lameness filter.

            • (Score: 2) by ikanreed on Monday January 27 2020, @08:19PM

              by ikanreed (3164) Subscriber Badge on Monday January 27 2020, @08:19PM (#949531) Journal

              sorry, screwed up

               256^1073741824, not 2^1073741824. Also 3 billion digits, not 300 million. I'm very sorry for this math error.
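
For what it's worth, the digit counts can be checked with logarithms instead of materializing the numbers: base^exp has floor(exp * log10(base)) + 1 decimal digits.

```python
import math

def digits(base, exponent):
    """Decimal digit count of base**exponent without computing it,
    via digits = floor(exponent * log10(base)) + 1."""
    return math.floor(exponent * math.log10(base)) + 1

print(digits(2, 2**30))    # ~323 million digits
print(digits(256, 2**30))  # ~2.6 billion digits
```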

  • (Score: 2) by jmichaelhudsondotnet on Monday January 27 2020, @03:09PM (9 children)

    by jmichaelhudsondotnet (8122) on Monday January 27 2020, @03:09PM (#949346) Journal

    “Modern computer processors are opaque, horrendously complicated, and difficult to understand. It is also incredibly challenging to write computer code that executes as fast as possible for these processors,”

    You don't say.

    I propose we change this first, then readdress this problem.

    The personal computer does not need a labyrinthine black box of trickery to function. If these companies will not build what we need, then we will look elsewhere; it is not as if the idea of a processor can be copyrighted.

    If you think about the energy wasted by these extra labyrinthine functions, which are also now known to be spyware, in every computer, it is mind-boggling. You get this top-of-the-line processor and then use it at 75% efficiency because no one understands the features.

    Then things like planned obsolescence or the iPhone's Christmas slowdown feature, ugh.

    thesesystemsarefailing.net

    • (Score: 3, Interesting) by SemperOSS on Monday January 27 2020, @03:59PM (3 children)

      by SemperOSS (5072) on Monday January 27 2020, @03:59PM (#949368)

      While we may all hanker for the Good Old Days®, the now wants faster and faster computers to compensate for the lost art of Code Optimization™, which means super-scalar, vector-optimized, out-of-order executing, branch predicting processors that scrunch every cycle to get the most out of our oversized systems based on Windows, MacOS and systemd (currently Linux-flavoured only).

      As much as I agree that simpler, easier-to-understand processors would be a boon to the general psychological well-being of programmers galore, they are probably not going to be faster than the current super-complex behemoths. A simplification of the instruction set is not going to happen, even with possible AMD64 emulation (think Transmeta here), as too much has already been invested in the current instruction set. Not even Intel themselves succeeded in eliminating the x86 architecture, as the thunderous flop of IA-64 shows.

      The nearest we come to efficiency optimizations is the big.LITTLE architecture of some phone SoCs, where non-demanding functions are handed off to the not-quite-so-power-hungry little processor, leaving the big one for things like emoji texting or some such. ;-)

      But maybe smartphones may rid us of the Intel/AMD duopoly by eliminating desktop/laptop computing as we know it.


      --
      I don't need a signature to draw attention to myself.
      Maybe I should add a sarcasm warning now and again?
      • (Score: 1, Insightful) by Anonymous Coward on Monday January 27 2020, @04:19PM (1 child)

        by Anonymous Coward on Monday January 27 2020, @04:19PM (#949376)

        "maybe smartphones may rid us of the Intel/AMD duopoly by eliminating desktop/laptop computing"

        You do realize how fucked we all would be if being permanently tethered to Big Brother happens.

        • (Score: 2) by jmichaelhudsondotnet on Tuesday January 28 2020, @02:06PM

          by jmichaelhudsondotnet (8122) on Tuesday January 28 2020, @02:06PM (#950057) Journal

          IF there is closed source software, blobs, closed spec chips, etc etc.

          Then yes.

          A true open system might work, but could you trust the airwaves and wires too? Then again, in that universe every color of suit on the Star Trek Enterprise knows how to disassemble, audit and then reassemble their communicator.

          Which, if you think about it, should be the goal for humanity if we perceive ourselves as a race versus, say, the Klingons and Romulans.

          But if you are, idk, some hypernationalist religious fanatic, it suits you just fine to herd all of the human sheeple with your treacherous tech stack, even if it results in the long-term degradation of everything you don't care about.

      • (Score: 2) by takyon on Monday January 27 2020, @04:22PM

        by takyon (881) <takyonNO@SPAMsoylentnews.org> on Monday January 27 2020, @04:22PM (#949380) Journal
        --
        [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
    • (Score: 3, Informative) by The Mighty Buzzard on Monday January 27 2020, @04:29PM (3 children)

      by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Monday January 27 2020, @04:29PM (#949383) Homepage Journal

      If you think about the energy wasted by these extra labyrinthine functions, which are also now known to be spyware, in every computer, it is mind-boggling.

      Speak for yourself. I still use a Phenom II x6 1090t. And I will continue to until I find a chip that I can disable the rootkit on that's reasonably priced and runs better than what I have.

      --
      My rights don't end where your fear begins.
      • (Score: 3, Interesting) by JoeMerchant on Monday January 27 2020, @05:36PM

        by JoeMerchant (3937) on Monday January 27 2020, @05:36PM (#949416)

        I preferred the Motorola 6800 series; you could literally hand-trace each assembly instruction from power-on all the way through your app, and they could do some pretty heavy-duty stuff - for instance, Caterpillar heavy equipment used them as control systems at least up through the 2000s.

        --
        🌻🌻 [google.com]
      • (Score: 2) by KilroySmith on Monday January 27 2020, @08:09PM (1 child)

        by KilroySmith (2113) on Monday January 27 2020, @08:09PM (#949522)

        I thought I was the only one left - Phenom II X4 805 for me.
        Even does a decent job of playing DOOM 2016 when I swap in a good graphics card.

    • (Score: 2) by JoeMerchant on Monday January 27 2020, @05:31PM

      by JoeMerchant (3937) on Monday January 27 2020, @05:31PM (#949412)

      I propose we change this first

      Sure, while we're at it, let's also fix politics, medicine, corruption in the legal system, and the perpetuation of war around the globe by the military industrial complex.

      Meanwhile, baby steps are at least possible, and even if we're optimizing a buggy whip - it's a buggy whip that's going to have trillions of dollars per year of economic impact for the foreseeable future (meaning: at least 5 years).

      Like my physics prof said: electric power transformers aren't exciting or particularly mysterious, but if you can figure out a tweak that improves their efficiency by just a fraction of a percent, that tweak can have tremendous impact due to their widespread heavy use.

      --
      🌻🌻 [google.com]
  • (Score: 2) by JoeMerchant on Monday January 27 2020, @03:34PM

    by JoeMerchant (3937) on Monday January 27 2020, @03:34PM (#949356)

    There's a Master's thesis from 1990, typed in WordPerfect with graphics hand-pasted in between the text, printed on the required cotton-bond paper and bound - hopefully still collecting dust in the University of Miami library - which describes a parallel real-time system for music synthesis that does basically this (as described in the OP). The "synthesis blocks" are described as generic computational algorithms which get their inputs across a communication matrix, take some integer number of cycles to compute their result(s), and these results become outputs communicated across the communication matrix to the next stage of computational nodes.

    Purdue University was working on similar concepts at the time which they dubbed "systolic computing" and published fairly widely. It's remarkable how much of this material has not been carried forward into the searchable web. I see that some of the earliest work came out of Harvard in the early 1980s... https://en.wikipedia.org/wiki/Systolic_array#cite_note-1 [wikipedia.org]

    --
    🌻🌻 [google.com]
  • (Score: 2) by bzipitidoo on Monday January 27 2020, @05:02PM (1 child)

    by bzipitidoo (4388) on Monday January 27 2020, @05:02PM (#949397) Journal

    Using a neural net to tune the code for a Von Neumann architecture strikes me as similar to using electricity, and computers, to make combustion engines run better. Or, maybe like using genetic engineering to breed faster horses. Yeah, it works. But a combustion engine is vastly inferior to an electric motor. For a lot of tasks, a neural net may be better than a Von Neumann machine tuned by a neural net.

    Still, it's pretty cool. I often wonder how a medieval knight with armor and weapons forged using modern knowledge and technology would fare in battle. Titanium plate armor and a sword ought to give the knight a pretty big advantage against opposing knights using standard medieval tech. Moot, though, as much of the same tech makes a joke of fighting on horseback with swords, lances, and suchlike weapons.

    • (Score: 2) by JoeMerchant on Monday January 27 2020, @05:26PM

      by JoeMerchant (3937) on Monday January 27 2020, @05:26PM (#949408)

      If you're in the world of top fuel dragsters, computer controlled fuel injection and spark ignition not only follow the rules, but also work with the highly developed widely available high output engines.

      Of course you can build a super-capacitor + electric motor based competitor, it has been done, but it's apples and oranges, and at least for a while the world is going to have both.

      --
      🌻🌻 [google.com]
  • (Score: 4, Informative) by Rich on Monday January 27 2020, @11:09PM (2 children)

    by Rich (945) on Monday January 27 2020, @11:09PM (#949649) Journal

    The modern "wide & deep" monster CPUs (like 16 pipeline stages, 4 execution units) more or less remove the need for keyhole optimization, so that's not an area where I would throw the tools at. This is one of a few things I learned a decade ago when I wrote a MIPS-like-CPU-to-x86 dynamic recompiler for simulation of a complex robotic device with about 16 of these CPUs. I wanted to achieve better-than-realtime on a single off-the-shelf PC, so I had benchmarks for an estimate. While basic algorithmic changes showed up in the benchmarks, fine tuning hardly made a difference. The CPU fills all its pipeline slots as good as the fundamental algorithmic hazards allow and doesn't really care about a bit of low-level inefficiency here or there. Going to off-chip data is what slows you down (*). The IPC gains of newer generations (*lake, *zen) come from optimizing that further, so cycle counting (like doing the 32usec nibble loop for the original Woz machine) doesn't really apply anymore once you go significantly past an AVR.

    Where the described tools might come in handy is an estimate whether your new algorithm for widespread distribution (used to be video codecs, probably some neural net stuff today) will perform to requirements on a sufficient percentage of the target machinery, which is very diverse with general use PCs.

    (*) this is why the C++ policy of "zero overhead" can easily make things worse with templates. Unless the duplicated instantiations are really small, it's faster to have flexible code that does a few overheady address computations (which usually get parallelized away by the CPU anyway) than to have many times the code perfectly matched to the data, but thrashing the cache.

    • (Score: 2, Informative) by YeaWhatevs on Tuesday January 28 2020, @12:51PM (1 child)

      by YeaWhatevs (5623) on Tuesday January 28 2020, @12:51PM (#950036)

      I have found the same thing. I recently rewrote some C++ code as C after I had it working, in order to consolidate template code. It was highly annoying to rewrite and debug, and I wouldn't recommend it, but in the end it cut the executable size by about 2/3 and left the time spent executing approximately the same.

      • (Score: 2) by Rich on Wednesday January 29 2020, @12:16PM

        by Rich (945) on Wednesday January 29 2020, @12:16PM (#950586) Journal

        What you might do is have a "C-with-classes" (or "EC++") style base layer that does everything with "void*", and put a thin template over it that generates no code, or only very little, for type safety. I did that a quarter century ago, when the promised "the linker will fold identical template code (i.e. for typed pointers)" was still about 15 years away. Not sure how the technique might interact with the abominations that increasingly "modern" C++ has brought, though.
