
posted by martyb on Monday January 27 2020, @02:29PM

Tool predicts how fast code will run on a chip:

[...] In [a] series of conference papers, the researchers describe a novel machine-learning pipeline that automates this process, making it easier, faster, and more accurate. In a paper presented at the International Conference on Machine Learning in June, the researchers presented Ithemal, a neural-network model that trains on labeled data in the form of “basic blocks” — fundamental snippets of computing instructions — to automatically predict how long it takes a given chip to execute previously unseen basic blocks. Results suggest Ithemal performs far more accurately than traditional hand-tuned models.
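
A "basic block" here is a straight-line run of instructions with a single entry and a single exit, i.e. no branches into or out of the middle. As a rough illustration (the exact instruction mix depends entirely on the compiler, flags, and target ISA), the loop body below compiles down to one small basic block, the kind of snippet a tool like Ithemal is trained to cost:

```cpp
// Illustrative only: the compiled loop body (load, add, increment,
// compare, conditional branch) forms one basic block; a throughput
// model predicts how many cycles such a block takes per iteration.
long sum_array(const long *a, long n) {
    long s = 0;
    for (long i = 0; i < n; ++i)
        s += a[i];
    return s;
}
```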

Then, at the November IEEE International Symposium on Workload Characterization, the researchers presented a benchmark suite of basic blocks from a variety of domains, including machine learning, compilers, cryptography, and graphics, that can be used to validate performance models. They pooled more than 300,000 of the profiled blocks into an open-source dataset called BHive. During their evaluations, Ithemal predicted how fast Intel chips would run code even better than a performance model built by Intel itself.
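
BHive's actual measurement methodology is described in the paper linked below; as a loose illustration of the underlying idea (execute a block many times and divide the elapsed cycle count by the iteration count), a naive cycle-counting harness might look like the following sketch. The __rdtsc()-based timing and the stand-in kernel are assumptions for illustration only, not BHive's harness:

```cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc()

// Stand-in for a profiled basic block: a few dependent ALU operations.
// (Hypothetical kernel, chosen only to give the timer something to measure.)
static long kernel(long x) {
    x = x * 3 + 1;
    x ^= (x >> 7);
    return x;
}

int main() {
    volatile long sink = 1;          // volatile keeps the loop from being optimized away
    const long iters = 10'000'000;
    uint64_t start = __rdtsc();
    for (long i = 0; i < iters; ++i)
        sink = kernel(sink);
    uint64_t cycles = __rdtsc() - start;
    printf("approx. cycles per iteration: %.2f\n", (double)cycles / iters);
}
```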

Ultimately, developers and compilers can use the tool to generate code that runs faster and more efficiently on an ever-growing number of diverse and “black box” chip designs. “Modern computer processors are opaque, horrendously complicated, and difficult to understand. It is also incredibly challenging to write computer code that executes as fast as possible for these processors,” says co-author on all three papers Michael Carbin, an assistant professor in the Department of Electrical Engineering and Computer Science (EECS) and a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL). “This tool is a big step forward toward fully modeling the performance of these chips for improved efficiency.”

Most recently, in a paper presented at the NeurIPS conference in December, the team proposed a new technique to automatically generate compiler optimizations. Specifically, they automatically generated an algorithm, called Vemal, that converts certain code into vectors, which can be used for parallel computing. Vemal outperforms the hand-crafted vectorization algorithms used in LLVM, a popular compiler in industry.
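
Vectorization means replacing element-at-a-time scalar operations with SIMD instructions that process several elements at once. As a hand-written illustration of the transformation itself, using SSE intrinsics (this is neither Vemal's algorithm nor its output):

```cpp
#include <immintrin.h>

// Scalar version: one addition per loop iteration.
void add_scalar(float *a, const float *b, int n) {
    for (int i = 0; i < n; ++i)
        a[i] += b[i];
}

// Roughly what a vectorizer aims to produce: four additions per iteration.
// Simplifying assumptions: n is a multiple of 4 and both pointers are
// 16-byte aligned; a real compiler also emits a scalar remainder loop.
void add_vectorized(float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(a + i, _mm_add_ps(va, vb));
    }
}
```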

[...] “Intel’s documents are neither error-free nor complete, and Intel will omit certain things, because it’s proprietary,” says co-author on all three papers Charith Mendis, a graduate student in EECS and CSAIL. “However, when you use data, you don’t need to know the documentation. If there’s something hidden you can learn it directly from the data.”

[...] In training, the Ithemal model analyzes millions of automatically profiled basic blocks to learn exactly how different chip architectures will execute computation. Importantly, Ithemal takes raw text as input and does not require manually adding features to the input data. In testing, Ithemal can be fed previously unseen basic blocks and a given chip, and will generate a single number indicating how fast the chip will execute that code.
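
In other words, the contract is simply "basic block text in, one throughput number out." A hypothetical interface sketch of that contract (the names here are invented for illustration; the released Ithemal code is a separate research codebase):

```cpp
#include <string>

// Hypothetical interface, for illustration only: the essential contract
// of a learned throughput model is a mapping from the raw text of a
// basic block to a single predicted cost (e.g. cycles per iteration).
struct ThroughputModel {
    virtual double predict(const std::string &basic_block_text) const = 0;
    virtual ~ThroughputModel() = default;
};
```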

The researchers found Ithemal cut the error rate (the gap between predicted speed and real-world speed) by 50 percent compared with traditional hand-crafted models. Further, in their next paper, they showed that Ithemal's error rate was 10 percent, versus 20 percent for the Intel performance-prediction model, on a variety of basic blocks across multiple domains.

Articles:

Charith Mendis, Alex Renda, Saman Amarasinghe, Michael Carbin. Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks. http://proceedings.mlr.press/v97/mendis19a/mendis19a.pdf

Yishen Chen, Ajay Brahmakshatriya, Charith Mendis, Alex Renda, Eric Atkinson, Ondrej Sykora, Saman Amarasinghe, Michael Carbin. BHive: A Benchmark Suite and Measurement Framework for Validating x86-64 Basic Block Performance Models. http://groups.csail.mit.edu/commit/papers/19/ithemal-measurement.pdf

Charith Mendis, Cambridge Yang, Yewen Pu, Saman Amarasinghe, Michael Carbin. Compiler Auto-Vectorization with Imitation Learning. http://papers.nips.cc/paper/9604-compiler-auto-vectorization-with-imitation-learning.pdf


  • (Score: 4, Informative) by Rich on Monday January 27 2020, @11:09PM (2 children)

    The modern "wide & deep" monster CPUs (like 16 pipeline stages, 4 execution units) more or less remove the need for keyhole optimization, so that's not an area I would throw these tools at. This is one of a few things I learned a decade ago when I wrote a MIPS-like-CPU-to-x86 dynamic recompiler for simulation of a complex robotic device with about 16 of these CPUs. I wanted to achieve better-than-realtime on a single off-the-shelf PC, so I had benchmarks for an estimate. While basic algorithmic changes showed up in the benchmarks, fine tuning hardly made a difference. The CPU fills all its pipeline slots as well as the fundamental algorithmic hazards allow and doesn't really care about a bit of low-level inefficiency here or there. Going to off-chip data is what slows you down (*). The IPC gains of newer generations (*lake, *zen) come from optimizing that further, so cycle counting (like doing the 32usec nibble loop for the original Woz machine) doesn't really apply anymore once you go significantly past an AVR.

    Where the described tools might come in handy is estimating whether your new algorithm for widespread distribution (used to be video codecs, probably some neural net stuff today) will perform to requirements on a sufficient percentage of the target machinery, which is very diverse in the case of general-use PCs.

    (*) This is why the C++ policy of "zero overhead" can easily make things worse with templates. Unless the duplicated instantiations are really small, it's faster to have flexible code that does a few overheady address computations (which usually get parallelized away by the CPU anyway) than to have many times the code, each copy perfectly matched to the data but thrashing the cache.
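
    As a concrete illustration of that tradeoff (a sketch, not anything measured here): std::sort instantiates a separately specialized copy of the sorting code for every element type it is used with, while C's qsort is a single shared routine that pays for an indirect comparison call and void* address arithmetic on every compare but occupies the instruction cache only once.

```cpp
#include <algorithm>
#include <cstdlib>

// Each std::sort instantiation below generates its own machine code,
// specialized to the element type -- fast per call, but three copies
// of a sorting routine now compete for the instruction cache.
void sort_ints(int *p, int n)       { std::sort(p, p + n); }
void sort_floats(float *p, int n)   { std::sort(p, p + n); }
void sort_doubles(double *p, int n) { std::sort(p, p + n); }

// qsort is one shared routine: every comparison goes through a function
// pointer and void* arithmetic, but the code exists exactly once.
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}
void sort_ints_c(int *p, int n) { std::qsort(p, n, sizeof(int), cmp_int); }
```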

  • (Score: 2, Informative) by YeaWhatevs on Tuesday January 28 2020, @12:51PM (1 child)

    I have found the same thing. I recently rewrote some C++ code as C after I had it working, in order to consolidate template code. It was highly annoying to rewrite and debug, and I wouldn't recommend it, but in the end it cut the executable size down by about 2/3 and left the time spent executing approximately the same.

    • (Score: 2) by Rich on Wednesday January 29 2020, @12:16PM

      What you might do is have a "C-with-classes" (or "EC++") like base layer that does everything with "void*", and put a thin template over it that generates no code, or only very little, purely for type safety. I did that a quarter century ago, when the promised "the linker will fold identical template code (i.e. for typed pointers)" was still about 15 years away. Not sure how the technique might interact with the abominations that increasingly "modern" C++ brought, though.
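
      A minimal sketch of that layering, using a simple pointer container as the example (illustrative only, not the original code): the untyped base class holds all the real logic against void*, and the template adds nothing but casts, so each instantiation is tiny and typically inlines away.

```cpp
#include <cstddef>
#include <vector>

// Untyped base layer: the real logic is compiled exactly once.
class PtrList {
public:
    void push(void *p)            { items_.push_back(p); }
    void *at(std::size_t i) const { return items_[i]; }
    std::size_t size() const      { return items_.size(); }
private:
    std::vector<void *> items_;
};

// Thin typed facade: nothing but casts, so every instantiation adds
// essentially no code while restoring type safety at the call site.
template <typename T>
class TypedPtrList : private PtrList {
public:
    void push(T *p)              { PtrList::push(p); }
    T *at(std::size_t i) const   { return static_cast<T *>(PtrList::at(i)); }
    using PtrList::size;
};
```

      A TypedPtrList<Foo> then behaves like a typed container at the call site while sharing all of PtrList's generated code.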