SoylentNews is people

posted by martyb on Saturday April 08 2017, @11:23PM
from the if-spammers-used-'em-would-we-have-phish-and-chips? dept.

This week Google released a report detailing the design and performance characteristics of the Tensor Processing Unit (TPU), its custom ASIC for the inference phase of neural networks (NN). Google has been using the machine learning accelerator in its datacenters since 2015, but hasn't said much about the hardware until now.

In a blog post published yesterday (April 5, 2017), Norm Jouppi, distinguished hardware engineer at Google, observes, "The need for TPUs really emerged about six years ago, when we started using computationally expensive deep learning models in more and more places throughout our products. The computational expense of using these models had us worried. If we considered a scenario where people use Google voice search for just three minutes a day and we ran deep neural nets for our speech recognition system on the processing units we were using, we would have had to double the number of Google data centers!"

The paper, "In-Datacenter Performance Analysis of a Tensor Processing Unit" (the joint effort of more than 70 authors), describes the TPU as follows:

"The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power."
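The headline figures in that passage can be checked with a little arithmetic. What follows is a minimal sketch, not Google's implementation: it assumes the 700 MHz clock given in the paper (the quote above does not state it), counts each multiply-accumulate (MAC) as two operations, and uses illustrative function names (peak_tops, to_int8, mac_matmul) that do not come from the paper. The second half shows, slowly and simply, what a grid of 8-bit MACs computes.

```python
# Hypothetical sketch; names are illustrative and not taken from Google's paper.

def peak_tops(num_macs, clock_hz, ops_per_mac=2):
    """Peak throughput in tera-operations per second.

    Each multiply-accumulate is counted as two operations (one multiply,
    one add), the usual convention for quoting MAC-array throughput.
    """
    return num_macs * ops_per_mac * clock_hz / 1e12


def to_int8(x):
    """Clamp an integer to the signed 8-bit range [-128, 127]."""
    return max(-128, min(127, x))


def mac_matmul(a, b):
    """Multiply two matrices of 8-bit integers the slow, simple way.

    Every inner-loop step is one multiply-accumulate. The running sum is an
    ordinary Python int, standing in for a wider hardware accumulator.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for k in range(inner):
                acc += to_int8(a[i][k]) * to_int8(b[k][j])
            out[i][j] = acc
    return out


MACS = 256 * 256      # 65,536 MACs, as quoted (a 256 x 256 array in the paper)
CLOCK_HZ = 700e6      # assumption: 700 MHz clock, taken from the paper

print(f"{peak_tops(MACS, CLOCK_HZ):.2f} TOPS")
print(mac_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Under those assumptions the peak works out to just under 92 TOPS, which matches the rounded figure in the quote.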

Original Submission

Related Stories

Nvidia Compares Google's TPUs to the Tesla P40 4 comments

Following Google's release of a paper detailing how its tensor processing units (TPUs) beat 2015 CPUs and GPUs at machine learning inference tasks, Nvidia has countered with results from its Tesla P40:

Google's TPU went online in 2015, which is why the company compared its performance against other chips that it was using at that time in its data centers, such as the Nvidia Tesla K80 GPU and the Intel Haswell CPU.

Google is only now releasing the results, possibly because it doesn't want other machine learning competitors (think Microsoft, rather than Nvidia or Intel) to learn about the secrets that make its AI so advanced, at least until it's too late to matter. Releasing the TPU results now could very well mean Google is already testing or even deploying its next-generation TPU.

Nevertheless, Nvidia took the opportunity to show that its latest inference GPUs, such as the Tesla P40, have evolved significantly since then, too. Some of the increase in inference performance seen by Nvidia GPUs is due to the company jumping from the previous 28nm process node to the 16nm FinFET node. This jump offered its chips about twice as much performance per Watt.

Nvidia also further improved its GPU architecture for deep learning in Maxwell, and then again in Pascal. Another reason the new GPU is so much faster at inference is that Nvidia's deep learning and inference-optimized software has improved significantly as well.

Finally, perhaps the main reason the Tesla P40 can be up to 26x faster than the old Tesla K80, according to Nvidia, is that the Tesla P40 supports INT8 computation, whereas the K80 supports only FP32. Inference doesn't need high numerical precision, and 8-bit integers seem to be enough for most types of neural networks.
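To illustrate why 8-bit integers can be enough, here is a minimal sketch of symmetric quantization applied to a single dot product. It is an assumption-laden toy, not Nvidia's or Google's scheme: the helper names (scale_for, quantize, dot_int8) and the sample weights and inputs are invented for the example, and real inference stacks calibrate their scales far more carefully.

```python
# Hypothetical sketch of symmetric 8-bit quantization; not vendor code.

def scale_for(values):
    """Pick a scale so the largest magnitude maps to 127 (values must not all be zero)."""
    return max(abs(v) for v in values) / 127.0


def quantize(values, scale):
    """Map floats to signed 8-bit integers: q = round(v / scale), clamped."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dot_fp(weights, inputs):
    """Reference dot product in floating point."""
    return sum(w * x for w, x in zip(weights, inputs))


def dot_int8(weights, inputs):
    """Dot product using 8-bit integer operands and an integer accumulator."""
    w_scale, x_scale = scale_for(weights), scale_for(inputs)
    qw, qx = quantize(weights, w_scale), quantize(inputs, x_scale)
    acc = sum(a * b for a, b in zip(qw, qx))   # integer multiply-accumulate
    return acc * w_scale * x_scale             # rescale to a float once, at the end


weights = [0.12, -0.53, 0.91, 0.07]   # made-up example values
inputs = [1.5, 0.25, -0.75, 2.0]      # made-up example values
exact = dot_fp(weights, inputs)
approx = dot_int8(weights, inputs)
print(f"fp: {exact:.4f}  int8: {approx:.4f}  error: {abs(exact - approx):.4f}")
```

For these sample values the integer result lands within a small fraction of a percent of the floating-point one, which is the kind of tolerance the article is describing.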

Google's TPUs use less power, have an unknown cost (the P40 can cost $5,700), and may have advanced considerably since 2015.

Previously: Google Reveals Homegrown "TPU" For Machine Learning

Original Submission

Update: Google Used a New AI to Design Its Next AI Chip 11 comments

Update, 9 June 2021: Google reports this week in the journal Nature that its next generation AI chip, succeeding the TPU version 4, was designed in part using an AI that researchers described to IEEE Spectrum last year. They've made some improvements since Spectrum last spoke to them. The AI now needs fewer than six hours to generate chip floorplans that match or beat human-produced designs at power consumption, performance, and area. Expert humans typically need months of iteration to do this task.

Original blog post from 23 March 2020 follows:

There's been a lot of intense and well-funded work developing chips that are specially designed to perform AI algorithms faster and more efficiently. The trouble is that it takes years to design a chip, and the universe of machine learning algorithms moves a lot faster than that. Ideally you want a chip that's optimized to do today's AI, not the AI of two to five years ago. Google's solution: have an AI design the AI chip.

"We believe that it is AI itself that will provide the means to shorten the chip design cycle, creating a symbiotic relationship between hardware and AI, with each fueling advances in the other," they write in a paper describing the work that was posted today to arXiv.

"We have already seen that there are algorithms or neural network architectures that... don't perform as well on existing generations of accelerators, because the accelerators were designed like two years ago, and back then these neural nets didn't exist," says Azalia Mirhoseini, a senior research scientist at Google. "If we reduce the design cycle, we can bridge the gap."

Journal References:
1.) Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, et al. A graph placement methodology for fast chip design, Nature (DOI: 10.1038/s41586-021-03544-w)
2.) Anna Goldie, Azalia Mirhoseini. Placement Optimization with Deep Reinforcement Learning, (DOI:

Related: Google Reveals Homegrown "TPU" For Machine Learning
Google Pulls Back the Covers on Its First Machine Learning Chip
Hundred Petaflop Machine Learning Supercomputers Now Available on Google Cloud
Google Replaced Millions of Intel Xeons with its Own "Argos" Video Transcoding Units

Original Submission

Google's New TPUs are Now Much Faster -- will be Made Available to Researchers 20 comments

Google's machine learning oriented chips have gotten an upgrade:

At Google I/O 2017, Google announced its next-generation machine learning chip, called the "Cloud TPU." The new TPU no longer does only inference--now it can also train neural networks.

[...] In last month's paper, Google hinted that a next-generation TPU could be significantly faster if certain modifications were made. The Cloud TPU seems to have received some of those improvements. It's now much faster, and it can also do floating-point computation, which means it's suitable for training neural networks, too.

According to Google, the chip can achieve 180 teraflops of floating-point performance, which is six times more than Nvidia's latest Tesla V100 accelerator for FP16 half-precision computation. Even when compared against Nvidia's "Tensor Core" performance, the Cloud TPU is still 50% faster.
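The two comparisons in that paragraph are consistent with each other, as a quick check shows. This sketch assumes the V100 peak figures Nvidia quoted at its May 2017 announcement (30 teraflops for standard FP16 and 120 teraflops for Tensor Cores); those numbers are not in the excerpt above, and the constant and function names are illustrative.

```python
# Quick ratio check; the V100 figures are assumptions from Nvidia's announcement.

CLOUD_TPU_TFLOPS = 180.0     # from the quoted article
V100_FP16_TFLOPS = 30.0      # assumption: standard FP16 peak
V100_TENSOR_TFLOPS = 120.0   # assumption: Tensor Core peak


def speedup(new, old):
    """How many times faster `new` is than `old`."""
    return new / old


print(f"vs FP16:        {speedup(CLOUD_TPU_TFLOPS, V100_FP16_TFLOPS):.1f}x")
print(f"vs Tensor Core: {speedup(CLOUD_TPU_TFLOPS, V100_TENSOR_TFLOPS):.1f}x")
```

A ratio of 6.0 matches "six times more", and 1.5 matches "50% faster".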

[...] Google will also donate access to 1,000 Cloud TPUs to top researchers under the TensorFlow Research Cloud program to see what people do with them.

Also at EETimes and Google.

Previously: Google Reveals Homegrown "TPU" For Machine Learning
Google Pulls Back the Covers on Its First Machine Learning Chip
Nvidia Compares Google's TPUs to the Tesla P40
NVIDIA's Volta Architecture Unveiled: GV100 and Tesla V100

Original Submission

This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 4, Insightful) by fishybell on Sunday April 09 2017, @12:20AM (4 children)

    by fishybell (3156) Subscriber Badge on Sunday April 09 2017, @12:20AM (#491015)

    I'm sure as more and more large companies with large datacenters start looking at these results you'll see more of them jump on the ASIC bandwagon. Given a large enough requirement for the same type of operation over and over, ASICs will always win out in the long run. We've seen it with Bitcoin, and now we're seeing it with datacenters. The fact that it's doing neural-net calculations is completely irrelevant.

    • (Score: 1, Redundant) by Ethanol-fueled on Sunday April 09 2017, @01:01AM (1 child)

      by Ethanol-fueled (2792) on Sunday April 09 2017, @01:01AM (#491023) Homepage

      In case anybody asks why ASICs aren't used more, it's because they're way expensive compared to, say, an FPGA or CPLD. It wouldn't make sense to order just 20 of them unless you're fucking Google and possess the required Jew Golds.

      • (Score: 0) by Anonymous Coward on Tuesday April 11 2017, @05:07PM

        by Anonymous Coward on Tuesday April 11 2017, @05:07PM (#492358)

        Why are you so afraid of jews to the point where you have to spout tired cartman-esque nonsense? What, they killed your family and burnt down your village or something?

    • (Score: 2) by RamiK on Sunday April 09 2017, @08:44AM

      by RamiK (1813) on Sunday April 09 2017, @08:44AM (#491121)

      The fact that it's doing neural-net calculations is completely irrelevant.

      Yup. Once everyone sees how efficient my orthodontist's distal cutters are, they'll all want their own specialized, custom-made tools. CNC always wins out in the long run.

    • (Score: 2) by kaszz on Sunday April 09 2017, @06:49PM

      by kaszz (4211) on Sunday April 09 2017, @06:49PM (#491227) Journal

      So how much does an ASIC cost these days?

      Say 130 nm process, 2 million transistors?

  • (Score: 2) by takyon on Sunday April 09 2017, @12:32AM

    by takyon (881) <reversethis-{gro ... s} {ta} {noykat}> on Sunday April 09 2017, @12:32AM (#491018) Journal

    Google sees Domain specific custom chips as the future with chips 200 times or more faster than Intel chips

    The TPU die leverages its advantage in MACs and on-chip memory to run short programs written using the domain-specific TensorFlow framework 15 times as fast as the K80 GPU die, resulting in a performance/Watt advantage of 29 times, which is correlated with performance/total cost of ownership. Compared to the Haswell CPU die, the corresponding ratios are 29 and 83. While future CPUs and GPUs will surely run inference faster, a redesigned TPU using circa 2015 GPU memory would go two to three times as fast and boost the performance/Watt advantage to nearly 70 over the K80 and 200 over Haswell.
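    The projected figures in that quote hang together arithmetically. As a rough sketch (the function name is made up, and it assumes the projection scales the current performance/Watt ratios by one common factor), the implied gain can be backed out like this:

```python
# Rough consistency check on the quoted performance/Watt figures.

def implied_factor(current_ratio, projected_ratio):
    """Gain needed to move from the current ratio to the projected one."""
    return projected_ratio / current_ratio


k80_factor = implied_factor(29, 70)        # TPU vs K80: 29x now, nearly 70x projected
haswell_factor = implied_factor(83, 200)   # TPU vs Haswell: 83x now, 200x projected

print(f"K80: {k80_factor:.2f}x  Haswell: {haswell_factor:.2f}x")
```

    Both come out at about 2.4, inside the "two to three times as fast" range the quote gives for the redesigned TPU.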

    [SIG] 10/28/2017: Soylent Upgrade v14 []
  • (Score: 2) by Snotnose on Sunday April 09 2017, @01:59AM (2 children)

    by Snotnose (1623) on Sunday April 09 2017, @01:59AM (#491039)

    When I came of age, think early 80's, chip advances were in the "how low can they go". I had several discussions at trade shows along the lines of "they're at the human hair level, can't get much smaller". "They're at 10 atoms, how small can they go?".

    Now it's "costs too much to shrink the die, hmm, let's optimize the CPU for different workloads".

    Not really seeing a problem in that, you can only shrink transistors so far until architecture takes over.

    I fondly remember the day I made sandcastles with my grandmother. Just wish I hadn't done it in the crematorium.
    • (Score: 2) by takyon on Sunday April 09 2017, @02:09AM

      by takyon (881) <reversethis-{gro ... s} {ta} {noykat}> on Sunday April 09 2017, @02:09AM (#491044) Journal

      Look at the TPU's lower power consumption, or the even lower power consumption of neuromorphic (NPU?) chip designs. These are just begging to be scaled vertically, and could lead to orders of magnitude better performance for their niche.

      [SIG] 10/28/2017: Soylent Upgrade v14 []
    • (Score: 2) by kaszz on Sunday April 09 2017, @05:43PM

      by kaszz (4211) on Sunday April 09 2017, @05:43PM (#491207) Journal

      It's rather about which approach will be the lowest-hanging fruit for the time being. If it's cheaper to just pack in more transistors and increase the frequency, then that will be done. If architectural approaches offer more bang for the buck, then that is what will be done.

  • (Score: 0) by Anonymous Coward on Sunday April 09 2017, @02:11PM

    by Anonymous Coward on Sunday April 09 2017, @02:11PM (#491163)

    that paper is well written, but unless you are already in the field, it would be a research project to figure out what it says.

    For a description of what they built, how about a nice C implementation that runs slowly, but simply?