from the if-spammers-used-'em-would-we-have-phish-and-chips? dept.
This week Google released a report detailing the design and performance characteristics of the Tensor Processing Unit (TPU), its custom ASIC for the inference phase of neural networks (NN). Google has been using the machine learning accelerator in its datacenters since 2015, but hasn't said much about the hardware until now.
In a blog post published yesterday (April 5, 2017), Norm Jouppi, distinguished hardware engineer at Google, observes, "The need for TPUs really emerged about six years ago, when we started using computationally expensive deep learning models in more and more places throughout our products. The computational expense of using these models had us worried. If we considered a scenario where people use Google voice search for just three minutes a day and we ran deep neural nets for our speech recognition system on the processing units we were using, we would have had to double the number of Google data centers!"
The paper, "In-Datacenter Performance Analysis of a Tensor Processing Unit," (the joint effort of more than 70 authors) describes the TPU thusly:
"The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power."
Following Google's release of a paper detailing how its tensor processing units (TPUs) beat 2015 CPUs and GPUs at machine learning inference tasks, Nvidia has countered with results from its Tesla P40:
Google's TPU went online in 2015, which is why the company compared its performance against other chips that it was using at that time in its data centers, such as the Nvidia Tesla K80 GPU and the Intel Haswell CPU.
Google is only now releasing the results, possibly because it doesn't want other machine learning competitors (think Microsoft, rather than Nvidia or Intel) to learn about the secrets that make its AI so advanced, at least until it's too late to matter. Releasing the TPU results now could very well mean Google is already testing or even deploying its next-generation TPU.
Nevertheless, Nvidia took the opportunity to show that its latest inference GPUs, such as the Tesla P40, have evolved significantly since then, too. Some of the increase in inference performance seen by Nvidia GPUs is due to the company jumping from the previous 28nm process node to the 16nm FinFET node. This jump offered its chips about twice as much performance per Watt.
Nvidia also further improved its GPU architecture for deep learning in Maxwell, and then again in Pascal. Yet another reason for why the new GPU is so much faster for inferencing is that Nvidia's deep learning and inference-optimized software has improved significantly as well.
Finally, perhaps the main reason for why the Tesla P40 can be up to 26x faster than the old Tesla K80, according to Nvidia, is because the Tesla P40 supports INT8 computation, as opposed to the FP32-only support for the K80. Inference doesn't need too high accuracy when doing calculations and 8-bit integers seem to be enough for most types of neural networks.
Google's TPUs use less power, have an unknown cost (the P40 can cost $5,700), and may have advanced considerably since 2015.
Update, 9 June 2021: Google reports this week in the journal Nature that its next generation AI chip, succeeding the TPU version 4, was designed in part using an AI that researchers described to IEEE Spectrum last year. They've made some improvements since Spectrum last spoke to them. The AI now needs fewer than six hours to generate chip floorplans that match or beat human-produced designs at power consumption, performance, and area. Expert humans typically need months of iteration to do this task.
Original blog post from 23 March 2020 follows:
There's been a lot of intense and well-funded work developing chips that are specially designed to perform AI algorithms faster and more efficiently. The trouble is that it takes years to design a chip, and the universe of machine learning algorithms moves a lot faster than that. Ideally you want a chip that's optimized to do today's AI, not the AI of two to five years ago. Google's solution: have an AI design the AI chip.
"We believe that it is AI itself that will provide the means to shorten the chip design cycle, creating a symbiotic relationship between hardware and AI, with each fueling advances in the other," they write in a paper describing the work that posted today to Arxiv.
"We have already seen that there are algorithms or neural network architectures that... don't perform as well on existing generations of accelerators, because the accelerators were designed like two years ago, and back then these neural nets didn't exist," says Azalia Mirhoseini, a senior research scientist at Google. "If we reduce the design cycle, we can bridge the gap."
1.) Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, et al. A graph placement methodology for fast chip design, Nature (DOI: 10.1038/s41586-021-03544-w)
2.) Anna Goldie, Azalia Mirhoseini. Placement Optimization with Deep Reinforcement Learning, (DOI: https://arxiv.org/abs/2003.08445)
Related: Google Reveals Homegrown "TPU" For Machine Learning
Google Pulls Back the Covers on Its First Machine Learning Chip
Hundred Petaflop Machine Learning Supercomputers Now Available on Google Cloud
Google Replaced Millions of Intel Xeons with its Own "Argos" Video Transcoding Units
Google's machine learning oriented chips have gotten an upgrade:
At Google I/O 2017, Google announced its next-generation machine learning chip, called the "Cloud TPU." The new TPU no longer does only inference--now it can also train neural networks.
[...] In last month's paper, Google hinted that a next-generation TPU could be significantly faster if certain modifications were made. The Cloud TPU seems to have have received some of those improvements. It's now much faster, and it can also do floating-point computation, which means it's suitable for training neural networks, too.
According to Google, the chip can achieve 180 teraflops of floating-point performance, which is six times more than Nvidia's latest Tesla V100 accelerator for FP16 half-precision computation. Even when compared against Nvidia's "Tensor Core" performance, the Cloud TPU is still 50% faster.
[...] Google will also donate access to 1,000 Cloud TPUs to top researchers under the TensorFlow Research Cloud program to see what people do with them.
Previously: Google Reveals Homegrown "TPU" For Machine Learning
Google Pulls Back the Covers on Its First Machine Learning Chip
Nvidia Compares Google's TPUs to the Tesla P40
NVIDIA's Volta Architecture Unveiled: GV100 and Tesla V100