This week Google released a report detailing the design and performance characteristics of the Tensor Processing Unit (TPU) [hpcwire.com], its custom ASIC for the inference phase of neural networks (NN). Google has been using the machine learning accelerator in its datacenters since 2015, but hasn’t said much about the hardware until now.
In a blog post [googleblog.com] published yesterday (April 5, 2017), Norm Jouppi, distinguished hardware engineer at Google, observes, “The need for TPUs really emerged about six years ago, when we started using computationally expensive deep learning models in more and more places throughout our products. The computational expense of using these models had us worried. If we considered a scenario where people use Google voice search for just three minutes a day and we ran deep neural nets for our speech recognition system on the processing units we were using, we would have had to double the number of Google data centers!”
The paper, “In-Datacenter Performance Analysis of a Tensor Processing Unit” [google.com], the joint effort of more than 70 authors, describes the TPU as follows:
“The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU’s deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, …) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power.”
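As a rough sanity check on the figures in that passage, the short Python sketch below works backward from the quoted 65,536 MACs and 92 TOPS to the clock rate those numbers would imply. It assumes the usual convention that each MAC counts as two operations per cycle (one multiply and one add), and it assumes the MACs are laid out as a square array. Neither assumption is stated in the excerpt above, and the function names are illustrative only.

```python
import math

# Figures quoted in the paper excerpt.
MAC_UNITS = 65_536            # 8-bit multiply-accumulate units
PEAK_OPS_PER_SEC = 92e12      # 92 TeraOps/second (TOPS)

# Assumption (not from the excerpt): one MAC = 2 operations per cycle,
# i.e. one multiply plus one add.
OPS_PER_MAC = 2


def implied_clock_hz(peak_ops, macs, ops_per_mac=OPS_PER_MAC):
    """Clock rate at which `macs` units would deliver `peak_ops` per second."""
    return peak_ops / (macs * ops_per_mac)


def array_side(macs):
    """Side length of the array, assuming the MACs form a perfect square."""
    side = math.isqrt(macs)
    if side * side != macs:
        raise ValueError("MAC count is not a perfect square")
    return side


if __name__ == "__main__":
    side = array_side(MAC_UNITS)
    clock = implied_clock_hz(PEAK_OPS_PER_SEC, MAC_UNITS)
    print(f"Square array: {side} x {side} MACs")
    print(f"Implied clock: {clock / 1e6:.0f} MHz")
```

Under these assumptions the numbers are self-consistent: 65,536 MACs form a 256 x 256 array, and reaching 92 TOPS would take a clock of roughly 702 MHz. That is a modest frequency by CPU standards, which fits the paper's description of the chip as relatively small and low power.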