This week Google released a report detailing the design and performance characteristics of the Tensor Processing Unit (TPU), its custom ASIC for the inference phase of neural networks (NN). Google has been using the machine learning accelerator in its datacenters since 2015, but hasn't said much about the hardware until now.
In a blog post published yesterday (April 5, 2017), Norm Jouppi, distinguished hardware engineer at Google, observes, "The need for TPUs really emerged about six years ago, when we started using computationally expensive deep learning models in more and more places throughout our products. The computational expense of using these models had us worried. If we considered a scenario where people use Google voice search for just three minutes a day and we ran deep neural nets for our speech recognition system on the processing units we were using, we would have had to double the number of Google data centers!"
The paper, "In-Datacenter Performance Analysis of a Tensor Processing Unit" (the joint effort of more than 70 authors), describes the TPU as follows:
"The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power."
(Score: 4, Insightful) by fishybell on Sunday April 09 2017, @12:20AM (4 children)
I'm sure as more and more large companies with large datacenters start looking at these results you'll see more of them jump on the ASIC bandwagon. Given a large enough requirement for the same type of operation over and over, ASICs will always win out in the long run. We've seen it with Bitcoin, and now we're seeing it with datacenters. The fact that it's doing neural-net calculations is completely irrelevant.
(Score: 1, Redundant) by Ethanol-fueled on Sunday April 09 2017, @01:01AM (1 child)
In case anybody asks why ASICs aren't used more: the up-front engineering and fabrication costs are enormous compared to, say, an FPGA or CPLD. It wouldn't make sense to order just 20 of them unless you're Google and have Google money.
(Score: 0) by Anonymous Coward on Tuesday April 11 2017, @05:07PM
Why the gratuitous bigoted aside? It adds nothing to an otherwise fair point about up-front costs.
(Score: 2) by RamiK on Sunday April 09 2017, @08:44AM
The fact that it's doing neural-net calculations is completely irrelevant.
Yup. Once everyone sees how efficient my orthodontist's distal cutters are, they'll all want their own specialized, custom-made tools. CNC always wins out in the long run.
(Score: 2) by kaszz on Sunday April 09 2017, @06:49PM
So how much does an ASIC cost these days?
Say, a 130 nm process, 2 million transistors?
(Score: 2) by takyon on Sunday April 09 2017, @12:32AM
Google sees domain-specific custom chips as the future, with chips 200 times or more faster than Intel chips [nextbigfuture.com]
(Score: 2) by Snotnose on Sunday April 09 2017, @01:59AM (2 children)
When I came of age, think early '80s, chip advances were all about "how low can they go." I had several discussions at trade shows along the lines of "they're at the width of a human hair, they can't get much smaller" and "they're at 10 atoms, how small can they go?"
Now it's "it costs too much to shrink the die, hmm, let's optimize the CPU for different workloads."
Not really seeing a problem in that; you can only shrink transistors so far before architecture takes over.
(Score: 2) by takyon on Sunday April 09 2017, @02:09AM
Look at the TPU's lower power consumption, or the even lower power consumption of neuromorphic (NPU?) chip designs. These are just begging to be scaled vertically, and could lead to orders of magnitude better performance for their niche.
(Score: 2) by kaszz on Sunday April 09 2017, @05:43PM
It's rather about which approach is the lowest-hanging fruit at the time. If it's cheaper to just pack in more transistors and increase the frequency, then that's what will be done. If architectural approaches deliver more bang for the buck, then that's what will be done instead.
(Score: 0) by Anonymous Coward on Sunday April 09 2017, @02:11PM
That paper is well written, but unless you're already in the field, it would be a research project to figure out what it says.
For a description of what they built, how about a nice C implementation that runs slowly, but simply?