from the Are-you-thinking-what-I'm-thinking? dept.
Google's machine learning oriented chips have gotten an upgrade:
At Google I/O 2017, Google announced its next-generation machine learning chip, called the "Cloud TPU." The new TPU no longer does only inference--now it can also train neural networks.
[...] In last month's paper, Google hinted that a next-generation TPU could be significantly faster if certain modifications were made. The Cloud TPU seems to have have received some of those improvements. It's now much faster, and it can also do floating-point computation, which means it's suitable for training neural networks, too.
According to Google, the chip can achieve 180 teraflops of floating-point performance, which is six times more than Nvidia's latest Tesla V100 accelerator for FP16 half-precision computation. Even when compared against Nvidia's "Tensor Core" performance, the Cloud TPU is still 50% faster.
[...] Google will also donate access to 1,000 Cloud TPUs to top researchers under the TensorFlow Research Cloud program to see what people do with them.
Previously: Google Reveals Homegrown "TPU" For Machine Learning
Google Pulls Back the Covers on Its First Machine Learning Chip
Nvidia Compares Google's TPUs to the Tesla P40
NVIDIA's Volta Architecture Unveiled: GV100 and Tesla V100
[We] started a stealthy project at Google several years ago to see what we could accomplish with our own custom accelerators for machine learning applications. The result is called a Tensor Processing Unit (TPU), a custom ASIC we built specifically for machine learning — and tailored for TensorFlow. We've been running TPUs inside our data centers for more than a year, and have found them to deliver an order of magnitude better-optimized performance per watt for machine learning. This is roughly equivalent to fast-forwarding technology about seven years into the future (three generations of Moore's Law). [...] TPU is an example of how fast we turn research into practice — from first tested silicon, the team
had them up and running applications at speed in our data centers within 22 days.
The processors are already being used to improve search and Street View, and were used to power AlphaGo during its matches against Go champion Lee Sedol. More details can be found at Next Platform, Tom's Hardware, and AnandTech.
This week Google released a report detailing the design and performance characteristics of the Tensor Processing Unit (TPU), its custom ASIC for the inference phase of neural networks (NN). Google has been using the machine learning accelerator in its datacenters since 2015, but hasn't said much about the hardware until now.
In a blog post published yesterday (April 5, 2017), Norm Jouppi, distinguished hardware engineer at Google, observes, "The need for TPUs really emerged about six years ago, when we started using computationally expensive deep learning models in more and more places throughout our products. The computational expense of using these models had us worried. If we considered a scenario where people use Google voice search for just three minutes a day and we ran deep neural nets for our speech recognition system on the processing units we were using, we would have had to double the number of Google data centers!"
The paper, "In-Datacenter Performance Analysis of a Tensor Processing Unit," (the joint effort of more than 70 authors) describes the TPU thusly:
"The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power."
Following Google's release of a paper detailing how its tensor processing units (TPUs) beat 2015 CPUs and GPUs at machine learning inference tasks, Nvidia has countered with results from its Tesla P40:
Google's TPU went online in 2015, which is why the company compared its performance against other chips that it was using at that time in its data centers, such as the Nvidia Tesla K80 GPU and the Intel Haswell CPU.
Google is only now releasing the results, possibly because it doesn't want other machine learning competitors (think Microsoft, rather than Nvidia or Intel) to learn about the secrets that make its AI so advanced, at least until it's too late to matter. Releasing the TPU results now could very well mean Google is already testing or even deploying its next-generation TPU.
Nevertheless, Nvidia took the opportunity to show that its latest inference GPUs, such as the Tesla P40, have evolved significantly since then, too. Some of the increase in inference performance seen by Nvidia GPUs is due to the company jumping from the previous 28nm process node to the 16nm FinFET node. This jump offered its chips about twice as much performance per Watt.
Nvidia also further improved its GPU architecture for deep learning in Maxwell, and then again in Pascal. Yet another reason for why the new GPU is so much faster for inferencing is that Nvidia's deep learning and inference-optimized software has improved significantly as well.
Finally, perhaps the main reason for why the Tesla P40 can be up to 26x faster than the old Tesla K80, according to Nvidia, is because the Tesla P40 supports INT8 computation, as opposed to the FP32-only support for the K80. Inference doesn't need too high accuracy when doing calculations and 8-bit integers seem to be enough for most types of neural networks.
Google's TPUs use less power, have an unknown cost (the P40 can cost $5,700), and may have advanced considerably since 2015.
NVIDIA has detailed the full GV100 GPU as well as the first product based on the GPU, the Tesla V100:
The Volta GV100 GPU uses the 12nm TSMC FFN process, has over 21 billion transistors, and is designed for deep learning applications. We're talking about an 815mm2 die here, which pushes the limits of TSMC's current capabilities. Nvidia said it's not possible to build a larger GPU on the current process technology. The GP100 was the largest GPU that Nvidia ever produced before the GV100. It took up a 610mm2 surface area and housed 15.3 billion transistors. The GV100 is more than 30% larger.
Volta's full GV100 GPU sports 84 SMs (each SM [streaming multiprocessor] features four texture units, 64 FP32 cores, 64 INT32 cores, 32 FP64 cores) fed by 128KB of shared L1 cache per SM that can be configured to varying texture cache and shared memory ratios. The GP100 featured 60 SMs and a total of 3840 CUDA cores. The Volta SMs also feature a new type of core that specializes in Tensor deep learning 4x4 Matrix operations. The GV100 contains eight Tensor cores per SM and deliver a total of 120 TFLOPS for training and inference operations. To save you some math, this brings the full GV100 GPU to an impressive 5,376 FP32 and INT32 cores, 2688 FP64 cores, and 336 texture units.
[...] GV100 also features four HBM2 memory emplacements, like GP100, with each stack controlled by a pair of memory controllers. Speaking of which, there are eight 512-bit memory controllers (giving this GPU a total memory bus width of 4,096-bit). Each memory controller is attached to 768KB of L2 cache, for a total of 6MB of L2 cache (vs 4MB for Pascal).
The Tesla V100 has 16 GB of HBM2 memory with 900 GB/s of memory bandwidth. NVLink interconnect bandwidth has been increased to 300 GB/s.
Note the "120 TFLOPS" for machine learning operations. Microsoft is "doubling down" on AI, and NVIDIA's sales to data centers have tripled in a year. Sales of automotive-oriented GPUs (more machine learning) also increased.
Apple is working on a processor devoted specifically to AI-related tasks, according to a person familiar with the matter. The chip, known internally as the Apple Neural Engine, would improve the way the company's devices handle tasks that would otherwise require human intelligence -- such as facial recognition and speech recognition, said the person, who requested anonymity discussing a product that hasn't been made public. Apple declined to comment.
[...] Apple devices currently handle complex artificial intelligence processes with two different chips: the main processor and the graphics chip. The new chip would let Apple offload those tasks onto a dedicated module designed specifically for demanding artificial intelligence processing, allowing Apple to improve battery performance.
Should Apple bring the chip out of testing and development, it would follow other semiconductor makers that have already introduced dedicated AI chips. Qualcomm Inc.'s latest Snapdragon chip for smartphones has a module for handling artificial intelligence tasks, while Google announced its first chip, called the Tensor Processing Unit (TPU), in 2016.
Google will supposedly put mini TPUs into smartphones in the coming years.
Google DeepMind researchers have made their old AlphaGo program obsolete:
The old AlphaGo relied on a computationally intensive Monte Carlo tree search to play through Go scenarios. The nodes and branches created a much larger tree than AlphaGo practically needed to play. A combination of reinforcement learning and human-supervised learning was used to build "value" and "policy" neural networks that used the search tree to execute gameplay strategies. The software learned from 30 million moves played in human-on-human games, and benefited from various bodges and tricks to learn to win. For instance, it was trained from master-level human players, rather than picking it up from scratch.
AlphaGo Zero did start from scratch with no experts guiding it. And it is much more efficient: it only uses a single computer and four of Google's custom TPU1 chips to play matches, compared to AlphaGo's several machines and 48 TPUs. Since Zero didn't rely on human gameplay, and a smaller number of matches, its Monte Carlo tree search is smaller. The self-play algorithm also combined both the value and policy neural networks into one, and was trained on 64 GPUs and 19 CPUs over a few days by playing nearly five million games against itself. In comparison, AlphaGo needed months of training and used 1,920 CPUs and 280 GPUs to beat Lee Sedol.
Though self-play AlphaGo Zero even discovered for itself, without human intervention, classic moves in the theory of Go, such as fuseki opening tactics, and what's called life and death. More details can be found in Nature, or from the paper directly here. Stanford computer science academic Bharath Ramsundar has a summary of the more technical points, here.
Go is an abstract strategy board game for two players, in which the aim is to surround more territory than the opponent.