Google's New TPUs are Now Much Faster -- will be Made Available to Researchers

posted by martyb on Friday May 19 2017, @12:34AM
from the Are-you-thinking-what-I'm-thinking? dept.

Google's machine learning oriented chips have gotten an upgrade:

At Google I/O 2017, Google announced its next-generation machine learning chip, called the "Cloud TPU." The new TPU no longer does only inference--now it can also train neural networks.

[...] In last month's paper, Google hinted that a next-generation TPU could be significantly faster if certain modifications were made. The Cloud TPU seems to have received some of those improvements. It's now much faster, and it can also do floating-point computation, which means it's suitable for training neural networks, too.

According to Google, the chip can achieve 180 teraflops of floating-point performance, which is six times more than Nvidia's latest Tesla V100 accelerator for FP16 half-precision computation. Even when compared against Nvidia's "Tensor Core" performance, the Cloud TPU is still 50% faster.
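Those ratios are consistent with the article's own numbers. As a quick sanity check (the V100 figures of 30 TFLOPS for FP16 and 120 TFLOPS for tensor cores are the commonly cited peak values, assumed here rather than stated above):

```python
# Sanity check of the quoted speedup ratios.
cloud_tpu_tflops = 180.0      # Cloud TPU, per the article
v100_fp16_tflops = 30.0       # Tesla V100 FP16 peak (non-tensor-core), commonly cited
v100_tensor_tflops = 120.0    # Tesla V100 tensor-core peak, commonly cited

print(cloud_tpu_tflops / v100_fp16_tflops)    # 6.0 -> "six times more"
print(cloud_tpu_tflops / v100_tensor_tflops)  # 1.5 -> "50% faster"
```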

[...] Google will also donate access to 1,000 Cloud TPUs to top researchers under the TensorFlow Research Cloud program to see what people do with them.

Also at EETimes and Google.

Previously: Google Reveals Homegrown "TPU" For Machine Learning
Google Pulls Back the Covers on Its First Machine Learning Chip
Nvidia Compares Google's TPUs to the Tesla P40
NVIDIA's Volta Architecture Unveiled: GV100 and Tesla V100

Original Submission

Related Stories

Google Reveals Homegrown "TPU" For Machine Learning 20 comments

Google has lifted the lid off of an internal project to create custom application-specific integrated circuits (ASICs) for machine learning tasks. The result is what they are calling a "TPU":

[We] started a stealthy project at Google several years ago to see what we could accomplish with our own custom accelerators for machine learning applications. The result is called a Tensor Processing Unit (TPU), a custom ASIC we built specifically for machine learning — and tailored for TensorFlow. We've been running TPUs inside our data centers for more than a year, and have found them to deliver an order of magnitude better-optimized performance per watt for machine learning. This is roughly equivalent to fast-forwarding technology about seven years into the future (three generations of Moore's Law). [...] TPU is an example of how fast we turn research into practice — from first tested silicon, the team had them up and running applications at speed in our data centers within 22 days.
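The "order of magnitude" and "three generations of Moore's Law" framing is internally consistent, as a rough sketch (the years-per-generation figure is an assumption chosen so three generations span about seven years):

```python
# Rough sketch of the quote's arithmetic: three Moore's-Law generations
# of doubling give 2^3 = 8x, close to the "order of magnitude" (~10x)
# performance-per-watt advantage Google claims for the TPU.
generations = 3
speedup = 2 ** generations          # 8x from three doublings
years_per_generation = 7 / 3        # assumed, so 3 generations ~ 7 years
print(speedup, round(generations * years_per_generation))  # 8 7
```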

The processors are already being used to improve search and Street View, and were used to power AlphaGo during its matches against Go champion Lee Sedol. More details can be found at Next Platform, Tom's Hardware, and AnandTech.

Original Submission

Google Pulls Back the Covers on Its First Machine Learning Chip 10 comments

This week Google released a report detailing the design and performance characteristics of the Tensor Processing Unit (TPU), its custom ASIC for the inference phase of neural networks (NN). Google has been using the machine learning accelerator in its datacenters since 2015, but hasn't said much about the hardware until now.

In a blog post published yesterday (April 5, 2017), Norm Jouppi, distinguished hardware engineer at Google, observes, "The need for TPUs really emerged about six years ago, when we started using computationally expensive deep learning models in more and more places throughout our products. The computational expense of using these models had us worried. If we considered a scenario where people use Google voice search for just three minutes a day and we ran deep neural nets for our speech recognition system on the processing units we were using, we would have had to double the number of Google data centers!"

The paper, "In-Datacenter Performance Analysis of a Tensor Processing Unit" (the joint effort of more than 70 authors), describes the TPU as follows:

"The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power."
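As a toy illustration of what that MAC array computes, here is a minimal pure-Python sketch of an 8-bit integer matrix multiply with wide accumulation. The shapes are illustrative; the real unit is a 256x256 systolic array, and this sketch makes no attempt to model its dataflow:

```python
# Toy model of the TPU's core operation: multiply 8-bit integers,
# accumulate at full width (the hardware accumulates into wide registers
# so products never overflow mid-sum).
def int8_matmul(a, b):
    """a: MxK, b: KxN lists of ints in [-128, 127]; returns MxN sums."""
    assert all(-128 <= x <= 127 for row in a for x in row)
    assert all(-128 <= x <= 127 for row in b for x in row)
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0                       # wide accumulator
            for k in range(inner):
                acc += a[i][k] * b[k][j]  # 8-bit multiply-accumulate
            out[i][j] = acc
    return out

print(int8_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```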

Original Submission

Nvidia Compares Google's TPUs to the Tesla P40 4 comments

Following Google's release of a paper detailing how its tensor processing units (TPUs) beat 2015 CPUs and GPUs at machine learning inference tasks, Nvidia has countered with results from its Tesla P40:

Google's TPU went online in 2015, which is why the company compared its performance against other chips that it was using at that time in its data centers, such as the Nvidia Tesla K80 GPU and the Intel Haswell CPU.

Google is only now releasing the results, possibly because it doesn't want other machine learning competitors (think Microsoft, rather than Nvidia or Intel) to learn about the secrets that make its AI so advanced, at least until it's too late to matter. Releasing the TPU results now could very well mean Google is already testing or even deploying its next-generation TPU.

Nevertheless, Nvidia took the opportunity to show that its latest inference GPUs, such as the Tesla P40, have evolved significantly since then, too. Some of the increase in inference performance seen by Nvidia GPUs is due to the company jumping from the previous 28nm process node to the 16nm FinFET node. This jump offered its chips about twice as much performance per watt.

Nvidia also further improved its GPU architecture for deep learning in Maxwell, and then again in Pascal. Yet another reason the new GPU is so much faster at inference is that Nvidia's deep learning and inference-optimized software has improved significantly as well.

Finally, perhaps the main reason the Tesla P40 can be up to 26x faster than the old Tesla K80, according to Nvidia, is that the Tesla P40 supports INT8 computation, as opposed to the K80's FP32-only support. Inference doesn't require very high precision, and 8-bit integers seem to be enough for most types of neural networks.
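A minimal sketch of the idea behind INT8 inference: map floating-point weights onto 8-bit integers with a per-tensor scale, accepting a small rounding error. Symmetric quantization is shown here for illustration; real toolchains also use zero points, per-channel scales, and calibration data:

```python
# Symmetric post-training quantization sketch: FP32 -> INT8 with one scale.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.51, -1.27, 0.02, 0.88]
q, s = quantize_int8(w)
# Inference tolerates the small rounding error introduced by this mapping.
print(q)                                      # [51, -127, 2, 88]
print([round(x, 3) for x in dequantize(q, s)])
```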

Google's TPUs use less power, have an unknown cost (the P40 can cost $5,700), and may have advanced considerably since 2015.

Previously: Google Reveals Homegrown "TPU" For Machine Learning

Original Submission

NVIDIA's Volta Architecture Unveiled: GV100 and Tesla V100 3 comments

NVIDIA has detailed the full GV100 GPU as well as the first product based on the GPU, the Tesla V100:

The Volta GV100 GPU uses the 12nm TSMC FFN process, has over 21 billion transistors, and is designed for deep learning applications. We're talking about an 815mm² die here, which pushes the limits of TSMC's current capabilities. Nvidia said it's not possible to build a larger GPU on the current process technology. The GP100 was the largest GPU that Nvidia ever produced before the GV100. It took up a 610mm² surface area and housed 15.3 billion transistors. The GV100 is more than 30% larger.

Volta's full GV100 GPU sports 84 SMs (each SM [streaming multiprocessor] features four texture units, 64 FP32 cores, 64 INT32 cores, and 32 FP64 cores) fed by 128KB of shared L1 cache per SM that can be configured to varying texture cache and shared memory ratios. The GP100 featured 60 SMs and a total of 3,840 CUDA cores. The Volta SMs also feature a new type of core that specializes in Tensor deep learning 4x4 matrix operations. The GV100 contains eight Tensor cores per SM, delivering a total of 120 TFLOPS for training and inference operations. To save you some math, this brings the full GV100 GPU to an impressive 5,376 FP32 and INT32 cores, 2,688 FP64 cores, and 336 texture units.

[...] GV100 also features four HBM2 memory stacks, like GP100, with each stack controlled by a pair of memory controllers. Speaking of which, there are eight 512-bit memory controllers (giving this GPU a total memory bus width of 4,096 bits). Each memory controller is attached to 768KB of L2 cache, for a total of 6MB of L2 cache (vs 4MB for Pascal).

The Tesla V100 has 16 GB of HBM2 memory with 900 GB/s of memory bandwidth. NVLink interconnect bandwidth has been increased to 300 GB/s.

Note the "120 TFLOPS" for machine learning operations. Microsoft is "doubling down" on AI, and NVIDIA's sales to data centers have tripled in a year. Sales of automotive-oriented GPUs (more machine learning) also increased.
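That 120 TFLOPS figure can be reproduced with back-of-envelope arithmetic, under the assumptions that the shipping Tesla V100 enables 80 of the full GV100's 84 SMs and that each tensor core performs 64 fused multiply-adds (128 flops) per clock at roughly a 1455 MHz boost clock (the SM count and clock value are assumptions, not stated above):

```python
# Back-of-envelope derivation of the V100's tensor-core peak.
sms = 80                           # assumed: enabled SMs on Tesla V100
tensor_cores = sms * 8             # 640 tensor cores
flops_per_core_per_clock = 64 * 2  # 64 FMAs = 128 flops per clock
clock_hz = 1.455e9                 # assumed boost clock
tflops = tensor_cores * flops_per_core_per_clock * clock_hz / 1e12
print(round(tflops, 1))            # ~119, i.e. roughly the quoted 120 TFLOPS
```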

IBM Unveils New AI Software, Will Support Nvidia Volta

Also at AnandTech and HPCWire.

Original Submission

Apple Wants to Add Machine Learning Chips to Smartphone SoCs 14 comments

Apple is working on a processor devoted specifically to AI-related tasks, according to a person familiar with the matter. The chip, known internally as the Apple Neural Engine, would improve the way the company's devices handle tasks that would otherwise require human intelligence -- such as facial recognition and speech recognition, said the person, who requested anonymity discussing a product that hasn't been made public. Apple declined to comment.

[...] Apple devices currently handle complex artificial intelligence processes with two different chips: the main processor and the graphics chip. The new chip would let Apple offload those tasks onto a dedicated module designed specifically for demanding artificial intelligence processing, allowing Apple to improve battery performance.

Should Apple bring the chip out of testing and development, it would follow other semiconductor makers that have already introduced dedicated AI chips. Qualcomm Inc.'s latest Snapdragon chip for smartphones has a module for handling artificial intelligence tasks, while Google announced its first chip, called the Tensor Processing Unit (TPU), in 2016.

Google will supposedly put mini TPUs into smartphones in the coming years.

Google's New TPUs are Now Much Faster -- will be Made Available to Researchers

Original Submission

Google Announces Edge TPU 14 comments

Google unwraps its gateway drug: Edge TPU chips for IoT AI code; Custom ASICs make decisions on sensors as developers get hooked on ad giant's cloud

Google has designed a low-power version of its homegrown AI math accelerator, dubbed it the Edge TPU, and promised to ship it to developers by October. Announced at Google Next 2018 today, the ASIC is a cutdown edition of its Tensor Processing Unit (TPU) family of in-house-designed coprocessors. TPUs are used internally at Google to power its machine-learning-based services, or are rentable via its public cloud. These chips are specific[ally] designed for and used to train neural networks and perform inference.

Now the web giant has developed a cut-down inference-only version suitable for running in Internet-of-Things gateways. The idea is you have a bunch of sensors and devices in your home, factory, office, hospital, etc, connected to one of these gateways, which then connects to Google's backend services in the cloud for additional processing.

Inside the gateway is the Edge TPU, plus potentially a graphics processor, and a general-purpose application processor running Linux or Android and Google's Cloud IoT Edge software stack. This stack contains lightweight Tensorflow-based libraries and models that access the Edge TPU to perform AI tasks at high speed in hardware. This work can also be performed on the application CPU and GPU cores, if necessary. You can use your own custom models if you wish.

The stack ensures connections between the gateway and the backend are secure. If you wanted, you could train a neural network model using Google's Cloud TPUs and have the Edge TPUs perform inference locally.

Google announcement. Also at TechCrunch, CNBC, and CNET.

Related: Google's New TPUs are Now Much Faster -- will be Made Available to Researchers
Google Renting Access to Tensor Processing Units (TPUs)
Nvidia V100 GPUs and Google TPUv2 Chips Benchmarked; V100 GPUs Now on Google Cloud

Original Submission

AlphaGo Zero Makes AlphaGo Obsolete 39 comments

Google DeepMind researchers have made their old AlphaGo program obsolete:

The old AlphaGo relied on a computationally intensive Monte Carlo tree search to play through Go scenarios. The nodes and branches created a much larger tree than AlphaGo practically needed to play. A combination of reinforcement learning and human-supervised learning was used to build "value" and "policy" neural networks that used the search tree to execute gameplay strategies. The software learned from 30 million moves played in human-on-human games, and benefited from various bodges and tricks to learn to win. For instance, it was trained on games from master-level human players, rather than picking the game up from scratch.

AlphaGo Zero did start from scratch with no experts guiding it. And it is much more efficient: it uses only a single computer and four of Google's custom TPU1 chips to play matches, compared to AlphaGo's several machines and 48 TPUs. Since Zero didn't rely on human gameplay, and played a smaller number of matches, its Monte Carlo tree search is smaller. The self-play algorithm also combined both the value and policy neural networks into one, and was trained on 64 GPUs and 19 CPUs over a few days by playing nearly five million games against itself. In comparison, AlphaGo needed months of training and used 1,920 CPUs and 280 GPUs to beat Lee Sedol.

Through self-play, AlphaGo Zero even discovered for itself, without human intervention, classic moves in the theory of Go, such as fuseki opening tactics, and what's called life and death. More details can be found in Nature, or from the paper directly here. Stanford computer science academic Bharath Ramsundar has a summary of the more technical points, here.

Go is an abstract strategy board game for two players, in which the aim is to surround more territory than the opponent.

Previously: Google's New TPUs are Now Much Faster -- will be Made Available to Researchers
Google's AlphaGo Wins Again and Retires From Competition

Original Submission

  • (Score: 0) by Anonymous Coward on Friday May 19 2017, @01:17AM (1 child)

    by Anonymous Coward on Friday May 19 2017, @01:17AM (#511927)

    > takyon writes:

    I've been wondering, is it:
      + ta - ka - on
      + tak - yon
      + tacky - on
    or some other pronunciation?

  • (Score: 2) by MichaelDavidCrawford on Friday May 19 2017, @01:33AM (3 children)

    The TPU is smarter than our Commander-in-Chief.

    Yes I Have No Bananas.
    • (Score: 3, Interesting) by takyon on Friday May 19 2017, @01:37AM (2 children)

      by takyon (881) <reversethis-{gro ... s} {ta} {noykat}> on Friday May 19 2017, @01:37AM (#511932) Journal

      TPUs are ineligible to become the President of the United States.

      Here's a short story idea for mcgrew: Donald Trump becomes the first human to be mind uploaded, and the resulting AI quickly becomes smarter than any other human while discovering profound insights about its copied former experiences.

      [SIG] 10/28/2017: Soylent Upgrade v14
      • (Score: 0) by Anonymous Coward on Friday May 19 2017, @05:01AM

        by Anonymous Coward on Friday May 19 2017, @05:01AM (#512026)

        TPUs are ineligible to become the President of the United States.

        Well in that case, my vote goes to the inanimate carbon rod.

      • (Score: 2) by DeathMonkey on Friday May 19 2017, @05:53PM

        by DeathMonkey (1380) on Friday May 19 2017, @05:53PM (#512277) Journal

        "Holy crap I was literally wrong about everything." THE END

  • (Score: 2) by takyon on Friday May 19 2017, @01:52AM (4 children)

    by takyon (881) <reversethis-{gro ... s} {ta} {noykat}> on Friday May 19 2017, @01:52AM (#511936) Journal

    That's right, GV100/Tesla V100 is rated for 120 TFLOPS of "tensor operations".

    So how much does a TPU cost?

    More importantly, despite having many more arithmetic units and large on-chip memory, the TPU chip is half the size of the other chips. Since the cost of a chip is a function of the area³ — more smaller chips per silicon wafer, and higher yield for small chips since they're less likely to have manufacturing defects — halving chip size reduces chip cost by roughly a factor of 8 (2³).

    Still not enough info, but they might be onto something.
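That cubic rule of thumb from the quote is easy to sketch (the exponent is the quote's heuristic, not a physical law):

```python
# Heuristic from the quote: chip cost ~ area^3, because a smaller die yields
# more chips per wafer and loses fewer dies to defects.
def relative_cost(area_ratio, exponent=3):
    """Cost of a chip relative to a baseline, given its relative area."""
    return area_ratio ** exponent

print(relative_cost(0.5))      # 0.125: half the area -> roughly 1/8 the cost
```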

    [SIG] 10/28/2017: Soylent Upgrade v14
    • (Score: 2) by takyon on Friday May 19 2017, @02:09AM (3 children)

      by takyon (881) <reversethis-{gro ... s} {ta} {noykat}> on Friday May 19 2017, @02:09AM (#511942) Journal

      The world's fastest supercomputer, Sunway TaihuLight, has 40,960 "Chinese-designed SW26010 manycore 64-bit RISC processors based on the Sunway architecture". Speed is 105 petaflops, 125 petaflops peak (LINPACK, so take it with some salt).

      I believe the "Cloud TPU" is 4 smaller TPUs in one unit (not sure). So the tensor performance per individual TPU is 45 (tensor) "teraflops". That gives you these numbers:

      • Google will make 1,000 Cloud TPUs (44 petaflops) available at no cost to ML researchers via the TensorFlow Research Cloud.
      • 24 second-generation TPUs would deliver over 1 petaflops
      • 256 second-generation TPUs in a cluster can deliver 11.5 petaflops

      It seems to scale well. Anyway, to reach 125 petaflops you would need 2,778 of them, and to get to 1 exaflops, 22,222. It would probably cost well under a billion dollars for Google to build the machine learning equivalent of an exaflop.
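The comment's arithmetic can be reproduced, assuming 45 tensor teraflops per individual TPU (a quarter of a 180-TFLOPS Cloud TPU):

```python
import math

# TPUs needed to hit a target, per the comment's 45-teraflops-each assumption.
PER_TPU_TFLOPS = 45.0

def tpus_needed(target_tflops):
    return math.ceil(target_tflops / PER_TPU_TFLOPS)

print(tpus_needed(125_000))    # 2778 for 125 petaflops (TaihuLight's peak)
print(tpus_needed(1_000_000))  # 22223 for an exaflop (the comment rounds to 22,222)
```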

      [SIG] 10/28/2017: Soylent Upgrade v14
      • (Score: 2) by kaszz on Friday May 19 2017, @04:32AM (2 children)

        by kaszz (4211) on Friday May 19 2017, @04:32AM (#512013) Journal

        36.8e15 FLOPS is the estimated computational power required to simulate a human brain in real time.

        At a price of 30 million dollars?
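The price can't be checked, since Google never published TPU pricing, but the FLOPS side of the question can be sketched: matching that estimate would nominally take a couple hundred Cloud TPUs (peak numbers only; sustained performance and memory requirements would differ):

```python
import math

# Cloud TPUs (180 teraflops each) nominally needed to match the cited
# 36.8e15 FLOPS brain-simulation estimate.
brain_flops = 36.8e15
cloud_tpu_flops = 180e12
print(math.ceil(brain_flops / cloud_tpu_flops))  # 205
```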

        • (Score: 2) by HiThere on Friday May 19 2017, @05:10PM (1 child)

          by HiThere (866) on Friday May 19 2017, @05:10PM (#512258) Journal

          It depends on which estimate you use; the estimates don't even agree to within an order of magnitude. Particularly if you allow the exclusion of parts of the brain that are dedicated to, e.g., handling blood chemistry. And particularly if you include speculation that some quantum effects happen in thought.

          In fact, the entire basis of thought isn't really understood, so flops might be a poor way to simulate it. Perhaps integer arithmetic is better. Or fixed point. That flops are important is due to the selected algorithm, and I'm really dubious about it. That said, this doesn't imply that the current "deep learning" approach won't work. It's just that you can't assume that its computational requirements will be equivalent. They could also be either much higher or much lower.

          Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
          • (Score: 2) by kaszz on Friday May 19 2017, @05:27PM

            by kaszz (4211) on Friday May 19 2017, @05:27PM (#512270) Journal

            Well now that the capacity becomes available. Maybe it will enable research to find out?

  • (Score: 2) by kaszz on Friday May 19 2017, @04:26AM (4 children)

    by kaszz (4211) on Friday May 19 2017, @04:26AM (#512011) Journal

    It seems how the chip works is pretty well known. So when will the first open source chip show up?
    Any researcher that uses these online TPUs can be sure that Google gobbles it up. And any productivity, patent, or value addition goes to shareholders etc.

    An open source solution will enable more free development at an initial performance penalty. In the meantime FPGAs can serve as a test platform?

    • (Score: 0) by Anonymous Coward on Friday May 19 2017, @09:46AM (1 child)

      by Anonymous Coward on Friday May 19 2017, @09:46AM (#512102)

      It seems how the chip works is pretty well known. So when will the first open source chip show up?

      Is some patented technology involved?

      • (Score: 2) by kaszz on Friday May 19 2017, @04:50PM

        by kaszz (4211) on Friday May 19 2017, @04:50PM (#512252) Journal

        Do it in China and get back through mailbag with "electric stuff" ?

    • (Score: 2) by LoRdTAW on Friday May 19 2017, @04:38PM (1 child)

      by LoRdTAW (3755) on Friday May 19 2017, @04:38PM (#512245) Journal

      So when will the first open source chip show up?

      I have a feeling that it won't happen. TPUs might never see the light of day outside of a Google data center because Google would lose control of the AI race. We may see open source tools to talk to said AI systems but the platform itself will be a proprietary black box.

      We are moving rapidly towards a closed computing world where the likes of Google, Amazon, and others seek to aggregate all of your computing needs into THEIR data centers. The idea being that everyone has to pay a recurring fee or is duped into freely working for said companies by letting them commoditize and sell data mined from every corner of your digital life. You will also be duped into doing the dirty footwork of gathering other data, by letting them poke through your videos, photos, locations and more.

      All we will be left with are locked down dumb terminals in the form of tablets, "smart" TVs and phones. Desktop/laptop computers will eventually be abandoned by said companies to "focus on delivering a user friendly multimedia platform". They will be about as modern and stylish as 1970s decor. The walled gardens which are for now avoidable might one day be the only choice. I once thought it was paranoia to think like this but now it is becoming more and more real at a faster and faster pace. VR, AR, AI, and every other buzz acronym will have us enslaved one day. But not to a giant computer; to the almighty dollar, the man behind the curtain.

      • (Score: 2) by kaszz on Friday May 19 2017, @04:47PM

        by kaszz (4211) on Friday May 19 2017, @04:47PM (#512250) Journal

        It will be the data center behind the fiber ;-)

        The component to make free is the process to make chips. Once that is accomplished their monopoly significantly decreases.

  • (Score: 2) by jasassin on Friday May 19 2017, @04:48AM (1 child)

    by jasassin (3566) on Friday May 19 2017, @04:48AM (#512022) Journal

    Yeah, but can it run Grand Theft Auto 5 at 4K over 60FPS?

    -- Key fingerprint = 0644 173D 8EED AB73 C2A6 B363 8A70 579B B6A7 02CA
    • (Score: 2) by kaszz on Friday May 19 2017, @06:04AM

      by kaszz (4211) on Friday May 19 2017, @06:04AM (#512041) Journal

      Maybe if you can wire a really fast bus between a board with TPUs and the graphics card. Probably using the SLI port or PCI-e data pair. Then it's just a matter of writing some code.

  • (Score: 1, Informative) by Anonymous Coward on Friday May 19 2017, @11:40AM (1 child)

    by Anonymous Coward on Friday May 19 2017, @11:40AM (#512121)

    TL;DR TPU = google's new funky spy processor

    Tensor processing units (or TPUs) are application-specific integrated circuits (ASICs) developed specifically for machine learning. Compared to graphics processing units, they are designed explicitly for a higher volume of reduced precision computation (e.g. as little as 8-bit precision) with higher IOPS per watt, and lack hardware for rasterisation/texture mapping. The chip has been specifically designed for Google's TensorFlow framework, however Google still uses CPUs and GPUs for other types of machine learning. Other AI accelerator designs are appearing from other vendors also and are aimed at embedded and robotics markets.

    (shit like this would be good in the summary BTW HTH LOL)

    • (Score: 0) by Anonymous Coward on Friday May 19 2017, @12:19PM

      by Anonymous Coward on Friday May 19 2017, @12:19PM (#512134)

      > ... Compared to graphics processing units, they are designed explicitly for a higher volume of reduced precision computation (e.g. as little as 8-bit precision) ...

      When you say it like that, it almost sounds like ML could be viewed as an extension of Fuzzy Logic, or at least use Fuzzy as an analogy? (sorry, don't have any car analogies today)

      My take (back then) was that the Fuzzy proponents took a very low precision approach to control systems--but higher precision than the most simple controllers like a bang-bang thermostat. Instead of all that boring system identification and modeling of the plant in classical/analog control theory, Fuzzy promised quick 'n dirty stable controllers that worked "well enough" for some applications.