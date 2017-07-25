GPUhammer is the first to flip bits in onboard GPU memory. It likely won't be the last:
Nvidia is recommending a mitigation for customers of one of its GPU product lines that will degrade performance by up to 10 percent in a bid to protect users from exploits that could let hackers sabotage work projects and possibly cause other compromises.
The move comes in response to an attack a team of academic researchers demonstrated against Nvidia's RTX A6000, a widely used GPU for high-performance computing that's available from many cloud services. A vulnerability the researchers discovered opens the GPU to Rowhammer, a class of attack that exploits physical weakness in DRAM chip modules that store data.
Rowhammer allows hackers to change or corrupt data stored in memory by rapidly and repeatedly accessing—or hammering—a physical row of memory cells. By repeatedly hammering carefully chosen rows, the attack induces bit flips in nearby rows, meaning a digital zero is converted to a one or vice versa. Until now, Rowhammer attacks have been demonstrated only against memory chips for CPUs, used for general computing tasks.
[...] The researchers' proof-of-concept exploit was able to tamper with deep neural network models used in machine learning for things like autonomous driving, healthcare applications, and medical imaging for analyzing MRI scans. GPUHammer flips a single bit in the exponent of a model weight (for example in y, where a floating-point number is represented as x times 2^y). The single bit flip can increase the exponent value by 16. The result is an altering of the model weight by a whopping factor of 2^16, degrading model accuracy from 80 percent to 0.1 percent, said Gururaj Saileshwar, an assistant professor at the University of Toronto and co-author of an academic paper demonstrating the attack.
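To make the exponent arithmetic concrete, here is a minimal sketch of a single exponent bit flip. It assumes the weight is stored as an IEEE 754 half-precision (FP16) value, which the summary above does not state; the helper name `flip_bit_fp16` and the sample weight 0.5 are made up for illustration. In FP16 the top exponent bit is worth 16, so flipping it from 0 to 1 multiplies the value by 2^16.

```python
import struct

def flip_bit_fp16(value: float, bit: int) -> float:
    """Return `value`, stored as IEEE 754 half precision, with one bit flipped."""
    # Pack as a 2-byte FP16 value, then reinterpret the bytes as a 16-bit integer.
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    raw ^= 1 << bit  # flip the chosen bit
    # Reinterpret the modified 16-bit pattern as FP16 again.
    (flipped,) = struct.unpack("<e", struct.pack("<H", raw))
    return flipped

# FP16 layout: bit 15 = sign, bits 14..10 = exponent, bits 9..0 = mantissa.
EXPONENT_MSB = 14  # this exponent bit has weight 16

weight = 0.5  # illustrative weight, not taken from any real model
corrupted = flip_bit_fp16(weight, EXPONENT_MSB)
print(weight, "->", corrupted, "ratio:", corrupted / weight)
# prints: 0.5 -> 32768.0 ratio: 65536.0
```

The same flip in the other direction (bit already set) would shrink the weight by 2^16 instead, and a flip that drives all exponent bits to 1 would produce infinity or NaN rather than a finite value.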
"This is like inducing catastrophic brain damage in the model: with just one bit flip, accuracy can crash from 80% to 0.1%, rendering it useless," Saileshwar wrote in an email. "With such accuracy degradation, a self-driving car may misclassify stop signs (reading a stop sign as a speed limit 50 mph sign), or stop recognizing pedestrians. A healthcare model might misdiagnose patients. A security classifier may fail to detect malware."
In response, Nvidia is recommending users implement a defense that could degrade overall performance by as much as 10 percent. Among machine learning inference workloads the researchers studied, the slowdown affects the "3D U-Net ML Model" the most. This model is used for an array of HPC tasks, such as medical imaging.
The performance hit is caused by the resulting reduction in bandwidth between the GPU and the memory module, which the researchers estimated as 12 percent. There's also a 6.25 percent loss in memory capacity across the board, regardless of the workload. Performance degradation will be the highest for applications that access large amounts of memory.
A figure in the researchers' academic paper provides the overhead breakdowns for the workloads tested.
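As a back-of-the-envelope reading of the bandwidth and capacity figures above, the sketch below applies the two percentages to an RTX A6000. The card specs (48 GB of memory, 768 GB/s peak bandwidth) are assumptions not taken from the summary, and the function name `after_mitigation` is hypothetical.

```python
# Assumed card specs (not stated in the article summary).
CAPACITY_GB = 48
BANDWIDTH_GBPS = 768

CAPACITY_LOSS = 0.0625  # 6.25 percent, i.e. 1/16 of memory
BANDWIDTH_LOSS = 0.12   # the researchers' 12 percent estimate

def after_mitigation(capacity_gb: float, bandwidth_gbps: float):
    """Return (usable capacity in GB, remaining bandwidth in GB/s)."""
    usable = capacity_gb * (1 - CAPACITY_LOSS)
    bandwidth = bandwidth_gbps * (1 - BANDWIDTH_LOSS)
    return usable, bandwidth

usable, bandwidth = after_mitigation(CAPACITY_GB, BANDWIDTH_GBPS)
print(f"usable memory: {usable:.1f} GB, bandwidth: {bandwidth:.1f} GB/s")
# prints: usable memory: 45.0 GB, bandwidth: 675.8 GB/s
```

Under those assumptions the mitigation costs about 3 GB of memory and roughly 92 GB/s of peak bandwidth, which is consistent with memory-heavy workloads seeing the largest slowdown.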
(Score: 3, Insightful) by weirsbaski on Friday July 18, @04:08AM (3 children)
I'll give them healthcare models and security classifiers which may run on PC's, but Rowhammer in self-driving cars? Aren't car's computers closed systems, where dangerous s/w couldn't be added without being signed or some such? Otherwise what would stop me from fixing the s/w bugs and annoyances in my car's infotainment system?
(Score: 0) by Anonymous Coward on Friday July 18, @04:19AM
Your Game might show the enemy over HERE, but really they're over THERE! *Shunk!* you lose.
Be afraaaaaaid. Be veeerrrryyy afraid!!!
sigh. It's like the "publish or perish" of academia has hit the security industry, so now we're inundated with bullshit.
(Score: 1, Insightful) by Anonymous Coward on Friday July 18, @05:18AM
They shouldn't be doing such stuff on the cloud anyway.
For on prem, if you are already running malicious software on the MRI computers, you're doing things badly wrong, or the hacker has pwned you so badly that there's little point for the hacker to waste time with rowhammer attacks.
Moral of the story? Don't use cloud stuff if security matters that much.
Don't forget some cloud providers might have a worse track record for security (or reliability and availability) than your own organization: https://www.bleepingcomputer.com/news/security/stolen-microsoft-key-offered-widespread-access-to-microsoft-cloud-services/ [bleepingcomputer.com]
(Score: 3, Funny) by Unixnut on Friday July 18, @11:07AM
> Aren't car's computers closed systems, where dangerous s/w couldn't be added without being signed or some such?
Depends on whether they offload some of the processing to the cloud. Modern cars are "always-on" connected, so if you have an internet link that is always available, I can see that car manufacturers could save money on component cost by offloading processing to the cloud (especially if they just outsource the back-end processing to the lowest bidder). Sure low latency stuff would have to stay local, but for example classifiers (such as what constitutes a stop sign or a pedestrian) could be offloaded as those things don't change pattern so often to require low-latency processing.
(Score: 2) by Rich on Friday July 18, @03:09PM (1 child)
I seriously wonder why error correction (or at least detection) isn't a thing at all, not even in "controlled" sectors like medical. The frickin' first "PC" ever, the IBM 5150, had memory error detection. Modern consumer DRAM might have some sort of on-die ECC (cf. https://www.synopsys.com/articles/ecc-memory-error-correction.html [synopsys.com]) as a tradeoff between extra area and yield. E.g. the manufacturer ships chips fully aware that they have errors, but relies on ECC to catch them, much like with NAND flash. But if that would be a serious thing, Rowhammer wouldn't be an issue.
And even if hard ECC couldn't be done because of razor margins on hardware and buyers always getting the cheapest, some kind of detection could be done in software. 99% of the time, a computer sits idle. It could easily build checksums over the memory pages and occasionally check those, or discard them on a write access (or do a check before the first write to a "cold" page becoming "hot"). Only the "hot" pages would go unchecked, and even these could be rotated around.
I once searched for the latter. I'd expect a tinkering-friendly system like Linux to have such stuff at least somewhere. Someone even brought up that question. Total apathy as response, even though variations of Rowhammer are still a thing. Some software might have access patterns that accidentally flip bits. Or one, a single one, of the 274877906944 memory capacitors in the machine I type this on might be slightly below the quality of its 274877906943 peers and very occasionally have a little amnesia attack.
Soft ECC could
- detect cosmic ray (or rowhammer-like) flips and decide what to do depending on whether it happened
-- in unused memory (just count it)
-- in cache memory (discard it and count it)
-- in a live process (warn the user and count it)
and with enough accumulated count, if seen in a single location
- point out where weak memory is and maybe sort of bad-block it.
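A rough sketch of the soft-ECC idea above, in Python for readability. It is only a toy model: a `bytearray` stands in for physical memory, and the class name `SoftScrubber`, the 4096-byte page size, and the threshold of 3 are all assumptions made for illustration. A real implementation would live in the kernel and hook page-table write faults. Note that a checksum only detects a flip; it cannot correct it.

```python
import zlib
from collections import Counter

PAGE_SIZE = 4096  # assumed page size in bytes

class SoftScrubber:
    """Checksums 'cold' pages and counts pages whose contents changed silently."""

    def __init__(self, memory: bytearray):
        self.memory = memory
        self.checksums = {}           # page index -> CRC32 recorded when the page went cold
        self.flip_counts = Counter()  # page index -> number of detected corruptions

    def _page(self, idx: int) -> bytes:
        start = idx * PAGE_SIZE
        return bytes(self.memory[start:start + PAGE_SIZE])

    def seal(self, idx: int) -> None:
        """Record a checksum for a page that has gone cold."""
        self.checksums[idx] = zlib.crc32(self._page(idx))

    def check(self, idx: int) -> bool:
        """Verify one page; hot (unsealed) pages go unchecked."""
        expected = self.checksums.get(idx)
        if expected is None or zlib.crc32(self._page(idx)) == expected:
            return True
        self.flip_counts[idx] += 1
        self.seal(idx)  # re-baseline so the same flip is not counted twice
        return False

    def write(self, idx: int, offset: int, data: bytes) -> bool:
        """Legitimate write: check the cold page first, then mark it hot."""
        ok = self.check(idx)
        start = idx * PAGE_SIZE + offset
        self.memory[start:start + len(data)] = data
        self.checksums.pop(idx, None)  # discard the checksum on write access
        return ok

    def scrub(self) -> list:
        """Idle-time pass over all cold pages; returns the corrupted page indices."""
        return [idx for idx in list(self.checksums) if not self.check(idx)]

    def weak_pages(self, threshold: int = 3) -> list:
        """Pages with enough accumulated flips to be candidates for bad-blocking."""
        return [idx for idx, n in self.flip_counts.items() if n >= threshold]

memory = bytearray(4 * PAGE_SIZE)
scrubber = SoftScrubber(memory)
for page in range(4):
    scrubber.seal(page)
memory[2 * PAGE_SIZE + 100] ^= 0x08  # simulate a single bit flip in page 2
print(scrubber.scrub())  # prints: [2]
```

What to do on detection (just count, discard a cache page, warn the user) would be decided by the caller based on what the page was being used for; this sketch only counts.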
(Score: 3, Insightful) by kolie on Friday July 18, @10:36PM
ECC circuitry costs power, density, and heat optimization. We want bigger fast larger cheaperest models NAO!!