posted by
cmn32480
on Saturday March 12 2016, @08:27PM
from the teeny-websites-shouldn't-need-all-that-compute dept.
Facebook has worked closely with Intel to create the new Xeon-D class of CPUs. By eliminating the chipset logic needed to coordinate two separate physical CPUs, optimising the CPU for their workload, and reducing the footprint and power of other components, they can achieve better throughput per watt (and per rack) than with 2-CPU servers.
For a reasonably deep description, read more on Facebook's Hardware Blog.
Facebook and Intel Collaborate on Single CPU Servers
(Score: 0) by Anonymous Coward on Saturday March 12 2016, @08:33PM
Engineers who designed CPUs for Facebook still refuse to use Facebook for any reason.
(Score: 3, Funny) by takyon on Saturday March 12 2016, @08:55PM
Maybe they use it to spy on their kids.
[SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
(Score: 1, Touché) by Anonymous Coward on Sunday March 13 2016, @04:56PM
Instead of reading Grimms' fairy tales, future parents may tell a scary bedtime story about Facebook and Intel.
(Score: 2) by bitstream on Saturday March 12 2016, @11:11PM
I'm kind of surprised that multi-CPU setups would be less efficient. So the question is whether that is a design fault or something fundamental (fragmenting the problem uses more resources than is gained by parallelization).
(Score: 2) by frojack on Saturday March 12 2016, @11:45PM
Was it multiple CPUs or multiple cores that they were trying to eliminate?
I can see that the former might have more bottlenecks and power issues than multi-core, but since I refuse to click any Facebook link I have only TFS to go on.
No, you are mistaken. I've always had this sig.
(Score: 0) by Anonymous Coward on Sunday March 13 2016, @12:42AM
Looks like it boiled down to an efficiency issue. Multi-CPU setups wasted time keeping memory operations in sync and ended up being more expensive on an instructions-per-watt basis.
Some of this also seems specific to their software architecture.
(Score: 2) by bitstream on Sunday March 13 2016, @01:07AM
If it's specific to Facebook, then SMP is OK. If it's general, then there's a big problem.
(Score: 0) by Anonymous Coward on Sunday March 13 2016, @01:39AM
TFA mentioned NUMA as something that was going away, so cache coherence and occasional non-local memory access were evidently bottlenecks. Remember, the cost is based on watts as well as the price of the chips.
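To put rough numbers on that trade-off, a toy cost model (Python; the latencies and local-hit fraction here are invented for illustration, not measurements from TFA):
    # Toy model of the "NUMA tax": average DRAM latency on a 2-socket box
    # vs. a single-socket box. All numbers are hypothetical illustrations.
    LOCAL_NS = 90     # assumed latency to socket-local DRAM
    REMOTE_NS = 150   # assumed latency across the socket interconnect

    def avg_latency_ns(local_fraction):
        """Average DRAM latency given the fraction of accesses served locally."""
        return local_fraction * LOCAL_NS + (1 - local_fraction) * REMOTE_NS

    print(avg_latency_ns(1.0))  # single socket: 90.0 ns, everything is local
    print(avg_latency_ns(0.7))  # two sockets, 70% local: 108.0 ns
Even with most accesses staying local, the remote ones drag the average up, and the coherence traffic between sockets costs watts on top of that.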
(Score: 0) by Anonymous Coward on Sunday March 13 2016, @04:35PM
So here is what they did.
1 CPU per board, so no trying to keep the caches coherent between two CPUs (you still have to within the cores). They move as much of the bridge/chipset logic onto the CPU as they can (memory controller, PCIe controller). Not quite an SoC, but getting there. This also removes a decent amount of the power/space those parts were using.
They lowered the power from 120 W to 65 W per CPU.
They increased the number of cores in each CPU from 12 to 16.
They said many of their workloads are single-thread bound, so more cores = better perf for them.
The lower power per board let them put 4 boards in a 'sled' instead of 1: the old dual-CPU board drew ~240 W, and 4 smaller boards draw about the same total but pack 2x the CPUs.
In the same space they went from 24 threads to 64 for the same power footprint, more if they enable hyperthreading and use it correctly; they say they are not I/O bound, so hyperthreading may hurt them.
They increased the memory from 32 GB to 128 GB per 'sled', with each CPU getting a dedicated 32 GB instead of two CPUs sharing 32 GB as before.
They also changed the network interface a bit. Whereas before each 'sled' had its own dedicated NIC, now 4 boards share one (same-ish design), and instead of 2 CPUs sharing 1 IP, each CPU has its own IP. They alleviated the sharing a bit by increasing the capacity from 10 Gb to 50 Gb, and they also put a router into the local NIC so the CPUs can talk to each other without having to go out to the external router and back.
1 CPU per board = less power = more memory = more boards = higher throughput (rough math in the sketch below).
Not a terribly surprising result for the type of web pages they serve. It's still interesting that they need such a beefy CPU to serve up web pages; they even mentioned that, as they considered using Atom/ARM CPUs.
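Putting the quoted figures into a quick back-of-the-envelope script (Python; the numbers are the ones from this comment, and equal per-core performance is an assumption TFA doesn't make):
    # Rough sled arithmetic using the numbers quoted above.
    old = {"cpus": 2, "watts_per_cpu": 120, "cores_per_cpu": 12}
    new = {"cpus": 4, "watts_per_cpu": 65, "cores_per_cpu": 16}

    def sled(cfg):
        """Total watts, total cores, and cores per watt for one sled."""
        watts = cfg["cpus"] * cfg["watts_per_cpu"]
        cores = cfg["cpus"] * cfg["cores_per_cpu"]
        return watts, cores, cores / watts

    for name, cfg in (("old 2-CPU sled", old), ("new 4x1-CPU sled", new)):
        watts, cores, eff = sled(cfg)
        print(f"{name}: {watts} W, {cores} cores, {eff:.3f} cores/W")
    # old 2-CPU sled: 240 W, 24 cores, 0.100 cores/W
    # new 4x1-CPU sled: 260 W, 64 cores, 0.246 cores/W
So for roughly the same power envelope the new sled delivers about 2.5x the cores per watt, which matches the comment's "same power, 2x+ the CPU" summary.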
(Score: 1, Informative) by Anonymous Coward on Sunday March 13 2016, @01:25AM
In a purely theoretical world, it would be more efficient. In the real world, unfortunately, compromises are made. The problem is RAM: not only is it (much) slower than current CPUs, but it's single-ported in most current (PC, etc.) architectures. So each CPU can only work at full speed on its internal cache's contents, and at some point it has to read and write main RAM, and then and there you have the big bottleneck.
Many RAM "banks" behind a large, fast motherboard-level cache would help (conceptually similar to RAID striping, if you're familiar with that; see the sketch below). Maybe someone is doing it, but I haven't seen motherboard cache RAM in a long, long time.
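A minimal sketch of that striping idea (Python; the bank count and line size are arbitrary illustration values, i.e. the RAID-0 analogy, not any real memory controller):
    # RAID-0-style interleaving of cache lines across RAM banks, so that
    # consecutive lines can be served by different banks in parallel.
    LINE_SIZE = 64   # bytes per cache line
    N_BANKS = 4      # hypothetical number of independent banks

    def bank_for(addr):
        """Which bank serves the cache line containing this address."""
        return (addr // LINE_SIZE) % N_BANKS

    # A sequential scan round-robins across all four banks:
    for addr in range(0, 6 * LINE_SIZE, LINE_SIZE):
        print(f"line at {addr:#06x} -> bank {bank_for(addr)}")
The win comes from sequential scans keeping all banks busy at once instead of queuing on a single port.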
Today's hardware is amazingly fast, but it would be much better utilized by software that was specifically written to run on the multi-CPU hardware, considering CPU cache sizes and general FSB/RAM architecture.
I do some admin work, and as an EE it's been bugging me to see ever faster and more CPU cores, more RAM, virtualization, etc. Yes, you can jam more server instances into the same rack space, but there's a tremendous loss of efficiency with all of the software _and_ hardware context switching.
(Score: 0) by Anonymous Coward on Sunday March 13 2016, @02:56AM
Look at POWER7 and POWER8 from IBM: designed to cross-connect multiple CPUs (in blocks of 4), each with its own RAM. I used to use a POWER5 with 64 cores over 16 CPUs with 2 TB of memory. Each core also ran 6 threads at a time; think Intel hyper-threading, just bigger. At that time you could install one of these machines as a single system (we had 5000 users running across 16 LPARs (VMs, to most of you)).
(Score: 2) by bitstream on Sunday March 13 2016, @12:51PM
But did all these cores with local RAM end up more efficient than running the software on single-core machines? And POWER chips are also more expensive than others, AFAIK. So there are many variables to take into account in order to properly compare setups.
(Score: 2) by bitstream on Sunday March 13 2016, @12:46PM
The first line of solution is perhaps to write software with the model that the core(s) and their L1/L2/L3 caches are used fully, and frequent trips to external RAM are avoided in most cases (sketched below). I expect that Facebook has optimized their software far enough that software isn't the roadblock. They have such a specific application, and enough resources, that deep optimization makes sense.
Is the CPU socket a large hindrance? If so, then physically separate CPUs are the way around it, but then one might as well have separate computers. If the CPU socket has the capacity, then one could, as you write, use some memory striping to push toward faster speeds. Perhaps that's what the AMD AM3 socket bus did with dual-channel RAM access?
Why RAM striping on the motherboard isn't widely used is kind of counter-intuitive. And yeah, all this virtualization stuff is great, but context switching is not saving resources. Nor does there seem to be any intentional design to make all the parts work together efficiently; a simple example is the propagation latency through all the cache levels, with no way for software to hint that memory X should be fetched from motherboard RAM directly, etc.
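A minimal sketch of that cache-resident style (Python; BLOCK is a stand-in for "whatever fits in L1/L2", and in CPython this is only schematic, since the interpreter adds its own memory traffic):
    # Cache blocking: process data in chunks sized to stay resident in
    # cache, touching external RAM roughly once per chunk instead of
    # repeatedly per element across the whole dataset.
    BLOCK = 4096  # placeholder for "fits in L1/L2" on real hardware

    def blocked_sum_of_squares(data):
        """Two passes per block; the second pass hits cache, not DRAM."""
        total = 0
        for start in range(0, len(data), BLOCK):
            chunk = data[start:start + BLOCK]  # pulled from RAM once
            scaled = [x * x for x in chunk]    # re-reads the chunk while hot
            total += sum(scaled)
        return total

    print(blocked_sum_of_squares(list(range(10000))))  # 333283335000
The same pattern in C with explicitly sized tiles is what HPC code does to keep a core fed from its own cache instead of stalling on RAM.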
(Score: 2) by davester666 on Sunday March 13 2016, @08:44AM
Not surprising. They are using one CPU with more cores vs. two CPUs with fewer cores each. Fewer CPUs mean fewer support chips to access memory, network, disk, etc.
(Score: 2) by bitstream on Sunday March 13 2016, @01:04PM
Fewer CPUs also means that parts of the processing have to be done elsewhere. So the simplified question is whether one machine with a dual-core CPU is more efficient than two machines, each with its own single-core CPU. The article seems to suggest the latter is the case, which makes one question the usefulness of the multi-CPU paradigm.
(Score: 0) by Anonymous Coward on Sunday March 13 2016, @02:40AM
Yeah, two boxes with a single CPU each will be more efficient than a single box with two CPUs. WTF.
(Score: 2) by linkdude64 on Sunday March 13 2016, @04:08AM
"Link your Intel vPro and Facebook user accounts to make logins a breeze!" "MOTD: OMG Did you see Clinton's hair the other night? #TheBomb" drawing a blank on any others cause I just woke up, but damn, this is a match made in Hell. The techno-oligarchy is really coming into form.