
posted by cmn32480 on Saturday March 12 2016, @08:27PM   Printer-friendly
from the teeny-websites-shouldn't-need-all-that-compute dept.

Facebook has worked closely with Intel to create the new Xeon-D class of CPUs. By eliminating the chipset logic needed to connect two separate physical CPUs, optimising the CPU for their workload, and reducing the footprint and power of other components, they can achieve better throughput per watt (and per rack) than with 2-CPU servers.

For a reasonably deep description, read more on Facebook's Hardware Blog.


Original Submission

This discussion has been archived. No new comments can be posted.
  • (Score: 0) by Anonymous Coward on Saturday March 12 2016, @08:33PM

    by Anonymous Coward on Saturday March 12 2016, @08:33PM (#317429)

    Engineers who designed CPUs for Facebook still refuse to use Facebook for any reason.

    • (Score: 3, Funny) by takyon on Saturday March 12 2016, @08:55PM

      by takyon (881) <reversethis-{gro ... s} {ta} {noykat}> on Saturday March 12 2016, @08:55PM (#317433) Journal

      Maybe they use it to spy on their kids.

      --
      [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
      • (Score: 1, Touché) by Anonymous Coward on Sunday March 13 2016, @04:56PM

        by Anonymous Coward on Sunday March 13 2016, @04:56PM (#317673)

        Instead of reading Grimms' fairy tales, future parents may tell a scary bedtime story about Facebook and Intel.

  • (Score: 2) by bitstream on Saturday March 12 2016, @11:11PM

    by bitstream (6144) on Saturday March 12 2016, @11:11PM (#317473) Journal

    I'm kind of surprised that multi-CPU setups would be less efficient. So the question is whether that is a design fault or something fundamental (fragmenting the problem uses more resources than are gained by parallelization).

    • (Score: 2) by frojack on Saturday March 12 2016, @11:45PM

      by frojack (1554) on Saturday March 12 2016, @11:45PM (#317491) Journal

      Was it multi-CPUs or Multi-Cores that they are trying to eliminate?

      I can see that the former might have more bottlenecks and power issues than multi-cores, but since I refuse to click any Facebook link I have only TFS to go on.

      --
      No, you are mistaken. I've always had this sig.
      • (Score: 0) by Anonymous Coward on Sunday March 13 2016, @12:42AM

        by Anonymous Coward on Sunday March 13 2016, @12:42AM (#317507)

        Looks like it boiled down to an efficiency issue. Multiple CPUs wasted time parallelizing memory operations and ended up being more expensive on an instructions-per-watt basis.

        Some of this also seems specific to their software architecture as well.

        • (Score: 2) by bitstream on Sunday March 13 2016, @01:07AM

          by bitstream (6144) on Sunday March 13 2016, @01:07AM (#317514) Journal

          If it's specific to Facebook, then SMP is OK. If it's general, then there's a big problem.

      • (Score: 0) by Anonymous Coward on Sunday March 13 2016, @01:39AM

        by Anonymous Coward on Sunday March 13 2016, @01:39AM (#317522)

        TFA mentioned NUMA as something that was going away, so cache coherence and occasional non-local memory access were evidently bottlenecks. Remember, the cost is based on watts as well as the price of the chips.

        • (Score: 0) by Anonymous Coward on Sunday March 13 2016, @04:35PM

          by Anonymous Coward on Sunday March 13 2016, @04:35PM (#317664)

          So here is what they did.

          1 CPU per board, so no trying to keep the cache coherent between two CPUs (they still have to between cores). Move as much of the Z120 bridge logic onto the CPU as they can (memory controller, PCI controller). Not quite an SoC, but getting there. This also removes a decent amount of the power/space those parts were using.

          They lowered the power from 120w to 65w per CPU.

          They increased the number of cores in each CPU from 12 to 16.

          They said they are very single thread bound on many things. So more cores = better perf for them.

          This let them put 4 boards in instead of 1 because of the lower power per 'sled': each computer board was taking ~240 W before, and 4 smaller boards take about the same amount but pack 2x the CPUs.

          In the same space they went from 24 threads to 64 for the same power footprint, more if they enable hyperthreading and use it correctly. They say they are not I/O bound, so hyperthreading may hurt them.

          They increased the memory footprint from 32 GB to 128 GB per 'sled', with each CPU getting a dedicated 32 GB instead of 32 GB shared between 2 as before.

          They also changed the network interface a bit. Whereas before each 'sled' had its own dedicated NIC, now 4 boards share one (same-ish design), and instead of 2 CPUs sharing 1 IP, each CPU has its own IP. They alleviated that problem a bit by increasing the capacity from 10 Gb to 50 Gb. They also put a router into the local NIC so the CPUs can talk to each other without having to go out and back to the rack router.

          1 cpu per board = less power = more memory = more boards = higher throughput.

          Not a terribly surprising result for the type of web page they have. It's still interesting that they need such a beefy CPU to serve up webpages; they even mentioned that, as they considered using Atom/ARM CPUs.
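          Sketching the back-of-the-envelope arithmetic above in Python (all figures are the ones quoted in this comment, not measurements):

```python
# Rough arithmetic behind the "1 CPU per board" change, using only the
# numbers quoted in the comment above (none of this is measured data).

OLD_CPUS, OLD_CORES, OLD_WATTS = 2, 12, 120   # old dual-socket sled
NEW_CPUS, NEW_CORES, NEW_WATTS = 1, 16, 65    # new single-socket board
NEW_BOARDS = 4                                # boards per sled now

old_power = OLD_CPUS * OLD_WATTS               # 240 W per sled
new_power = NEW_BOARDS * NEW_CPUS * NEW_WATTS  # 260 W per sled, about the same

old_cores = OLD_CPUS * OLD_CORES               # 24 cores
new_cores = NEW_BOARDS * NEW_CPUS * NEW_CORES  # 64 cores

# Cores per watt come out roughly 2.5x better in the same power envelope.
print(old_cores / old_power, new_cores / new_power)
```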

    • (Score: 1, Informative) by Anonymous Coward on Sunday March 13 2016, @01:25AM

      by Anonymous Coward on Sunday March 13 2016, @01:25AM (#317519)

      I'm kind of surprised that multi-CPU setups would be less efficient. So the question is whether that is a design fault or something fundamental (fragmenting the problem uses more resources than are gained by parallelization).

      In a purely theoretical world, it would be more efficient. In the real world, unfortunately, compromises are made. The problem is RAM: not only is it (much) slower than current CPUs, but it's single-ported in most current (PC, etc.) architectures. So each CPU can only work at speed on its internal cache's contents; at some point it has to read and write main RAM, and then and there you have the big bottleneck.

      Many many RAM "banks" with motherboard-level large fast cache would help (it would be conceptually similar to RAID striping, if you understand that.) Maybe someone is doing it, but I haven't seen motherboard cache RAM in a long long time.
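      A toy sketch of that striping idea; the bank count and line size are made up for illustration, and no real memory controller works exactly like this:

```python
# Toy model of interleaving ("striping") consecutive cache lines across
# independent RAM banks, analogous to RAID-0 striping across disks.

LINE_SIZE = 64      # bytes per cache line (assumed)
NUM_BANKS = 4       # independent banks (assumed)

def bank_for(address: int) -> int:
    """Consecutive cache lines land on consecutive banks, so a
    sequential scan keeps all banks busy in parallel."""
    return (address // LINE_SIZE) % NUM_BANKS

# A sequential scan of 8 cache lines cycles through the banks:
print([bank_for(a) for a in range(0, 8 * LINE_SIZE, LINE_SIZE)])
# -> [0, 1, 2, 3, 0, 1, 2, 3]
```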

      Today's hardware is amazingly fast, but it would be much better utilized by software that was specifically written to run on the multi-CPU hardware, considering CPU cache sizes and general FSB/RAM architecture.

      I do some admin work, and as an EE it's been bugging me seeing faster and more CPU cores, more RAM, virtualization, etc. Yes, you can jam more server instances into the same rack size, but there's a tremendous loss of efficiency with all of the software _and_ hardware context switching.

      • (Score: 0) by Anonymous Coward on Sunday March 13 2016, @02:56AM

        by Anonymous Coward on Sunday March 13 2016, @02:56AM (#317536)

        Look at POWER7 and POWER8 from IBM: designed to cross-connect multiple CPUs (in blocks of 4), with each having its own RAM. I used to use a POWER5 with 64 cores over 16 CPUs and 2 TB of memory. Each core ran 6 threads at a time; think Intel hyper-threading, just bigger. At that time you could install one of these machines for a single user (we had 5000 users running across 16 LPARs — VMs, to most of you).

        • (Score: 2) by bitstream on Sunday March 13 2016, @12:51PM

          by bitstream (6144) on Sunday March 13 2016, @12:51PM (#317610) Journal

          But did all these cores with local RAM become more efficient than running the software on single-core machines? And Power(tm) chips are also more expensive than others, afaik. So there are many variables to take into account to properly evaluate the setups.

      • (Score: 2) by bitstream on Sunday March 13 2016, @12:46PM

        by bitstream (6144) on Sunday March 13 2016, @12:46PM (#317609) Journal

        The first line of solution is perhaps to write software with the model that one can use the core(s) and their L1-L2-L3 caches fully, but should in most cases avoid frequent reliance on external RAM. I expect that Facebook has fully optimized their software such that software isn't the roadblock. They have such a specific application, and enough resources, that it makes sense to go for deep optimization.

        Is the CPU socket a large hindrance? If so, then physically separate CPUs are the way around it, but then one might as well have separate computers. If the CPU socket has the capacity, then one could, as you write, use some memory striping to push towards faster speeds. Perhaps that's what the AMD AM3 socket did with dual-ported RAM access?

        And why RAM striping on the motherboard isn't widely used is kind of counter-intuitive. And yeah, all this virtualization stuff is great, but context switching is not saving resources. Nor does there seem to be any intentional design to make all the parts work together efficiently; a simple example is the propagation latency between all cache levels, without any ability for software to hint that memory X should be taken from motherboard RAM directly, etc.
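        A toy sketch of that "work within the cache" model; the cache size and tile arithmetic here are illustrative assumptions, not tuned numbers:

```python
# Process a large array in tiles sized to an assumed L2 cache, so that
# repeated passes hit cache instead of going back out to main RAM.

L2_BYTES = 256 * 1024           # assumed L2 size
ITEM_BYTES = 8
TILE = L2_BYTES // ITEM_BYTES   # items that plausibly fit in L2

def process_in_tiles(data, fn):
    """Apply several passes of fn to each cache-sized tile before
    moving on, keeping the working set 'hot' between passes."""
    out = []
    for start in range(0, len(data), TILE):
        tile = data[start:start + TILE]
        for _ in range(3):              # multiple passes while the tile is hot
            tile = [fn(x) for x in tile]
        out.extend(tile)
    return out

print(process_in_tiles(list(range(5)), lambda x: x + 1))  # -> [3, 4, 5, 6, 7]
```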

    • (Score: 2) by davester666 on Sunday March 13 2016, @08:44AM

      by davester666 (155) on Sunday March 13 2016, @08:44AM (#317580)

      Not surprising. They are using one CPU with more cores vs. two CPUs with fewer cores per CPU. Fewer CPUs mean fewer CPU support chips to access memory, network, disk, etc.

      • (Score: 2) by bitstream on Sunday March 13 2016, @01:04PM

        by bitstream (6144) on Sunday March 13 2016, @01:04PM (#317614) Journal

        Fewer CPUs also means that parts of the processing have to be done elsewhere. So the simplified question is whether it's more efficient to have one CPU with two cores in one machine vs. two machines, each with its own single-core CPU. The article seems to suggest the latter, which makes one question the usefulness of the multi-core CPU paradigm.

  • (Score: 0) by Anonymous Coward on Sunday March 13 2016, @02:40AM

    by Anonymous Coward on Sunday March 13 2016, @02:40AM (#317531)

    Yeah, two boxes with single cpu each will be more efficient than a single box with two cpus. WTF.

  • (Score: 2) by linkdude64 on Sunday March 13 2016, @04:08AM

    by linkdude64 (5482) on Sunday March 13 2016, @04:08AM (#317551)

    "Link your Intel vPro and Facebook user accounts to make logins a breeze!" "MOTD: OMG Did you see Clinton's hair the other night? #TheBomb" drawing a blank on any others cause I just woke up, but damn, this is a match made in Hell. The techno-oligarchy is really coming into form.