My Ideal Processor, Part 2
(This is the 48th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
Did the Chinese government invent an infinitely scalable processor switch fabric, or are the figures from Chinese super-computers a complete fabrication? Indeed, if the Chinese super-computer switch fabric is so good, why does China continue to use Cisco switches for the Great Firewall of China, and why does China intend to use Cisco switches to connect its entire population?
The conventional super-computer design is to have 2^(3n) nodes in a three-dimensional grid; for example, 4096 nodes in a 16×16×16 grid. Each node is connected to three uni-directional data loops. Ignoring network congestion, the round-trip time between any two nodes in a loop is constant, the round-trip time between any two nodes in a plane is constant, and the round-trip time between arbitrary nodes is constant. In this arrangement, nodes have direct access to each other's memory and it is fairly easy to implement a memory interface which provides equal access to three network links and a local node.
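The constant round-trip claim is easy to check for a single loop: in a unidirectional ring of L nodes, the hop count from a to b is (b − a) mod L, so any round trip a → b → a costs exactly L hops. A minimal sketch, assuming the 16-node loops of the 16×16×16 grid above:

```python
# Round-trip cost in one unidirectional data loop of a 16x16x16 grid.
# The 16-node loop size is the grid dimension from the example above.

L = 16  # nodes per loop

def one_way_hops(a, b, size=L):
    # Hops from node a to node b travelling one way around the loop.
    return (b - a) % size

def round_trip_hops(a, b, size=L):
    # Total hops for a -> b -> a.
    return one_way_hops(a, b, size) + one_way_hops(b, a, size)

# Every distinct pair costs exactly L hops round trip, i.e. constant time.
costs = {round_trip_hops(a, b) for a in range(L) for b in range(L) if a != b}
print(costs)  # {16}
```

The same argument applies per dimension, which is why the plane-to-plane and grid-wide round trips are constant as well.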
The rate of throughput is obscene. An Intel Xeon processor may have up to 32 cores and each core has two-way hyper-threading. Each thread may consume up to five bytes of instruction per clock cycle and the clock is 4GHz. That's a peak instruction stream execution rate of 1280GB/s per node; for a 4096 node cluster, that's more than 5PB/s. Memory addressing is also special: 12 or more bits of an address are required merely to specify the node number. With 4GB RAM per node, 32 bit addressing covers local RAM; add another 12 bits for the node number and physical addresses are 44 bits wide.
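The arithmetic above can be checked directly. This sketch uses the article's assumed figures (32 cores, two threads per core, five bytes per cycle, 4GHz), not any particular Xeon datasheet:

```python
# Back-of-envelope check of the throughput and addressing figures.

cores, threads_per_core = 32, 2
bytes_per_cycle, clock_hz = 5, 4_000_000_000

per_node = cores * threads_per_core * bytes_per_cycle * clock_hz
print(per_node / 1e9)           # 1280.0 -> 1280GB/s per node

nodes = 4096
print(per_node * nodes / 1e15)  # 5.24288 -> roughly 5PB/s for the cluster

# Addressing: 4GB per node needs 32 bits; 4096 nodes need 12 more.
local_bits = (4 * 2**30 - 1).bit_length()  # 32
node_bits = (nodes - 1).bit_length()       # 12
print(local_bits + node_bits)   # 44-bit physical addresses
```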
This arrangement is highly symmetric, highly fragile and rapidly runs into scalability problems. And yet, China just adds a cabinet or two of processor nodes and always retains the world super-computer record. Or adds GPUs to a subset of nodes. (Where does the extra memory interface come from?) Is China using a partially connected, two local, six remote hypercube topology? Is there any known upper bound for this switch fabric, which is at least five years old?
Assuming China's claims are true, it is possible to make a heterogeneous super-computer cluster with more than 16000 nodes and at least 1Gb/s of bandwidth per node, but such a machine cannot have any significant MTBF. Even the cross-sectional area of the first level caches of 16000 MIPS processors creates a significant target for random bit errors. Likewise for ALUs, FPUs and GPUs.
I investigated redundancy for storage and processing. The results are disappointing because best practice is known but rarely followed. For storage, six parity nodes in a cluster is the preferred minimum; four is rare, and zero or one is the typical arrangement. For processing, anything beyond a back-check creates more problems than solutions. Best-of-three is great but it triples energy consumption and the rate of processor failure. Storage is mirrored at the most abstract level and that may be on different continents; at the very worst, it will be on separate magnetic platters accessed via separate micro-controllers. With processing, redundancy is on the same chip on the same board with the same power supply.
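The parity idea can be sketched with the simplest possible case: a single parity node holding the XOR of every data node's block, which is enough to rebuild any one lost node. The three data blocks here are illustrative; real clusters reach the six-parity figure with Reed-Solomon style erasure codes rather than plain XOR:

```python
# Minimal single-parity sketch: the parity node stores the XOR of all
# data blocks, so the survivors can reconstruct any one lost block.

from functools import reduce

def xor_blocks(blocks):
    # XOR a list of equal-length byte blocks together.
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks))

data_nodes = [b"\x01\x02", b"\x10\x20", b"\xff\x0f"]  # illustrative blocks
parity = xor_blocks(data_nodes)

# Node 1 fails; rebuild its block from the other data nodes plus parity.
survivors = [data_nodes[0], data_nodes[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_nodes[1])  # True
```

Each extra parity node tolerates one more simultaneous failure, which is why six is a far stronger position than the zero or one typically deployed.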
So, for processing, best practice is a twin processor back-check, like the mainframes from the 1970s. For a super-computer cluster, every node should participate in a global check-point and every computation should be mirrored on another node. Errors in parallel computation propagate extremely fast and therefore if any node finds an error, all nodes must wind back to a previous check-point. Checks can be performed entirely in software but it is also possible for a processor with two-way hyper-threading, with two sets of registers and also two ALUs and two FPUs, to run critical tasks in lock-step and throw an exception when results don't match.
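The software variant of this scheme can be sketched as follows: execute each step twice, commit the result as a new checkpoint only when both copies agree, and wind back to the last checkpoint on a mismatch. The step function and the injected bit flip are illustrative assumptions:

```python
# Sketch of a software back-check with checkpoint rollback: every step
# runs twice, and a mismatch forces a redo from the last agreed state.

import copy

def step(state):
    # Illustrative unit of work: one deterministic computation step.
    return {"x": state["x"] + 1}

def lockstep_run(n_steps, fault_at=None):
    checkpoint = {"x": 0}  # last state both copies agreed on
    i = 0
    retries = 0
    while i < n_steps:
        a = step(copy.deepcopy(checkpoint))
        b = step(copy.deepcopy(checkpoint))
        if i == fault_at and retries == 0:
            b["x"] ^= 1        # transient single bit flip in one copy
        if a != b:
            retries += 1
            continue           # mismatch: wind back, redo from checkpoint
        checkpoint = a         # copies agree: commit the new checkpoint
        i += 1
    return checkpoint, retries

final, retries = lockstep_run(5, fault_at=2)
print(final["x"], retries)  # 5 1
```

In the cluster-wide version the checkpoint is global, so a single node's mismatch winds back every node; the hardware lock-step variant merely moves the comparison from software into the duplicated register sets and execution units.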
Now that I've considered it in detail, it is apparent to me that Intel has never been keen on anything beyond two-way hyper-threading. I just assumed Intel was shipping dark threads to facilitate snooping. ("Hey! You've actually got eight-way hyper-threading but we use six to compress video of you fapping when you think your webcam is off.") But perhaps selected customers get a reliable mode and us plebs get the partial rejects without scalability.