ARM will replace the big.LITTLE cluster design with a new one that allows up to 8 CPU cores per cluster, different types of cores within a cluster, and anywhere from one to many (unlimited?) clusters:
The first stage of DynamIQ is a larger cluster paradigm - which means up to eight cores per cluster. But in a twist, there can be a variable core design within a cluster. Those eight cores could be different cores entirely, from different ARM Cortex-A families in different configurations.
Many questions come up here, such as how the cache hierarchy will allow threads to migrate between cores within a cluster (perhaps similar to how threads migrate between clusters on big.LITTLE today), even when cores have different cache arrangements. ARM did not yet go into that level of detail; however, we were told that more information will be provided in the coming months.
Each variable core-configuration cluster will be a part of a new fabric, which uses additional power saving modes and aims to provide much lower latency. The underlying design also allows each core to be controlled independently for voltage and frequency, as well as sleep states. Based on the slide diagrams, various other IP blocks, such as accelerators, should be able to be plugged into this fabric and benefit from that low latency. ARM cited elements such as safety-critical automotive decisions as things that can benefit from this.
A tri-cluster smartphone design using 2 high-end cores, 2 mid-level cores, and 4 low-power cores could be replaced by one that uses all three types of core in the same single cluster. The advantage of that approach remains to be seen.
More about ARM big.LITTLE.
Their high end chips had an old ARM (7?) for low level stuff, an ARM9 driving the phone, and an ARM11 for apps. All on 1 chunk of silicon. Haven't been there in 7-8 years, dunno what's in their new chips.
That design does not seem very generalized.
The workload should be able to shift to different cores if a significant number of ARM cores are destroyed within a Borg vessel.
The radio is always kept isolated to prevent the insecure Android side from possibly being able to get at the physical interface of the radio. Putting in an entirely separate CPU, RAM and flash, often with only a serial link to the main CPU, is secure, especially since the radio processor tends to only boot signed images. Some get cheap and use the newer ARM CPUs' ability to partition off a really secure section of memory and run in a super better-than-ring-0 mode, but I bet the FCC doesn't like it and makes that known. The WiFi is the same way: dedicated signed firmware on a dedicated CPU, usually connected by an internal USB link. Because those radios are basically a software defined radio that is physically capable of all sorts of fun things... IF we could get our hands on them. The FCC ain't having none of that.
It really is crazy how many processing units a phone can stuff in. My old crappy Tegra3 based phone has four fast ARM cores, one slow ARM core, and one ARM "AVP" core as a co-processor (it is actually the boot processor: it does the secure boot stuff and starts the main one, then idles with its own dedicated 256K block of static RAM to help, along with yet another undocumented specialty processing unit, play media files and do sleep/wake, etc.). Then there is a crypto processor that NVidia won't document in the tech manual, a couple of GPU cores, and the radio, which is on an entirely different chip made by Intel with dedicated RAM/flash. Same for BT, GPS and NFC: they have a small CPU in them, type unknown, and there is even a little one in the SIM card. It truly is amazing the computing plenty we take for granted.
I think they are still making those, they are used in a couple of the $150-$200 BLU phones I've been looking at as well as some Alcatel One Touch models. IIRC the new ones are octocores and have 2 of the ARM 7s for low power tasks like checking email with the screen off, 2 of the ARM 9 for phone tasks, and 4 of the ARM 11s for the apps. Pretty impressive if you ask me.
General-purpose programming languages cannot keep up with these special-purpose designs; have you read the latest C++/C thread model? It makes no sense! It's as understandable as Quantum Mechanics.
Add to this increasing complexity the looming specter of proprietary computing "fabrics", and there's just no room for anything but massive, monopolistic, top-down, dictatorial, walled-garden, magical corporate overlording.
Personal Computing is dead.
You gotta learn to app! Only apps matter. You also need to let go of control and also IoT more.
Don't think about language first. Think in terms of Map and Reduce frameworks. Look for problems that can be decomposed into much smaller pieces. Ideally the class of problems that are "embarrassingly parallel". For example, computing each pixel of an image of the Mandelbrot set. Or any other pixel computation where each pixel is computed independently of its neighbors (e.g., 3D rendering, many Photoshop filters, maybe video encode / decode). Or if the setup/teardown overhead is too high for computing a single pixel, then divide the image into blocks of pixels, each of which is computed iteratively. For example, break a 4096x4096 image into 256x256 pixel blocks (256 blocks in all) and treat each block as a fundamental problem element.
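To make the block idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the function names, the iteration cap, and the region of the complex plane are made up, and the `mapper` argument just stands in for whatever "map" your framework provides. Each block is computed without looking at any other block, and the only shared step is stitching the results back together.

```python
from itertools import product

MAX_ITER = 50  # assumed iteration cap, chosen only for illustration


def escape_time(cx, cy, max_iter=MAX_ITER):
    """Iterations before z = z*z + c exceeds |z| > 2 (max_iter if it never does)."""
    z = 0j
    c = complex(cx, cy)
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter


def make_tiles(width, height, tile):
    """Split the image into (x0, y0, w, h) blocks; edge blocks may be smaller."""
    return [(x0, y0, min(tile, width - x0), min(tile, height - y0))
            for y0, x0 in product(range(0, height, tile), range(0, width, tile))]


def compute_tile(job):
    """Map step: compute one block independently of all the others."""
    (x0, y0, w, h), width, height = job
    rows = []
    for y in range(y0, y0 + h):
        cy = -1.5 + 3.0 * y / height  # pixel row -> imaginary axis [-1.5, 1.5)
        rows.append([escape_time(-2.0 + 3.0 * x / width, cy)  # column -> real axis [-2, 1)
                     for x in range(x0, x0 + w)])
    return (x0, y0), rows


def render(width, height, tile, mapper=map):
    """Reduce step: stitch the independently computed blocks into one image."""
    image = [[0] * width for _ in range(height)]
    jobs = [(t, width, height) for t in make_tiles(width, height, tile)]
    for (x0, y0), rows in mapper(compute_tile, jobs):
        for dy, row in enumerate(rows):
            image[y0 + dy][x0:x0 + len(row)] = row
    return image
```

With the default `mapper=map` this runs serially. Passing the `map` method of a process pool (for instance `multiprocessing.Pool.map`) should spread the same blocks over however many cores are available without touching `compute_tile`, which is the whole point of decomposing the problem this way.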
Don't look at C / C++. Look at higher level languages. Examples: Clojure, Erlang, and others. The overhead of the runtime substrates for the higher level languages is not as important as the need to be able to easily scale the problem by simply throwing more CPUs at it. If my high level language solution is ten times slower than your C++ code, but I can trivially just throw more CPUs at my solution to scale up to any level I please, then I have a winner. You shouldn't even be thinking in terms of C++ thread models. Think "Forbidden Planet": that machine is going to provide whatever amount of power your monster needs to achieve its goals.
The problem needs to be decomposable into small parts. Ideally very small, hence "embarrassingly parallel" like many pixel based problems. As a counter example, I would offer the problem Spock asked of the Enterprise computer: compute to the last digit the absolute value of Pi. The idea was that more and more "banks" of the computer would work on the problem until capacity was exhausted. I have difficulty imagining how this particular problem could achieve that, since it is not obvious to me how the computation of an infinitely long Pi could be done in parallel.
The understanding of the details on which those high-level frameworks are built is slowly being locked away in the walled gardens of giant corporations. The only way to solve problems will be to do so within their set of concepts, because it will be too complex to reverse-engineer just what's going on.
The user is being pushed to increasingly higher levels of abstraction (as you note), which are attached to reality through carefully guarded industry secrets. The world of computing is ever more magical.
I have an idle long-term plan for that: Build an auditable computer from scratch. Would probably take decades though.
It would involve fuse ROMs programmed through CRC protected toggle switches. Then using those ROMs to build peripherals like keyboards and monitors that you can trust.
Would involve code correctness proofs as well. I am hoping that as complexity goes up, the formal proofs will greatly reduce debugging time.
Goes off to start dreaming for reals.
Which is great if your program only needs to run a few times. Otherwise, if it runs ten times slower, it requires ten times the electricity and ten times the data center capacity. So many people make that mistake, deploying scripting and other toy/fad/academic languages into production, and only when crunch time comes realize that the survival of the company now depends on replacing that hot mess with real code before the hosting bills from chasing the load bankrupt them or the error messages start chasing off the users who are finally swarming in. Remember the fail whale; don't be Jack. They barely survived the mistake.
If you think that's bad, consider what systemd will do on that platform. On odroid (arm) systemd is not able to reliably close the network connection before rebooting, so the ssh client is left waiting half of the time.
If you do industrial embedded designs, you customize your tasks and scheduler to properly use the right cores. big.LITTLE and similar schemes are great.
If you do general computing, good luck figuring out the universal rule for Extremely Varied Apps With Greedy Coders ("I'll run the clock on the A72 so my user doesn't feel any lag").
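For the embedded case, "use the right cores" can be as blunt as pinning a task to a fixed set of CPUs. A minimal sketch, assuming Linux (it relies on `os.sched_getaffinity` and `os.sched_setaffinity`) and an entirely hypothetical core numbering; which CPU ids are the little or big cores depends on the SoC, and `pin_current_process` is a made-up helper name:

```python
import os

# Hypothetical core numbering: which CPU ids are "little" or "big" depends on the SoC.
LITTLE_CORES = {0, 1, 2, 3}
BIG_CORES = {4, 5, 6, 7}


def pin_current_process(preferred):
    """Restrict this process to the preferred cores that are actually usable.

    Returns the affinity set in effect afterwards. Linux-only.
    """
    allowed = os.sched_getaffinity(0)   # 0 means "the calling process"
    target = allowed & set(preferred)
    if not target:                      # none of the preferred cores are usable here
        return allowed                  # leave the affinity unchanged
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)
```

A background logging task might call `pin_current_process(LITTLE_CORES)` at startup, while a latency-critical control loop asks for `BIG_CORES`. That only works because the embedded designer knows every task in the system, which is exactly what the general-computing case lacks.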
I considered snarking on big.LITTLE... but it's practically a marketing term, to the extent that smartphone buyers have actually heard of it.
Imagine a Beowulf ARM cluster.
Sir, your imagination is limited.
Imagine a Beowulf cluster of Beowulf clusters.
It's Beowulf clusters all the way down. An infinite recursion. The final cluster of that infinite fuster cluck is built on ARM chips.
Like this [clusterhat.com] or like this [networkworld.com]?
Dumb jokes never die, they just migrate to other websites.