A terrible event has happened. The person who inspired me to use audio amplifiers as cheap motor controllers is very likely to be dead. He may be younger than 40 and it may be suicide. All of his family live in another country and they may have missed his funeral. If he is not dead then it is possible that he is stuck in a mental asylum for an extended period. At best, he may get out of this situation with no pets, no possessions and no home.
Either way, I blame his girlfriend for the situation. He considered their relationship to be very exclusive but she treated it in a much more casual manner. I believe that she was unfaithful before I met either of them. She may think that her boyfriend was unfaithful when he was just doing geek things with other people. That includes the time we spent together making minimal progress with robots.
She had easy access to drugs and she may have chemically messed with his sanity. Apparently, the police concluded their investigation without suspicion and gave his phone to her. This would be consistent with gaining all of his possessions from a will or gaining power of attorney.
After writing software to make bitmaps for origami cubes and then making dozens of random designs, including cartoon characters, celebrities, memes (some from shock-sites) and fancy dress costumes (there's no shortage of that on the Inter-tubes), a friend asked if I could make designs which join at the edges. For example, a continuous pattern of water waves or a cube projection of a map of Earth. Actually, my first suggestion was placing Dress-Up Jesus onto the net of a cube. You might think that is blasphemous but the comedian Bill Hicks asked "You think when Jesus comes back, he really wants to see a cross? That's like going up to Jackie Onassis with a rifle pendant on." On that basis, what do you think his reaction would be to an organized religion which uses it as its primary symbol?
Anyhow, I've been working on bitmap designs which join on all four edges. Examples include:-
I may attempt to convert a Mercator projection of Earth to an origami cube. Obtaining the sections around the equator is easy. Just take the mid section of the map and divide it into four strips. The arctic regions only require a little more work via GIMP's polar co-ordinate distortion dialog box. Unfortunately, sections may have to be rotated and shuffled to ensure that they assemble correctly. However, this doesn't require additional software and, for the final step, inputs don't have to be square or the same size.
I'm going to be off-line until next year but if anyone wants to troll, feel free to send a suitable bitmap and origami folding instructions to the Flat Earth Society.
In the spirit of bug fixes before features, in the origami cube program:-
rotate(im0,im5,3);
should be changed to:-
rotate(im0,im5,1);
This does not affect designs with two-way rotational symmetry. Nor does it adversely affect designs which have similar colors in opposite corners.
(This is the 57th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
The quest for faster computing and reduced energy consumption may lead to widespread use of optical computing. This has been predicted since at least 1971 but hasn't occurred, due to numerous difficulties.
The most significant difficulty is the scale of integration. While it is possible to manufacture electronic transistors at 22nm or (various exaggerations of) 15nm, the use of infra-red lasers limits optical computing to a theoretical minimum feature size of 600nm. Use of gallium in one or more guises may reduce this to 380nm. Significantly reducing this limit would require development and safe handling of extremely small X-ray sources or similar. As a matter of practicality, I'm going to assume that optical computing is powered by a 410nm gallium blue laser diode.
While etching of electronic transistors has received considerable funding and development, optical computing competes at an increasing disadvantage. If optical circuits are manufactured using two dimensional etching then we have the problem that optical processors with 100 million transistors will not be possible unless the optical substrate is very large, or is manufactured in very small pieces which are then stacked and connected in a manner which has not been developed. Alternatively, some form of holographic etching may have to be developed.
I'm going to assume that an optical CPU uses no more than 10000 gates. At this point, we're at the same scale of integration as 1970s CPU designs. An optical CPU may run at 20THz but it will have no more gates than a Z80 and scope to increase gates may be very limited. For example, electronic inter-connections may be significantly slower than optical connections. We may have the foreseeable situation of an optical 8086 (or similar) emulating a much more recent iteration of x86. This would include optical emulation of cache tiers and wide registers. Specifically, the old wisdom of processing data in 4 bit chunks or 16 bit chunks may be re-established as a matter of necessity.
Emulation will allow optimizations which are infeasible in hardware. For example, if four-way SIMD is used to calculate three dimensional co-ordinates, an emulator can peep ahead in an instruction stream and see which pieces are used before a register is over-written. Then it is possible to only calculate the three pieces required and then leave 1/4 of a register unchanged. In hardware, this would require more circuitry and energy than it would save. In software, this could make a program run faster while saving energy.
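As a minimal sketch of such an optimization, assuming a hypothetical emulator with four-way 32 bit SIMD registers (the helper lanes_live_after is invented here and merely stubbed), the dead lane is simply never computed:-

#include <stdint.h>

typedef struct { uint32_t lane[4]; } vreg;

/* Stub: a real emulator would scan the instruction stream from pc
   and return a bitmask of the lanes of register dst which are read
   before dst is next over-written. For three dimensional
   co-ordinates, typically only three of four lanes are live. */
static unsigned lanes_live_after(const uint8_t *pc, int dst)
{
    (void)pc; (void)dst;
    return 0x7;                        /* x, y, z live; w dead */
}

static void emulate_vadd(vreg *r, int dst, int a, int b, const uint8_t *pc)
{
    unsigned live = lanes_live_after(pc, dst);
    for (int i = 0; i < 4; i++)
        if (live & (1u << i))          /* skip dead lanes entirely */
            r[dst].lane[i] = r[a].lane[i] + r[b].lane[i];
    /* Dead lanes keep their old contents. In hardware, this masking
       costs circuitry and energy; in software, it saves both. */
}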
I presume that main memory will remain electronic. This would maintain expected storage density. However, DRAM may be superseded by persistent memory.
Generic, multi-core designs will become increasingly desirable. For neural simulation, decompression of dendrite weights and subsequent floating point calculations will all be performed by small integer processors. The dabblers who make their own image format, filing system or init system will move to instruction set design. This will create a profusion of incompatible applications, compilers and execution environments which will make Android look quaint. Despite increasing I/O bottlenecks, theoreticians will continue to advocate pure message passing.
I have a friend who is convinced that Gallium is not an element in the periodic table and is a planet in a science fiction series. I understand his concern because I'm quite convinced that chives is a medical condition. ("Have you got chives?" "No, it's just the way I walk.")
Anyhow, enough of this silliness. I've been researching gallium nitride. Actually, I don't think it would be a huge revelation to say that I've been working on software to implement a network protocol on low-power, low-cost devices and the primary purpose of the system is to securely and reliably control and monitor perimeter security, beer brewing and hydroponics. Well, it is mostly about hydroponics and that's where I've found multiple references to gallium nitride.
Blue LEDs have become fairly ubiquitous. Well, it is a slightly purple type of blue which is gallium nitride's 410nm spectral peak. However, now that experimental LEDs exceed 50% efficiency and manufacturing quality has improved, blue LEDs have a particularly piercing quality. In particular, lights on emergency vehicles have become blindingly bright. A friend noted that lights used by traffic police have become dangerously bright and that if anyone needed the lights so bright then they shouldn't be driving.
Laser diodes and other spectral peaks are also widely used. Green LEDs have switched to exploiting a different spectral peak of gallium compounds. That accounts for the switch from a moderate green to a slightly blue type of green. The short wavelength from blue laser diodes allows more data to be retrieved from optical disc storage. Ultra-violet LEDs are used for forgery detection and hair removal.
LED distribution forms a geographic monopoly. Nichia in Japan, Philips in Europe and Cree in North America are market leaders. Of these, Cree is significantly smaller but you'd never guess from their extensive advertising. Hydroponic innovation occurs disproportionately in arid parts of North America. However, when top tier hydroponic lights contain top tier components from the local geographic monopoly, it leads to co-branding which disadvantages any competitor outside of North America.
However, the development of blue LEDs has a checkered (colorful?) history. After many years of development at the direction of Nichia's founder, and continuing without support, Shuji Nakamura was the primary recipient of the 2014 Nobel Prize For Physics. He also got a 20000 Yen (about US$200) bonus for his work. Most of my knowledge about Japanese business culture comes from anime, gangster films and Niall Murtagh's book The Blue-Eyed Salaryman. From this, I immediately knew that a 20000 Yen bonus was insultingly small. People get more than this for improving TPS Reports. He sued, won the largest bonus in Japan and then lost most of it when his employer appealed. After that, he started his own company and began working on gallium nitride on gallium nitride. After he completed the triple (Triad?) of red, green and blue LEDs and received a Nobel Prize for his efforts, I thought this was just a chemist's scientific investigation. However, after seeing electron microscope pictures of gallium nitride on a silicon substrate, I fully appreciate the quest to continue development. My reaction was "Well, that's like trying to turf the White Cliffs Of Dover." After gallium nitride has been applied to a silicon substrate, the cracks are deep and craggy. The mis-match in atom sizes causes the silicon to shear apart. Of course, this wouldn't happen if doped gallium nitride could be applied to a substrate of gallium nitride. That's currently under development.
My investigation came after attempting to switch 300W of LEDs at 48VDC with MOSFETs. This shouldn't be a huge challenge. 1000W MOSFETs are available and 600W LED systems are available. So, 300W is entirely reasonable. However, V = IR and P = IV. Therefore, P = I²R. I performed a calculation to determine the heat that would be dissipated by my chosen MOSFETs and found that I required a heatsink which would dissipate 18W. That would make MOSFET switching about 94% efficient. It also required an elaborate arrangement for heatsinks or a sacrificial area of circuit board to work as a less effective heatsink. From my calculations, I found that I could improve efficiency by increasing voltage and decreasing current. Unfortunately, 48V is at the limit of wiring regulations which allow LEDs to be exposed. Higher voltage would reduce efficiency because LEDs would have to be placed behind glass or plastic. This could also be a moisture trap.
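The arithmetic can be checked with a few lines of C. The on-resistance below is back-derived from the 18W figure and stands in for a real datasheet R_DS(on):-

#include <stdio.h>

int main(void)
{
    double load_w   = 300.0;  /* LED load */
    double supply_v = 48.0;   /* limit for exposed LED wiring */
    double rds_on   = 0.46;   /* ohms; assumed, check the datasheet */

    double i    = load_w / supply_v;   /* 6.25 A */
    double loss = i * i * rds_on;      /* P = I*I*R, ~18 W of heat */
    double eff  = load_w / (load_w + loss);

    printf("current %.2f A, loss %.1f W, efficiency %.1f%%\n",
           i, loss, 100.0 * eff);
    return 0;
}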
I remember that Google co-ran a competition a while back where the objective was to do efficient power conversion. Apparently, two teams exceeded the specification by a factor of three. I didn't understand the details at the time but I've now learned more about MOSFETs. The general approach seems to be use of an asymmetric "five legs topology" to perform switching. I presume this works like a Commodore 64 PSU's voltage sense output. This allows a Commodore 64 to automatically determine if it is running from 50Hz or 60Hz mains electricity. From this, it can automatically output a PAL or NTSC signal. The Google Little Box Challenge winners seem to be using something similar (plus a neural network) to optimize switching times. They also use counter-wound coils, capacitors which change value under load and gallium nitride FETs.
Oh. It might be possible to raise the switching efficiency of gallium nitride LEDs by using gallium nitride transistors. This appears to be a material of the future.
The best introduction on this topic (for someone already familiar with gallium nitride LEDs and MOSFETs) was from (the rather plucky) Efficient Power Conversion's Application Note AN002: Fundamentals Of Gallium Nitride Power Transistors by Stephen L. Colino and Robert A. Beach, PhD which explains the manufacture and characteristics of the EPC1001 and EPC1010 GaN FETs. These are just example products and they'll make anything you want. They freely admit that they're bootstrapping from the existing infrastructure. That means all of the products use a silicon substrate. From Shuji Nakamura's work, that's known to be highly inefficient. However, within these constraints, they'll modify their manufacturing process to meet the characteristics of your choice. Specifically, "Where this does not allow compliance with safety agency creepage distance requirements, underfill can be used." If you're willing to use this experimental technology then manufacturers are willing to adapt products to your requirements.
I only investigated this because I thought that 94% efficiency was unreasonable. However, I accidentally found something which may solve multiple problems. Specifically, EPC's Application Note AN002 suggests that GaNFETs can be used in D-Class Amplifiers. That means GaNFETs can be used for audio amplifiers, quadcopters and robots. (My current choice for quadcopter control is misuse of a TDA7379 audio amplifier. That's about 95% efficient and requires a heatsink. Raising the efficiency would reduce or eliminate a heatsink. This would be particularly welcome for a quadcopter.)
GaNFETs are already suitable for switching domestic mains electricity at 1MHz. Indeed, GaNFETs outperform MOSFETs in all characteristics except gate leakage current - and I suspect this deficiency will be resolved when gallium nitride on gallium nitride is resolved by a Nobel Physicist or one of his rivals. Even without this, there is talk among experts of GaNFET switching at 1GHz or even 1THz. I foresee GaNFETs, CRT [Chinese Remainder Theorem], PWM [Pulse Width Modulation] and/or SDR [Software Defined Radio] converging. That would allow micro-controllers to shape and modulate radio waves over a very wide range of frequencies while only using one leaky power component. The leak will get fixed and the switching frequencies will continue to rise. Within 30 years, it may be possible to identify a person's sex and race from the resonant frequencies of their DNA. Within another 30 years, it will be possible to accurately diagnose illness by bringing a device close to a patient. That sounds very much like a medical tricorder. That would be great if it was only used for good but there was a proposal to make a smart bomb which only triggers when it is within range of one or more passports with RFID chips of a chosen nationality. A similar result can be obtained with facial recognition and, one day, it will also be possible by remotely sensing DNA.
From my reading of Nexus Magazine, Volume 24, Issue 6 and Nexus Magazine, Volume 25, Issue 1 (current issue), EMF [Electro-Magnetic Fields], Wi-Fi and dirty mains are all very unpopular concepts. I think they'd be even less impressed about MIMO phased-array beam-steering.
EMF is unavoidable. Wi-Fi is definitely avoidable. (Thank you, super-powers, for letting us plebs use frequencies which are useless for long-range military applications, such as 2.45GHz [a resonant frequency of water] and 60GHz [a resonant frequency of oxygen].) Dirty mains is a real problem. Very few people can design power switching circuitry properly. I can't do it. However, I'm not under time pressure and I have the luxury of asking questions. The typical scenario is a dabbling amateur who is an "expert" within a company. Under time pressure, the "expert" designs some square wave switching circuitry. That wastes about 30% of the energy but, hey, the customer rewards first-to-market and the customer pays for the externalities. The design might be sent to a standards laboratory which has commercial incentive to obtain repeat business. A modified design might be made to specification by a manufacturing sub-sub-contractor. After a device is stocked in a warehouse, it will get bashed around by a next-day delivery courier who competes on speed and price. Then it will get further abuse during daily use. The 30% excess energy radiates along any available cables. Historically, the power and frequency range of dirty signals were relatively capped. But with GaNFETs, it is now possible to switch thousands of Watts at unprecedented frequencies. And software control of GaNFETs is a particular problem if software can be compromised. From Edward Snowden's documents and other sources, we know that:-
I don't know enough to propose a solution. However, I know that one of the regular advertisers in Nexus Magazine won't help you. The Polarix Disc is a rather pretty two inch diameter circle of single-sided, copper-clad fiberglass etched with concentric hoops. However, its main effect appears to be the transfer of US$30.
(This is the 56th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
There is a computer problem which was known to Charles Babbage when he was designing his unsuccessful Analytical Engine. Most impressively, he had a solution.
When adding large numbers (in decimal, binary or any other base), there are a large number of carry operations from one column of numbers to the next. In the worst case, every column of interest is interlinked with its neighbors. This leads to the phenomenon of ripple carry where propagation of carry requires a large number of iterations. For example, a decimal odometer may occasionally tick over from xxxxx999 to xxxxy000. For a mechanical system, this occurs synchronously where there is sufficient force to move every digit. However, even this has limitations.
Carry is no more or less likely to occur in a binary machine. However, a binary machine requires the most digits to represent a given number and therefore the worst case has the most impact. In any base, for each column of addition, there are two numerical inputs, one numerical output, one carry in and one carry out. In any base, carry is zero or one. For the least significant digit, carry in is always zero. However, this only skews the average case in our favor and does nothing to prevent the worst case.
For any base b, there are b^2 permutations of digits to add. Within a given range of precision, digits may occur with fairly equal probability. (This is particularly true for floating point operations.) Taking into account carry input, there are 2(b^2) inputs for each column of addition. For a digit of decimal addition, there are 200 permutations of input. For base b, where carry in is zero, there are b(b-1)/2 outputs with carry out of one. In decimal, this is 45 permutations with carry out of one. For base b, where carry in is one, there are b(b+1)/2 outputs with carry out of one. In decimal, this is 55 permutations with carry out of one. In total, exactly 100 of the 200 input permutations produce a carry.
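These counts are easy to verify by brute force, enumerating every digit pair and carry input for a given base:-

#include <stdio.h>

int main(void)
{
    int b = 10, carry0 = 0, carry1 = 0;
    for (int x = 0; x < b; x++)
        for (int y = 0; y < b; y++) {
            if (x + y     >= b) carry0++;  /* carry in of zero */
            if (x + y + 1 >= b) carry1++;  /* carry in of one  */
        }
    printf("base %d: %d + %d = %d of %d permutations carry\n",
           b, carry0, carry1, carry0 + carry1, 2 * b * b);
    return 0;  /* prints: base 10: 45 + 55 = 100 of 200 permutations carry */
}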
This was known to Charles Babbage and he devised a mechanical system to accelerate this process. Although Charles Babbage had difficulty obtaining funding and difficulty manufacturing an Analytical Engine, this system has been partially implemented in metal and a Difference Engine was made in Lego Technic.
When the scale of integration was low, mini-computers typically used 74 Series logic chips. One of these designs is a 4 bit full-adder. It accepts 4 bits from one source, 4 bits from another source, has carry in, carry out and provides four bits of data out. Of course, exactly half of the input permutations produce a carry out of one.
Historically, propagation from carry in to carry out would take more than 70 nanoseconds and less than 10ns of this propagation delay occurred due to the silicon chip anti-static protection circuitry when a signal went off of a chip. Anyhow, adding 16 bit numbers could take more than 280ns. Addition of 64 bit numbers was infeasibly slow. It also required circuitry which was infeasibly large and expensive. It was preferable to work on smaller units and make, for example, addition of 16 bit units, a composable operation via a carry flag in a flag register. On this basis, it was possible to add (or increment) numbers of arbitrary size in a manner which was relatively fast for small numbers, only incurred overhead for the most rare cases, and used available hardware efficiently.
My introduction to this topic came via an unlikely path. Chinese Remainder Theorem can be used as a method to avoid long chains of ripple carry. CRT is very good for counting, addition and subtraction. CRT is exceptionally bad for comparison operations and the Chinese used it to calculate fractions despite this limitation. In the classic arrangement, each number (or fraction of a number) is represented by the remainders of division by five, eight and nine. These three co-prime moduli have a product of 360. (This technique was probably obtained from the Babylonians and even then it may not have been original.) Addition and subtraction can be performed on each digit of the triples. After normalization, a lookup table was used to retrieve common answers. So, for example, 60 + 90 = 150. As triples, (0,4,6) + (0,2,0) = (0,6,6). The classic arrangement is partly a notational problem. Regardless, nine bit computation is reduced to parallel computation requiring no more than four bits. In principle, this can be vastly extended.
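As a minimal sketch of this arrangement, with an exhaustive search standing in for the lookup table:-

#include <stdio.h>

typedef struct { int r5, r8, r9; } triple;

static triple to_triple(int n)
{
    triple t = { n % 5, n % 8, n % 9 };
    return t;
}

static triple add_triple(triple a, triple b)
{
    /* Each position is added independently: no ripple carry
       propagates between the three digits. */
    triple t = { (a.r5 + b.r5) % 5, (a.r8 + b.r8) % 8, (a.r9 + b.r9) % 9 };
    return t;
}

static int from_triple(triple t)
{
    /* A real system would use a lookup table; exhaustive search
       over the range 0..359 is enough for a demonstration. */
    for (int n = 0; n < 360; n++)
        if (n % 5 == t.r5 && n % 8 == t.r8 && n % 9 == t.r9)
            return n;
    return -1;
}

int main(void)
{
    triple sum = add_triple(to_triple(60), to_triple(90));
    printf("(%d,%d,%d) -> %d\n", sum.r5, sum.r8, sum.r9, from_triple(sum));
    return 0;  /* prints: (0,6,6) -> 150 */
}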
Apparently, CRT can be used to measure sonar ping times much more accurately without consideration for long chains of ripple carry. Unfortunately, it does this by pushing a problem elsewhere. Fortunately, there are other techniques to amortize the problem of ripple carry. One technique is to use a vector pipeline. If a vector pipeline has four execution units with staggered timing (like a four cylinder combustion engine) then ripple carry may propagate within units while more data is fed to subsequent units. Unfortunately, in the case of one addition, the majority of the system is unused. Furthermore, no speed advantage occurs.
74 Series provides a carry accelerator chip. The chip itself is composable. Therefore, addition of any size may be arranged as a quadtree where 4 bit full-adders are the leaves and carry accelerators are the branches. So, one tier of branches (one chip) accelerates addition up to and including 16 bits. Two tiers of branches accelerate addition up to and including 64 bits. Three or more tiers are used for larger numbers.
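As a sketch of what each leaf reports to a branch, assuming the usual generate/propagate formulation (loosely modelled on the 74182 lookahead generator), a 4 bit block announces whether it will generate a carry regardless of carry in, or merely propagate a carry in to its carry out:-

#include <stdio.h>

typedef struct { unsigned g, p; } gp;  /* group generate/propagate */

static gp block_gp(unsigned a, unsigned b)  /* a, b are 4 bit values */
{
    unsigned g = a & b;                /* per-bit generate  */
    unsigned p = a | b;                /* per-bit propagate */
    gp out;
    out.g = ((g >> 3)
          | ((p >> 3) & (g >> 2))
          | ((p >> 3) & (p >> 2) & (g >> 1))
          | ((p >> 3) & (p >> 2) & (p >> 1) & g)) & 1u;
    out.p = ((p >> 3) & (p >> 2) & (p >> 1) & p) & 1u;
    return out;
}

int main(void)
{
    /* 1111 + 0000 carries out only if carry in is one. */
    gp r = block_gp(0xF, 0x0);
    printf("generate=%u propagate=%u\n", r.g, r.p);  /* 0 and 1 */
    return 0;
}

A branch node combines the g/p pairs of four children in the same way, which is why the quadtree grows by one tier of delay for every factor of four in operand width.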
Historically, many CPUs had a distinction between data registers and address registers. Likewise, old IBM systems allowed integer and floating point calculations to occur in parallel so that Fortran programs could calculate array indexes and scientific data in parallel. However, with a trend towards general purpose registers, typically with a maximum of a 4 bit operand reference, address computation goes through a general ALU datapath. This may be 128 bits or more - even when an address-space is significantly smaller.
How often does this occur? On a system with eight registers (ARM Thumb, BCM2835 Raspberry Pi with Raspbian and gcc), I spent more than three days getting a bit transpose operation working efficiently and then further effort to get it working portably on AVR and other architectures. The most significant part of the solution was to reduce register pressure by performing address arithmetic. This allowed all input to be read via one register and all output to be written via another register. However, on a CPU with wide registers, incrementing an address pointer may be a slow operation with little opportunity to amortize any delay.
At a higher scale of integration, addition does not have to be performed in 4 bit units. However, stretching carry acceleration from a quadtree to a quintree or similar provides minimal gain. Most annoyingly, 5 bit full-adders and two tiers of 5 bit acceleration only provide 125 bit addition. Likewise, 7 bit full-adders and two tiers of 6 bit acceleration only provide 252 bit addition. It is inescapable that addition for larger numbers is slower. Variants can be devised to match or exceed 128 bit, 256 bit, 512 bit or 1024 bit thresholds but I'm not sure that any provide superior performance to a quadtree. Specifically, if it takes four units of time to add 4 bit numbers then it takes eight units of time to add 16 bit numbers, 12 units of time to add 64 bit numbers, 16 units of time to add 256 bit numbers and 20 units of time to add 1024 bit numbers. Whereas, the 125 bit example requires 15 units of time and the 252 bit example requires 19 units of time. The only advantage I could find was to use two tiers of 3 bit acceleration but this only fills one position between each pair of 4^n bit numbers. For example, 10 units of time to add 36 bit numbers and 14 units of time to add 144 bit numbers.
Timings for large additions may be hidden among instruction decode and a round-trip between registers and ALU. However, a larger problem remains unresolved. Pointer sizes are becoming ridiculously large. I could semi-sensibly use a 144 bit address-space. For example, mmap() of a 4KB block, 128 bit hash reference, Fibonacci deduplicated, sparse file. Then add more bits for virtualization. Why would I do that? Well, I could just "mount" a data-bank within a process. This could be read-only or read/write. And it would be possible to maintain accounting and billing external to the process. Regardless, long file pointers and long address pointers put significant pressure on data caches. 8 byte pointers are common and this may increase to 16 bytes or much more. With address randomization, this provides a modicum of security. It is otherwise a vast hindrance. For a typical application, such as a database server, it is estimated that the transition from 4 byte pointers to 8 byte pointers has the effect of reducing a data cache by approximately 15%. Indeed, it has been suggested that a multi-core server with 32GB RAM should run two 32 bit instances of MySQL Server rather than one 64 bit instance. (The majority of the RAM is best managed by an Operating System's LRU storage cache.)
As a matter of practicality, processors may be pushed towards split banks of data registers and address registers. This isn't perfect but it may be preferable to a reduction in registers as their size increases. An asymmetrical design allows data registers to be significantly wider than address registers and computations may be performed independently on each bank. Numerous designs have elements of this, including IBM's old systems, Data General Nova, Z80, MC68000 and, generously, Intel's Pentium.
Among the designs which have a distinct split between data registers and address registers, data registers obtain the most use. Under-utilized address registers (of the same width) get used as temporary storage. It would be possible to continue this practice if register width differs. However, truncation will occur.
In the most optimistic case, it is possible to conflate operands and addressing modes to sneak an extra bit of operand reference. For example, addressing modes such as data register, address register, address register indirect or address register indirect with post increment. Firstly, this example only requires two bits to implement a variety of useful addressing modes. Secondly, an increment operation can be performed within a register while data may be processed elsewhere. Incrementing a register by one, two, four and/or more is trivial if ripple carry is not a concern. However, rapidly incrementing or decrementing a register is a much larger problem.
It may be useful to provide two separate stacks for data and addresses. This would be in addition to a return address stack. For ease of implementation and maximum performance, each type of stack could be aligned to the width of each type of register. It may be possible for references into an address stack to be read-only or entirely absent.
In summary, the search for fast arithmetic has been ongoing for millennia. Unfortunately, the problem is getting worse rather than better. This is due to the potential volume of data and a secondary effect where the volume of data requires larger references.
(This is the 55th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
There is a famous result that 1/2 of instructions executed are move instructions and 1/6 of instructions executed are branches. A subset of the remaining 1/3 are ALU, FPU or similar instructions.
Maximize ALU And FPU Throughput
Ideally, a processor should allow full utilization of an ALU and/or FPU. However, an instruction stream only provides suitable instructions in 1/3 of cases. Attempts to increase instruction throughput may lead to additional ALUs which are also under-utilized. An example is the Intel Pentium which provides a supplementary integer unit. Processor architectures, such as the MIPS R10000, permit zero, one or two FPUs on the basis that FPU operations take two or more clock cycles whereas ALU operations generally take one clock cycle. One processor architecture allowed up to 16 co-processors and any of these may be FPUs. However, the physical arrangement of co-processors incurred a one clock cycle penalty per co-processor as instructions were handed along a daisy-chain. In the case of multiple FPUs, each successive co-processor provided less throughput.
Increased throughput may lead to similar paths of development on multiple processor architectures. Development of SIMD and vector pipelines on x86 followed an almost identical path on ARM. In the first iteration, a processor has no FPU support. In the second iteration, FPU functionality becomes increasingly integrated. In the third iteration, SIMD uses existing FPU registers. (Intel MMX, AMD 3D Now and ARM Neon.) This allows SIMD functionality to work with legacy operating systems. In the fourth iteration, SIMD obtains its own registers. In the fifth iteration, SIMD registers are expanded beyond the size of FPU registers. In the sixth iteration, a vector pipeline avoids the increased state of SIMD.
Attempts to execute multiple instructions may be taken to extremes. Out-of-order execution on x86 uses a 3 bit register reference as a tag within a RAT [Register Alias Table] of 144 shadow registers or more. Meanwhile, data is transported to one of the numerous, specialized execution stations. Execution stations are provided in quantities which will maintain good throughput for typical workloads. Given that x86 effectively has more than 1000 instructions, intricate structures have developed to execute them effectively. However, this leads to curious holes in architectural models. For example, an Intel Pentium 3 incurred a large delay when switching between FPU instructions and SIMD instructions because the contents of the registers were transported to different parts of the processor.
As register width has increased to 32 bits or more, it is sufficient to represent IEEE754 single precision values (or larger) in one general purpose register. Indeed, given the monotonic representation of IEEE754, integer and floating point operations may share hardware. Only positive infinity, negative infinity, not-a-number and, possibly, denormalized values require special handling.
(The monotonic bit-level representation of IEEE754 allowed an integer one bit right shift (integer divide by two) to be used as an approximation for a function used in a Newton-Raphson method for reciprocal square root calculations; most famously used in Quake and in ray-tracing distance calculations. For square root approximation, the right shift of the positive mantissa into the top bit of the exponent is inconsequential and the right shift of the bottom bit of the exponent into the mantissa remains monotonic. This leads to a Newton-Raphson approximation which always converges. However, it requires the use of a very opaque constant. This technique fell out of general use when direct manipulation of FPU state incurred more clock cycles than using the hardware floating point square root function. However, this wasn't soon enough.)
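For reference, the widely circulated form of this trick approximates the reciprocal square root (from which a square root is one multiply away) and uses the opaque constant mentioned above. A sketch, using memcpy to avoid undefined behavior from pointer casting:-

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float rsqrt_approx(float x)
{
    float half = 0.5f * x, y;
    uint32_t i;
    memcpy(&i, &x, sizeof i);          /* reinterpret the bits */
    i = 0x5f3759df - (i >> 1);         /* shifted exponent, opaque constant */
    memcpy(&y, &i, sizeof y);
    return y * (1.5f - half * y * y);  /* one Newton-Raphson step */
}

int main(void)
{
    printf("%f (exact 0.5)\n", rsqrt_approx(4.0f));  /* ~0.499154 */
    return 0;
}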
The distinction between ALU operations and FPU operations is decreasing. It is also possible to eliminate a distinction between ALU and SIMD. Like MC68000 and ARM, a FirePath instruction typically specifies data size. However, rather than only process one unit of data, multiple units of data may be processed in parallel. This may seem like a waste of energy but many ALUs perform computation on a full datapath even if a register only latches the bottom bits of the computation. So, instead of conditionally splitting a register into pieces, it is more efficient (and more symmetric) to conditionally split an ALU into pieces. For addition operations, carry operations may be conditionally inhibited on byte or word boundaries. This incurs very little penalty and may increase computation speed where registers are wider than 64 bits.
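Inhibiting carries on byte boundaries is the same idea as the classic SWAR idiom, sketched here for a 64 bit register split into eight 8 bit lanes:-

#include <stdint.h>
#include <stdio.h>

static uint64_t add_bytes(uint64_t x, uint64_t y)
{
    const uint64_t low7 = 0x7f7f7f7f7f7f7f7fULL;
    uint64_t sum = (x & low7) + (y & low7);  /* carries stop at bit 7 */
    return sum ^ ((x ^ y) & ~low7);          /* patch top bit of each lane */
}

int main(void)
{
    /* 0xFF + 0x01 wraps to 0x00 within its lane; neighbors are unaffected. */
    printf("%016llx\n",
           (unsigned long long)add_bytes(0x00FF010203040506ULL,
                                         0x0001010101010101ULL));
    return 0;  /* prints: 0000020304050607 */
}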
Share ALU Among Hyper-Threads
I previously considered an in-order, super-scalar processor architecture where one execution unit implements a full range of instructions including branches, traps and fences in addition to all ALU and FPU operations. Successive units implement a diminishing subset of instructions. This may extend beyond eight units. Unfortunately, this arrangement is grossly inefficient for multiple reasons. In the general case, the first ALU will obtain almost 1/3 utilization and subsequent units will be used sporadically. Total ALU utilization will be about 1/3 of one unit. Ignoring this, parallel execution of instructions is limited by source and destination references. For operations on registers only, each instruction may remove one or more registers from the pool of uncontested references. The pool is quickly exhausted by typical reference patterns. There is a small upper bound when register references are typically three or four bits. Whereas, instruction density is greatly compromised and context switching is greatly increased if register references are larger. Other addressing modes create further problems. For example, multiple reads via virtual memory. In this case, one or more memory exceptions may occur along a sequence of instructions executed in parallel. Without the benefit of out-of-order execution, FPU timings lead to further diminishing returns.
Share Exotic Instruction Implementation Among Hyper-Threads
While considering super-scalar processor architectures, hyper-threading, ALU throughput and attempts at energy conservation through the use of lightweight cores, it occurred to me that something akin to AMD's Bulldozer processor architecture can be generalized. AMD learned from one of Intel's mistakes. Intel's Pentium 4 was unpopular because it failed to improve execution speed of legacy code. It was a good design which worked well with the output of contemporary compilers. However, some legacy code ran slower on an Intel Pentium 4. AMD was careful not to repeat this with the Bulldozer processor architecture and devised an efficient scheme for SIMD. (For legal reasons, I'm going to be careful with the terms processor, core and hyper-thread.) For every pair of processor hyper-threads, SIMD execution units would fuse together for the execution of 256 bit SIMD. This did not adversely affect legacy code if it did not have SIMD instructions. This did not adversely affect legacy code with 128 bit instructions. Crucially, it did not significantly affect legacy code with 256 bit SIMD. This arrangement disproportionately reduced the size of each processor hyper-thread (and residual energy consumption) while allowing more hyper-threads per chip or more chips per wafer. Unfortunately, AMD was successfully sued by consumers who thought they were being defrauded. (Given that AMD had made extensive effort to ensure smooth operation for a transitional design, I doubt that many sued in good faith. Furthermore, my preference would have been for AMD to win this case and for the consumers who sued hard disk manufacturers over deceptive storage capacity to succeed. However, that would affect more parties over a longer period of time.)
My intention is to fully utilize an ALU. This is above all other considerations. Indeed, the very important consideration of code density is for the purpose of maximizing ALU throughput. If there is a bottleneck from main memory to instruction decode then ALU capacity is wasted. However, when instructions are brought to decode, approximately 2/3 of them do not instruct an ALU to do anything. The trivial solution to this problem would be to share one ALU among three instruction streams. For workloads of three or more threads, this arrangement approaches 100% ALU utilization. If ALU saturation is the objective (and practical implementation), it is possible to skew the ratio and have three ALUs and 10 sets of registers. My experience is that common instructions may be implemented locally per register set or per pair of register sets. Exotic instructions can be shared more broadly. As a practical consideration, it is desirable to keep wires short. This reduces latency. It is also desirable to minimize the number of transistors in critical paths; ideally to 14 or fewer. This reduces switching times and allows a processor to operate at faster speeds.
If each register set has a local addition unit and a local left shift unit, this is sufficient to implement arbitrary addressing modes. This allows most data structures to be traversed without invoking a global lock on a shared ALU. Multiple multiplication units would be a benefit for many applications. The traditional workaround to FPU being slower than ALU is to have two FPUs. Where there are an even number of register sets, half may have access to a particular FPU and all have access to one ALU. So, we could have a scale-out arrangement of eight register sets, eight integer addition units, eight left shift units, two integer multiplication units, two FPUs and one ALU. I do not suggest four or more integer multiplication units because a flash multiply unit requires a significant number of transistors and large bursts of energy.
Floating point multiplication requires addition of exponents, multiplication of mantissa bits and a round of post-processing. Floating point multiplication may be slower and less precise but it requires fewer transistors and is less energy intensive. On this basis, the number of integer multiplication units should not exceed the number of floating point multiplication units. If a floating point unit has two or more multiplication units (to exploit Euler's identity, exp(i z) = cos(z) + i sin(z), or otherwise accelerate calculations) then it may be sensible to correspondingly increase integer multiplication units.
With two integer multiplication units, two FPUs and one ALU, there would be five locks on shared resources. If there was no correlation between instructions or instruction streams (and all instructions had the same execution time) then eight hyper-threads, each incurring a lock on 1/3 of instructions, would present 8/3 lock requests per cycle spread over five locks. With local move, local addition and/or other uncontested instructions, a lock may be required for less than 1/5 of instructions. Total lock contention would be correspondingly reduced or permit additional hyper-threads. Unfortunately, this is optimistic. Hyper-threads are more likely than average to execute the same ALU operation or FPU operation. Perversely, this makes fine-grain resource locking less useful than average.
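The no-correlation case can be approximated with a toy Monte Carlo simulation. The uniform lock choice and fixed 1/3 ratio are idealizing assumptions, so real contention would be worse:-

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int threads = 8, locks = 5, cycles = 1000000;
    long stalled = 0;

    srand(1);
    for (int c = 0; c < cycles; c++) {
        int want[5] = {0};
        for (int t = 0; t < threads; t++)
            if (rand() % 3 == 0)           /* 1/3 need a shared unit */
                want[rand() % locks]++;
        for (int l = 0; l < locks; l++)
            if (want[l] > 1)
                stalled += want[l] - 1;    /* losers wait this cycle */
    }
    printf("stalled requests per cycle: %.3f\n", (double)stalled / cycles);
    return 0;
}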
For reliable execution, all registers, execution units and datapaths should be doubled. If any sustained bit error emanates from a register set then an exception occurs. It is not recommended that reliable and unreliable hyper-threads are mixed within the same system. Reliable hyper-threads may be implemented in a form which wastes a binary tier of cache coherence and/or main memory interface. Reliable hyper-threads also require execution unit locks to be acquired in pairs. This significantly reduces utilization of execution units and, in the general case, invites deadlock scenarios. Furthermore, deployment of a reliable kernel and unreliable application greatly complicates interrupts, system calls and process threading. Specifically, a switch from user-space to kernel-space may require a neighboring hyper-thread to be suspended so that kernel code can be executed while maintaining a second copy of the registers. This may also cause deadlock. Specifically, where the number of hyper-threads varies, priority inversion may occur.
Fused Instructions
It may be useful to consider fused instructions. This is where two or more common instructions are combined into one operation. I shied away from fused instructions for many reasons:-
Despite these concerns, it may be beneficial to investigate. An example of a fused instruction would be a conditional move. Moves are common. Conditional branches are common and conditional branches may be made around move instructions. When moves typically take one clock cycle and branches typically take two or more clock cycles, a conditional move has considerable advantage even if the throughput of datapaths is modulated.
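In C, the pattern is a ternary select, which compilers commonly (though not necessarily) lower to a conditional move rather than a branch around a move:-

int clamp_floor(int x, int floor)
{
    return (x < floor) ? floor : x;   /* candidate for one cmov */
}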
PIC generalizes this to the extent of having an "ignore the next instruction" instruction (which can be daisy-chained). This is typically used to implement conditional branches but may also be used to implement conditional move and conditional arithmetic.
I was appalled that ARM implemented conditions as a 4 bit field on every instruction. There's a difference between theory and practice. For example, to obtain sensible code density, it is preferable to not have condition codes on all of the millions of system calls. Admittedly, this consistency has allowed implementation of Thumb mode. However, this has replaced one inefficiency with another. From disassembling Thumb code, I confirm that the 16 bit instruction format has a 4 bit conditional field and one value for this field ("never?") defines the 32 bit instruction format - but without the ability to execute it conditionally. Meanwhile, this 2:1 instruction size difference has inferior code density to Xtensa's 3:2 instruction size difference.
Move is historically implemented via an ALU where one of two inputs is faithfully propagated. On 3-address RISC architectures, move may be implemented as addition with a constant zero or addition with a dummy zero register. However, even if the datapath for a move operation conceptually goes around an ALU, it is wasteful to actually send it around an ALU. Furthermore, for a 3-address processor architecture, it may be more efficient for any 2-address operation (move, not, negate) to use a spare operand field as a conditional field. A 3 bit field allows implementation of eight conditions: "always", "never" and six inequality tests. "Never" may be used as an escape value or a privileged instruction, such as a system call. A 4 bit field allows a wider range of tests.
Multiply and accumulate is a common fused instruction for integer and floating point operations. Technically, it should be a 4-address instruction because it reads three operands and writes one operand. Thankfully, it is quite practical to implement within a 3-address processor architecture. Unfortunately, as currently proposed, the datapath for FPU MAC and integer MAC may or may not have symmetry. For floating point MAC, data conceptually flows from a multiplication unit to an addition unit, in one clock cycle, within an FPU. Whereas, for integer MAC, an integer addition unit may be local to a hyper-thread on the basis that integer addition may also be used for address calculation. However, efficient instruction decode may encourage symmetry between integer MAC and floating point MAC. Therefore, an integer addition unit located with a shared integer multiplication unit may provide the most symmetry with FPU functionality. It also avoids having eight or more addition units which must switch between two sets of inputs: either two source operands (for addition) or the multiplied output of the two source operands added with the destination operand (for multiply and accumulate). However, if an addition unit local to a register set does not have multiplexed inputs then the input to a destination register requires multiplexing from an additional remote output. This is a typical example where savings (transistors, latency) are not as great as initially expected. It also shows that permutations must be enumerated to find an optimal solution.
Pre-scaling one input is common on ARM. A 4 bit field allows a wide variety of even left shifts. (With a penalty of one instruction and one register, less common odd left shifts can also be obtained.) Where there is an asymmetry of addressing modes, it is most useful to have pre-scaling on the operand which has the widest range of addressing modes. This includes constants. Given that left shift can be applied everywhere, it is not required as a separate instruction. Likewise for addition.
Fibonacci Length Instructions
Consideration of opcode bit patterns should allow local units and shared units to be arranged in a manner which does not create tortuous decode logic. Ideally, we want 1 byte, 2 byte, 3 byte and/or 5 byte instructions which are BER and where extension bits form part of the opcode. To reduce BER overhead where bits provide no gain to the opcode, we may have a Fibonacci sequence of 1 byte, 2 byte, 3 byte, 5 byte or 8 byte instructions. We could implement 1 byte, 2 byte, 4 byte or 8 byte instructions but this reduces instruction density and doesn't greatly simplify execution of an unaligned instruction stream.
Within the first byte of an instruction, we could have 00sssddd for move, 01sssddd for 2-address integer addition and 1xxxxxxx for longer instructions. Furthermore, sss and ddd may represent even and odd registers. So, 00-000-000 represents MOV R0,R1 rather than a less useful MOV R0,R0. In this case, instruction density matches or exceeds Z80 or x86. It also provides implementation options for local units.
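A decoder for this first byte is nearly a one-liner. The even/odd register mapping below is one reading of the scheme; every detail beyond the text is an assumption:-

#include <stdio.h>
#include <stdint.h>

static void decode_first_byte(uint8_t op)
{
    if (op & 0x80) {                      /* 1xxxxxxx */
        printf("longer instruction, extension bits follow\n");
        return;
    }
    int src = ((op >> 3) & 7) * 2;        /* sss names even registers */
    int dst = ((op     ) & 7) * 2 + 1;    /* ddd names odd registers  */
    printf("%s R%d,R%d\n", (op & 0x40) ? "ADD" : "MOV", src, dst);
}

int main(void)
{
    decode_first_byte(0x00);   /* MOV R0,R1 */
    decode_first_byte(0x49);   /* ADD R2,R3 */
    decode_first_byte(0x80);   /* escape to a longer instruction */
    return 0;
}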
The second byte provides additional bits for opcode and operand. This may extend beyond register only and immediate mode. The third byte provides a 3-address operand (or condition code) and pre-scaler. There is no advantage in providing these fields in shorter instructions. Subsequent bytes may provide more addressing modes and immediate data. For the different instruction lengths, opcode fields may be regarded as separate instruction-spaces. Although it is not essential for there to be a correlation between opcodes presented by instructions of different length, it provides implementation options. It may be beneficial for even opcodes to specify integer operations and odd opcodes to specify floating point operations. It may also be beneficial to have a correlation between integer opcodes and floating point opcodes. For example, the following instructions may have contiguous opcodes:-
And correspondingly:-
The size of the datapath may be split across multiple bytes of the instruction. In the trivial case, it may be beneficial to specify 8 bit or 16 bit operation within the first two bytes of an instruction. Potentially, a 16 bit processor would not have to decode an instruction which is longer than 16 bits. However, if instructions are not aligned on even byte boundaries, this could be stretched to a one bit field in the third byte. On 32 bit processors and 64 bit processors, an additional bit may appear in longer instructions. On implementations up to 1024 bit, where datapaths are assumed to be wide and/or fast, a third bit to specify data size may only occur in much larger instructions. A complication to this arrangement is that floating point values are typically 16 bit, 32 bit, 64 bit, 80 bit, 96 bit, 128 bit, 192 bit or larger. 80 bit is deprecated but other values may have to be shuffled around. Regardless, the default width is the size of the machine and therefore longer instructions may only be required to coax a machine into SIMD.
Historically, it was useful to arrange opcodes to maximize use of logic gates within ALU. For example, gates may be shared between addition and exclusive or. Nowadays, it may be more useful to arrange opcodes by the number of operands and the envisioned distribution of computational units.
Instructions As Binary Fractions
When moving from fixed length instructions to variable length instructions, I had the epiphany that instructions should not be considered as numbered patterns from 0 to 65535 (or suchlike) with optional extraneous data. Instead, instructions should be considered as a binary fraction from 0 to 1. Ideally, one or more patterns within the shortest length instruction should be privileged. To do otherwise creates difficulty when attempting to set an unlimited number of breakpoints. (Multiple breakpoint addresses? Reserved pattern? Debug mode? Trampolines?) Furthermore, it is beneficial to have certain instructions within contiguous ranges. For example, breakpoints, traps, fences and branches. In the trivial case, one contiguous range of instructions is sequential. This provides the most options when implementing a scale-out design.
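As an illustrative sketch (the range boundaries are invented), treating the leading bytes as a binary fraction reduces instruction classification to range comparison, regardless of instruction length:-

#include <stdint.h>
#include <stdio.h>

static double as_fraction(const uint8_t *insn, int len)
{
    double f = 0.0, scale = 1.0 / 256.0;
    for (int i = 0; i < len; i++, scale /= 256.0)
        f += insn[i] * scale;
    return f;  /* 0.0 <= f < 1.0 regardless of length */
}

static int is_branch_class(const uint8_t *insn, int len)
{
    /* An assumed contiguous range reserved for branches and traps. */
    double f = as_fraction(insn, len);
    return f >= 0.96875 && f < 1.0;
}

int main(void)
{
    uint8_t trap[] = { 0xF9, 0x00 };
    printf("%d\n", is_branch_class(trap, 2));  /* 1: inside the range */
    return 0;
}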
Stack Width And Register Rotation
I would like to implement register rotation, as popularized by 3-address SPARC. Indeed, if stack operations are exclusively register width and there are separate stacks for data and program return addresses (possibly different widths) then it is trivial to tie register rotation to the bottom bits of a data stack pointer. Unfortunately, this convenience hardware would impede super-scalar implementation. Indeed, MIPS and SPARC provide numerous examples of features which provide advantage in the trivial case but hinder more advanced implementation. An example from MIPS would be its deprecated branch delay slot.
If consistent instruction fields are used for memory and stack operations, and instruction fields define datapath width, but stack operations are always the full width, then the unused datapath width field of a stack operation may be used to modify a data stack pointer by multiple values. Indeed, available choices may be skewed for push and pop operations. This would eliminate tedium around stack frames in many cases.
Summary
Processor design is a typical technical subject where one design decision excludes another. Within general purpose processor design, decisions about register types and quantities lead to decisions about instruction format which lead to decisions about instruction execution. This chain of decisions extends below Boolean logic and down to the length of wires on a silicon chip, the number of transistors switched and any stray alpha radiation. (That is not an exaggeration.) A design may look excellent but have a terrible implementation. For a given design, no good implementation may be possible. It is invariably possible to extend a bad architecture via escape patterns. However, fixing the first iteration provides the most gains. Unfortunately, this requires consideration over many tiers where nothing is immutable. This can infuriate; especially when design veers towards an established design which is known to be flawed. Thankfully, if there is great foresight, it is possible to design something trivial which keeps the most useful options open.
I've found that wide registers allow amalgamation of integer, float and SIMD. I've found that designing synchronous hyper-threading is vastly easier than designing any form of super-scalar. This is particularly beneficial in conjunction with NUMA or virtual memory. I've also found that it is possible to apply pre-scale and/or accumulate to all relevant instructions with minimal penalty. It is also possible to fill spare fields with condition codes or stack modifiers. Oddly, this occurs in the most beneficial places. While some may be concerned that my largest proposal for a processor requires more than 16 kilobits of state for each of eight or more hyper-threads per core, the smallest implementation is much closer to an RCA1802 than a Sun Niagara, Intel Itanium or one of ARM's scale-out designs.
(This is the 54th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
It is a terrible situation when a person's best vaporware falls behind a shipping product. In my case, my vaporware fell behind the Xtensa processor architecture and I've spent at least a week rectifying this situation. I hope this isn't entirely idle work but there is definite satisfaction when devising something which might squeak ahead of a competitive product.
However, my objectives for processor design may not be obvious so I list them here:-
Design a practical processor architecture which can be implemented. My previous design (with three source datapaths and two destination datapaths) does not scale to the extent of other architectures.
Design a reliable processor architecture. I'm not following the classic telephone company philosophy of ensuring that every component exceeds 99.999999% reliability (with the intention that a path through a system exceeds 99.9999% reliability). However, if there is an obvious or non-obvious method of greatly increasing security or reliability then it should be taken. An example is watchdog timer functionality. An embedded system with a watchdog timer may be significantly more reliable than a system without a watchdog timer.
Design a scalable processor architecture. As an example, the Alpha AXP processor architecture was intended to provide 1000 times the performance of a DEC VAX. 10 times performance was expected via scale of integration and associated increase in processor frequency. 10 times performance was expected via clean implementation to aid super-scalar design and suchlike. 10 times performance was expected via multiple processors sharing the same banks of memory. In another example, compiled code for a 16 bit stack-based Transputer was upwardly compatible with a 32 bit Transputer. Indeed, the Transputer processor architecture was very much like a Burroughs B6000 transported into the era of the 80286. Like its spiritual mini-computer predecessor, Transputer implementations are very interchangeable because code utilizes very few architectural quirks.
It would be desirable to implement a processor architecture which works as a simple 16 bit embedded processor with minimal energy consumption and also works as a 64 bit (or more) architecture in clusters of 4096 nodes (or more). Clustering requires very good MTBF otherwise a system will crash or produce junk results very frequently. Likewise, if there are a significant number of nodes scattered like dust then reliability may also be problematic. In the worst case, sensors behind enemy lines or poured into concrete may be very difficult to replace. Even if it is easy to perform maintenance, is it cost effective to have a US$60000 per year technician who reboots lightbulbs?
Design a secure processor architecture. Popek And Goldberg virtualization requirements are not negotiable. AMD agrees but Intel thinks that your security is a matter of market segmentation. Overall, techniques are strongly encouraged if they reduce or eliminate buffer overflow, heap overflow, stack execution, gadget execution or allow malicious code transformation into 7 bit ASCII, UTF-8 or UCS2. That may require multiple stacks, tagged memory and/or encrypted instruction streams. I'm particularly reluctant about abuse of the latter but Bruce Schneier (in the book Applied Cryptography?) states that it is better to have 10% of the resources of a trusted computer rather than 100% of the resources of an untrusted computer. Unfortunately, "It is difficult to get a man to understand something, when his salary depends on his not understanding it!" and many people don't understand that trust comes from the user and is for the benefit of the user.
Design a clean processor architecture. This encourages design of practical, reliable, scalable, secure processor architecture while also allowing upward compatibility. If there is a "cool" feature then it should be avoided because it may be an architectural dead-end.
My preference is for a RISC processor architecture with flat memory, fixed length instructions, 3-address machine, general purpose registers which exclude program stack and program counter, a dedicated zero register, no flag register, no shadow registers, separate program and data stacks, register rotation, optional FPU, no SIMD and no vector pipeline. The primary reason to deviate from these choices is to increase code density. This is particularly important to maintain competitiveness. Without competitive code density, instruction storage, caching and throughput are greatly reduced. There is little to be gained by saving a few transistors inside of an ALU if millions of additional transistors are required outside of an ALU. Other reasons to deviate from a clean processor architecture include leveraging existing design elements, greatly increased throughput, compatibility with legacy code and/or maintaining the expectations of a general purpose computer.
Design for real applications not benchmarks. If a design works well with a mixed workload then it should perform moderately or better with a parallel workload. However, if a design works well with a parallel workload then it may not perform well with a mixed workload. As an example, GPUs work well with parallel tasks, such as graphics, but are otherwise slower than general purpose computers. In general, it should be preferable to add more processor cores rather than add more ALUs, more FPUs or more vector units.
A scale-out version of the previous design provided one hyper-thread and eight or more ALUs. However, utilization of the ALUs is grossly inefficient. Across multiple architectures, approximately 1/2 of executed instructions are move operations and 1/6 are branch instructions. Therefore, in the general case, only 1/3 of instructions are ALU or other instructions. A scale-out version of the next design provides eight hyper-threads, one ALU and supplemental units. If there is one active thread then performance is almost equal to the previous design. However, if there are eight or more active threads then the ALU approaches 100% utilization. Per transistor, this is more than eight times more efficient. In general, it is preferable to be bounded by computation rather than memory bandwidth because computational units may be duplicated more easily when memory bandwidth is not exhausted. (This also applies on a larger scale. Specifically, it is easier to scale application servers than database servers because application servers need not share any state.)
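A naive back-of-envelope model of this (my own arithmetic; it ignores memory and pipeline stalls, which in practice delay saturation until many more threads are active) is:-
#include <stdio.h>

int main(void)
{
    const double alu_fraction = 1.0 / 3.0; /* 1/2 moves + 1/6 branches leaves 1/3 */
    for (int threads = 1; threads <= 8; threads++) {
        double demand = threads * alu_fraction;
        printf("%d thread(s): ALU %3.0f%% utilized\n",
               threads, (demand > 1.0 ? 1.0 : demand) * 100.0);
    }
    return 0;
}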
Implement old ideas. This is especially true if patents have expired.
Use the best elements from other architectures. When appropriate, I have mentioned processor architectures such as RCA1802, TMS9900, 6502, Z80, MC68000, x86, AVR, ARM, MIPS and Xtensa. To a lesser extent, I have also been influenced by IBM System 360, Data General Nova, Burroughs B6000, Transputer, XMos, PDP-11, VAX, Alpha AXP, PIC, Xerox Alto, TMS32020, MC88000, SPARC, PowerPC, Intel 432, Intel 860, Intel Itanium, MIL-STD-1750, FirePath, MMIX, NandToTetris, RISC-V and a mad balanced ternary design by the Russians (the Setun).
It is possible to find similarities between processor architectures. For example, there are similarities between the instruction fields of the Z80 and MC68000. Or between the three 16 bit index registers of the Z80 and AVR. Or the pattern of one data register and one address register on NandToTetris, two data registers and two address registers on a Data General Nova and eight data registers and eight address registers on MC68000. Or the hybrid of MC68000, x86 and RISC available in XMos.
It is also possible to synthesize ideas. For example, a 3-address processor architecture with eight or more stacks may be feasible. Or consider an SIMD version of MC68000 where ADD.B performs four 8 bit additions rather than modifying the bottom 8 bits of a 32 bit register.
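As a sketch of the latter (my own illustration of the well-known carry isolation trick, not a proposal for real MC68000 silicon), four 8 bit additions can share one 32 bit adder if carries are stopped at lane boundaries:-
#include <stdint.h>
#include <stdio.h>

/* Add the low 7 bits of each lane, then patch the top bit of each lane
   back in via XOR, so no carry crosses from one lane to the next. */
static uint32_t add_b_simd(uint32_t a, uint32_t b)
{
    uint32_t low = (a & 0x7F7F7F7FU) + (b & 0x7F7F7F7FU);
    return low ^ ((a ^ b) & 0x80808080U);
}

int main(void)
{
    /* 0xFF + 0x01 wraps to 0x00 within its own lane; neighbours are unaffected. */
    printf("%08X\n", add_b_simd(0x01FF10FFU, 0x01010101U)); /* prints 02001100 */
    return 0;
}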
Don't guess. Use empirical research. I've comprehensively shown that it is practical to restrict subroutine calls, loops and forward branches to instruction cache-line boundaries. Indeed, even on architectures which do not enforce this restriction, it is beneficial to do so. Specifically, MySQL Server natively compiled with gcc on a Raspberry Pi with 64 byte boundaries reduces stored procedure execution times by 45%. An architecture which pre-scales addresses would obtain better performance (and possibly better security).
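For anyone wishing to reproduce this (the exact values are my assumption; the build flags were not recorded above), gcc accepts alignment flags of the form:-
gcc -O2 -falign-functions=64 -falign-loops=64 -falign-jumps=64 -c example.c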
It is tedious but possible to simulate queues, memory access patterns and other resource contention. Indeed, it is possible to simulate a hyper-threaded processor running real applications, such as a word processor or web browser. From this, it is possible to collect statistics, such as instruction length, opcode frequency, conditional branch frequency, register usage, addressing mode frequency, cache hits, ALU contention, FPU contention and memory contention. It would also be possible to collect more detailed statistics, such as determining the most frequently executed instruction before or after usage of a particular opcode, register or addressing mode. It is also possible to collect cycle accurate timings through particular sequences of code.
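As a minimal sketch of the statistics gathering (step_simulator() here is a hypothetical stand-in for one fetch/decode/execute step; a real simulator would return the opcode it just executed):-
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define OPCODE_COUNT 64 /* assumption: a 6 bit primary opcode field */

static uint64_t histogram[OPCODE_COUNT];

/* Hypothetical stand-in for one simulator step. */
static unsigned step_simulator(void) { return (unsigned)(rand() % OPCODE_COUNT); }

int main(void)
{
    for (unsigned long i = 0; i < 1000000; i++)
        histogram[step_simulator()]++;
    for (unsigned op = 0; op < OPCODE_COUNT; op++)
        printf("opcode %2u: %llu\n", op, (unsigned long long)histogram[op]);
    return 0;
}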
Avoid side-effects and conflation. This can be difficult because established practices become ingrained. A good example is the flag register. This is state which lingers between instructions. Furthermore, modifications to flags occur as a side-effect of instruction execution. This may create difficulties for super-scalar implementation. It may also increase the duration of context switching. A further difficulty with flags can be found by comparing MC68000 against MC68010. The minor step revision correctly enforces privileges on half of the 16 bit flag register. This allows virtualization to work correctly.
A practice discontinued after ARMv1 or so was the conflation of program counter and user flags. Furthermore, RISCOS continued the practice (from Acorn's 8 bit computers) where system calls returned information in the carry flag and therefore it was possible to make a system call and then conditionally branch. In the kernel, this required setting or clearing the bottom bit of element zero of the stack. Although this conflation eases context switching, it greatly reduces the address-space for executable code. This was mitigated through the observation that aligned, 4 byte instructions don't require the program counter to hold anything useful in the bottom two bits. Unfortunately, this observation is incompatible with 2 byte Thumb instructions and 1 byte Jazelle instructions.
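A toy model of that packing (my own illustration with two flag bits for brevity; the real ARM allocated flag and mode bits differently across revisions):-
#include <stdint.h>
#include <stdio.h>

#define FLAG_BITS 0x00000003U /* toy model: two flag bits */

/* pack() only works because a 4 byte aligned pc always has zeroes in the
   bottom two bits; 2 byte or 1 byte instructions break this assumption. */
static uint32_t pack(uint32_t pc, uint32_t flags)
{ return (pc & ~FLAG_BITS) | (flags & FLAG_BITS); }

int main(void)
{
    uint32_t word = pack(0x8000U, 0x2U);
    printf("pc=%#x flags=%#x\n",
           (unsigned)(word & ~FLAG_BITS), (unsigned)(word & FLAG_BITS));
    return 0;
}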
Avoid fragmentation. ARM, Java and Android are particularly bad in this regard - and there is a strong correlation between these architectures.
There may be further conflicting constraints.
I just ordered an Antminer L3+ so I can mine LiteCoin. When used with an L3+, the APW3++ power supply can be plugged into 110V. The S9 BitCoin miner requires 220V.
I expected the L3+ to make me rich beyond my wildest dreams with its yield of $7000/year. The S9 profit is about the same but it uses 500W more.
But today I read somewhere that Ethereum can't be mined with ASICs. One must - "must" - use a CPU or GPU. It seems that Ethereum's proof-of-work, Ethash, is designed to defeat ASICs. Rather than lots of arithmetic, as in the case of BitCoin mining, Ethash requires a lot of memory. (Scrypt, which the L3+ mines, is a different memory-hard scheme.)
I was puzzled by that so I looked into it when I got home this evening - and yes, one can use a CPU or GPU.
But to gain the advantage of one's GPU, it must have enough memory to fit the entire "DAG file" into video RAM. Presently that file is somewhat less than 2 GB but it is slowly growing.
I found a review of the AMD Sapphire Nitro+ Radeon RX 470 that claimed it could perform 2.9 MH/s.
Newegg's page on the 470 recommends a 500W power supply. My Linux box has a 1000W supply. But let's suppose it needs all those 500 watts.
An Ethereum mining profit calculator, which I discovered by praying to Google, predicts a yearly profit of one hundred thousand dollars at the 8.16 cents per kilowatt-hour we pay here in the Pacific Northwest with its abundance of hydroelectric dams.
I gotta get me some of that.
I expect that outrageous profit is not yet well-known because BitCoin is getting all the press.
I participated in NAGA's Initial Coin Offering on Friday. I bought 1,400 tokens at one dollar apiece. NAGA is a regular finance firm, so it's regulated. I didn't have time to read their whitepaper before the ICO closed but I trust the opinion of the wise old friend who recommended NAGA to me.
Presently I own approximately equal amounts of BitCoin, BitCoin Cash, LiteCoin, Ethereum, Dash and B2B.
The B2B will quite likely turn out to be a mistake. I bought it because CoinMarketCap's All Cryptocurrencies table said that B2B was going rapidly upward in price.
It was only after I bought some that I looked at the table a second time and found that B2B had a daily volume of roughly $50k. With such a low volume, just one sale or purchase will alter its price by a significant amount.
Really, the wise thing would have been to sell it all back but I decided to hold onto it. You know, just like my LivePicture stock certificate - it sure is pretty!
LivePicture was the only company whose options ever vested for me. That experience led me to avoid startups entirely.
But I've been working part-time for one for a couple of weeks now. Sorry, it's in stealth mode so I can't give you a clue, but I am convinced its business plan is sound.