The "jump threading" compiler optimization (aka -fthread-jumps) turns conditional branches into unconditional ones on certain paths, at the expense of code size. On hardware with branch prediction, speculative execution, and prefetching, this can greatly improve performance. However, there is very little published research or documentation on it, and the Wikipedia article is short and incomplete.
The linked article has an illustrated treatment of common code structures and how these optimizations work.
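As a rough illustration of the idea (not taken from the linked article), here is a minimal C sketch of what threading a jump amounts to when done by hand. The function names and the step_a/step_b helpers are hypothetical, invented only for this example. In the first version the second test of flag is redundant, because its outcome is already known on each incoming path. The second version sends each path straight to its successor, duplicating a little code in exchange.

```c
/* Hypothetical helpers, invented for this sketch. */
static int step_a(int x) { return x + 1; }
static int step_b(int x) { return x * 2; }

/* Before: both paths meet at a join point and then test flag again,
 * even though each path already knows what flag holds. */
int classify_unthreaded(int x) {
    int flag = 0;
    if (x > 10) {
        x -= 10;
        flag = 1;
    }
    /* join point: second conditional branch */
    if (flag)
        return step_a(x);
    return step_b(x);
}

/* After: each path goes directly to the code that the second test
 * would have selected, so the second conditional branch is gone. */
int classify_threaded(int x) {
    if (x > 10) {
        x -= 10;
        return step_a(x);
    }
    return step_b(x);
}
```

Both functions return the same value for every input; only the control flow differs. A compiler would do this on its intermediate representation rather than on source code.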
(Score: 1, Interesting) by Anonymous Coward on Monday November 02 2015, @04:14PM
Seems similar to a technique used for in-order processors. Since conditional branch mispredictions are expensive (a misprediction has to discard instructions and refetch the correct branch), you execute both sides of a branch and simply "select" the correct result at the end. The "if" at the end simply forwards the correct result to the subsequent code, so no refetch is required. At worst, you have a one-cycle stall as you dump the last assignment (you should hint or structure the compare such that the processor always prefetches the last assignment). At best, you have a "SELECT" instruction, so there is no stall or branching to deal with (e.g. PowerPC fsel or Intel CMOV).
    if (x)
        result = foo(x);
    else
        result = bar(x);
turns into
    altresult = foo(x);
    result = bar(x);
    if (x) result = altresult;
Explaining why the following is sub-optimal is left to the student:
    result = bar(x);
    if (x) result = foo(x);
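A minimal C sketch of the "compute both, then select" pattern, under some loud assumptions: foo and bar here are hypothetical stand-ins that are cheap and free of side effects (otherwise running both sides is not a valid transformation), and whether the compiler actually emits a conditional move for the first form depends on the target and optimization level. The second form builds the select from a bit mask, so it contains no conditional in the source at all.

```c
#include <stdint.h>

/* Hypothetical pure stand-ins for foo and bar. */
static uint32_t foo(uint32_t x) { return x * 3u; }
static uint32_t bar(uint32_t x) { return x + 7u; }

/* Both sides are computed up front; the ternary only picks between two
 * values that already exist, which a compiler may lower to a
 * conditional move (not guaranteed). */
uint32_t select_result(uint32_t x) {
    uint32_t if_true = foo(x);
    uint32_t if_false = bar(x);
    return x ? if_true : if_false;
}

/* Same selection written with a mask instead of a conditional. */
uint32_t select_mask(uint32_t x) {
    uint32_t if_true = foo(x);
    uint32_t if_false = bar(x);
    /* mask is all ones when x is nonzero, all zeros otherwise */
    uint32_t mask = (uint32_t)0 - (uint32_t)(x != 0);
    return (if_true & mask) | (if_false & ~mask);
}
```

The trade-off is that both foo and bar always run, so this only pays off when they are cheaper than a mispredicted branch.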
(Score: 2) by Alfred on Monday November 02 2015, @08:32PM
Branch misprediction is expensive, but modern branch prediction is good enough to pay for itself in CPU cycles saved.
Even the original PlayStation's CPU had a set of branch-likely instructions. It seems to me that would be the easiest way: just tell the chip which way to prefetch, and only fail once per loop block, or less than half the time per conditional branch. I thought modern chips did the same, but I'm not sure; I know the MIPS R3000 in the PS1 did.
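For what it's worth, the closest source-level analogue to "tell the chip which way to go" is a static hint to the compiler. This is a minimal sketch assuming GCC or Clang, which provide __builtin_expect; the LIKELY/UNLIKELY macros and the sum_nonnegative function are hypothetical names made up for this example. The hint only influences how the compiler lays out the code. It is not a hardware branch-likely instruction, and the dynamic predictor still has the final say at run time.

```c
#include <stddef.h>

/* Hypothetical convenience macros around the GCC/Clang builtin. */
#define LIKELY(c)   __builtin_expect(!!(c), 1)
#define UNLIKELY(c) __builtin_expect(!!(c), 0)

/* Sums the non-negative entries of v. The hint says negative entries
 * are expected to be rare, so the compiler can keep the common path
 * as the fall-through. */
long sum_nonnegative(const int *v, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(v[i] < 0))
            continue;   /* rare path: skip negative entries */
        total += v[i];
    }
    return total;
}
```

The result is the same with or without the hint; only the generated code layout may change.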
Maybe that is what you were trying to say. I think the implementation in the article could be achieved by unrolling the first pass of a loop, or by some other code-bloating optimization. It is really just interesting graph theory.