The Goldilocks processor

31 July 2008

If there is one thing that troubles processor architects right now, it's working out how many cores they should stick on a die. The number of transistors they can plant on a chip doubles every two years and there's no sign of that supply running dry in the next five years.

What's the problem? Just take the processor core you have already and then step and repeat it across the die. It's worked for graphics processors.

Unfortunately, only some software parallelises well enough to spread across many cores. Often, the overhead of distributing the work outweighs the advantage of running the code in parallel. This, in effect, is the caveat that Gene Amdahl captured in his eponymous law of computer performance.

In its most basic form, Amdahl's Law says it's only worth speeding up things that you do a lot. Big, nested loops are good targets. Lots of branching straight-line code? Not worth the effort. With parallel processors, if you can spread the work of loops over many of them, you see a speed-up. But there is a limit governed by how much code you need to run on just one processor.
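In symbols, Amdahl's Law says that if a fraction f of the work parallelises perfectly over n cores, the overall speedup is 1 / ((1 - f) + f/n). A quick sketch (the figures are illustrative):

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Classic Amdahl's Law: the serial part runs at the old speed,
    while the parallel part is divided across n_cores."""
    serial_fraction = 1 - parallel_fraction
    return 1 / (serial_fraction + parallel_fraction / n_cores)

# Even code that is 95 per cent parallelisable tops out below 20x,
# no matter how many cores you throw at it.
print(amdahl_speedup(0.95, 1000))   # roughly 19.6
print(amdahl_speedup(0.95, 10**6))  # still under 20
```

The hard ceiling of 1/(1 - f) is exactly that limit governed by how much code has to run on just one processor.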

In a paper published in this month's IEEE Computer, a pair of researchers from the University of Wisconsin-Madison - one of whom has since moved to Google - has attempted to extend Amdahl's Law to multicore processors in which the cores need not all be the same size.

Up to now, most of this kind of spreadsheet-based performance estimation has assumed that all of the processors are the same. Last year, at the Design Automation Conference in San Diego, Intel's Shekhar Borkar looked at the trade-offs facing the company as it weighs moving from four-, eight- and sixteen-core chips to designs that, in the next decade, could support thousands of on-chip processors.

Borkar cited Pollack's Rule, which says the performance increase of a processor is roughly proportional to the square root of the increase in its complexity. Put another way, if you double the number of transistors in a processor, the chances are that you will only get a speedup of about 40 per cent.
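Pollack's Rule in numbers: performance is roughly proportional to the square root of complexity, so doubling a core's transistor budget gives a factor of √2 ≈ 1.41 - the roughly 40 per cent quoted above. A minimal sketch:

```python
import math

def pollack_perf(transistor_ratio):
    # Pollack's Rule: single-core performance grows roughly as the
    # square root of the increase in complexity (transistor count).
    return math.sqrt(transistor_ratio)

print(pollack_perf(2))  # about 1.41: doubling transistors buys ~40 per cent
print(pollack_perf(4))  # quadrupling the transistors only doubles performance
```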

In this environment, having lots of fairly simple cores looks to be the favourite option. But power becomes a big, big problem. Borkar noted that sticking slavishly to this plan would have a devastating impact on power consumption, because most of the energy goes into the interconnect mesh that feeds all those processors.

Using 100 much bigger, more complex cores instead would bring the power needed for the mesh down by a factor of ten, to just 15W. "What we need is a really careful balance," he said. "Just because we can integrate thousands of cores, don’t get carried away and implement a thousand cores."

Tilera's Anant Agarwal defined his own version of Pollack's Rule: the Kill Rule. This says: "A resource in a core must be increased in area only if the core’s performance improvement is at least proportional to the core’s area increase."

Agarwal reckoned that dual-issue superscalar processors will become the norm, as they largely satisfy his Kill Rule, although there is a broad spread of such designs. A complex dual-issue architecture might need 15 million transistors, including cache. The simpler designs offered by companies such as ARM and MIPS Technologies, which rely more on the compiler’s ability to schedule instructions effectively, might be squeezed into a 5 million transistor block.
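The Kill Rule is easy to express as a simple check. In this sketch the area and performance numbers are hypothetical, purely for illustration:

```python
def passes_kill_rule(area_before, area_after, perf_before, perf_after):
    # Agarwal's Kill Rule: grow a core resource only if performance
    # improves at least in proportion to the extra area it costs.
    return perf_after / perf_before >= area_after / area_before

# Hypothetical trade-off: tripling a core's area (5M -> 15M transistors)
# for only a 2x speedup fails the rule.
print(passes_kill_rule(5, 15, 1.0, 2.0))  # False
# Doubling area for a full 2x speedup just scrapes through.
print(passes_kill_rule(5, 10, 1.0, 2.0))  # True
```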

The argument from Mark Hill and Michael Marty, the Wisconsin-Madison researchers, is that you can do it all. With enough transistors to play with, you can have some brainiac superpipelined, superscalar monster that gets to work on all the straight-line code and have lots of smaller, simpler processors get on with the parallelisable stuff.

They show that, for a very large array, the monster-plus-servants approach delivers reasonable acceleration for code that is 90 per cent or more parallelisable, and certainly performs better than designs where all the cores are equal.
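The Hill-Marty model can be sketched in a few lines. Following the paper, a core built from r base-core equivalents (BCEs) runs serial code at perf(r) = √r, per Pollack's Rule; the chip's total budget is n BCEs. The f = 0.9, n = 256 figures below are one illustrative point, not an exhaustive sweep:

```python
import math

def perf(r):
    # Pollack's Rule: a core built from r base-core equivalents (BCEs)
    # runs sequential code sqrt(r) times faster than a base core.
    return math.sqrt(r)

def symmetric_speedup(f, n, r):
    # All cores equal: n/r cores, each built from r BCEs.
    return 1 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def asymmetric_speedup(f, n, r):
    # Monster-plus-servants: one big core of r BCEs for serial code,
    # plus (n - r) single-BCE cores that all join in on the parallel part.
    return 1 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

# For 90 per cent parallelisable code on a 256-BCE budget, pick the
# best big-core size for each layout:
sizes = (1, 4, 16, 64, 256)
best_sym = max(symmetric_speedup(0.9, 256, r) for r in sizes)
best_asym = max(asymmetric_speedup(0.9, 256, r) for r in sizes)
print(best_sym, best_asym)  # about 25.6 vs about 58.8
```

The asymmetric design wins by more than a factor of two at this point, which is the paper's central result.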

There is a warning here. This assumes that your only constraint is the number of transistors. It doesn't take account of interconnect overhead or power consumption. But it's a result that tends to vindicate the asymmetric design of the modern PC if you start to use the GPU for acceleration.

Where the gains could be greater is in the idea of a reconfigurable array. The idea is that you make a monster processor out of lots of little ones. The speedups on this look great. Then again, given that nobody knows how to make this kind of machine, you would hope so.
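Hill and Marty model this reconfigurable, or "dynamic", case by letting the chip fuse r base-core equivalents into one √r-fast core for the serial phases and split back into n simple cores for the parallel ones. A sketch of that formula, using the same illustrative f = 0.9, 256-BCE budget as before:

```python
import math

def dynamic_speedup(f, n, r):
    # Hill-Marty 'dynamic' case: r BCEs fuse into one core that runs
    # serial code at sqrt(r), then split into n cores for parallel code.
    return 1 / ((1 - f) / math.sqrt(r) + f / n)

# Fusing all 256 BCEs on demand comfortably beats both fixed layouts:
print(dynamic_speedup(0.9, 256, 256))  # 102.4
```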

The problem is that big processors are not made out of little ones. It's more that, inside every monster processor there is a little one and a chunk of stuff that tries valiantly to keep shoving data through that processor. You can't easily turn processors into prediction and speculation units, at least not with current architectures.

Hill and Marty argue that, despite this, the result is worth closer inspection. What if you could reorganise processors to act as speculation units? Or come up with new architectures that can be split apart and recombined as needed? One architecture looks like a contender for this kind of separable design: the TRIPS processor from the University of Texas. It starts out as an array of processors with an instruction-scheduling scheme that pulls them together into one virtual machine.

TRIPS calls for a change in how software is compiled. There is also the work on speculative compilers such as Codeplay's Sieve-C, which runs on regular combinations of PC processors and GPUs. It is designed to work on parallelisable code: it expands the amount of software that can run in parallel by allowing it to make mistakes and then fixing them up afterwards. A strategy like this might let the speculation work done in hardware by today's monster processors be performed in software on a bunch of simple ones.

There are no ready answers, but Hill and Marty have indicated some areas where it might be worth computer architects looking, just as long as the performance predictions aren't completely knackered the minute you invoke power consumption or interconnect overhead.


Regarding... "Where the gains could be greater is in the idea of a reconfigurable array. The idea is that you make a monster processor out of lots of little ones. The speedups on this look great. Then again, given that nobody knows how to make this kind of machine"... please take a look at the Ambric Am2045 MPPA device, in production with multiple embedded design wins and winner of the 2006 Microprocessor Report Innovation Award. (MPPA stands for Massively Parallel Processor Array.) The TeraOPS-class chip has 336 RISC processor cores and an interconnect fabric, with 25 patents pending, that makes MPPA programming practical - even fun, according to Ambric customers such as Telestream. The architecture scales to thousands of processors and offers the best energy efficiency available.

I think regular arrays like Ambric's suffer from the core problem of not running serial code very well. They are designed for data-parallel acceleration or, alternatively, a sea of easy threads.

If you have a workload that contains some "stubbornly sequential" code, they won't help you much. That's where the idea of having one or a few heavy-lifter cores on a die along with lots of little simple cores for the parallel part helps.

The key thing is that the heavy core is code-compatible with the lesser cores. Thus, when code gets annoyingly sequential, you can apply the strength of the heavy core to push through that section faster. Basically, it's like having a strong ice-breaker along on a sea expedition for the few cases where you do run into heavy ice.


The Loki project at Cambridge University in the UK advocates an array of simple processors that can be reconfigured.

In practice these architectures still don't solve the problem of how to distribute clocks at reasonable power at very small geometries (28nm and below).