Embarrassingly unoptimised

29 November 2007

At a workshop on advanced architectures earlier this week, ARM's director of research Krisztián Flautner made an unprovable but seductive assertion. He said upfront that he had practically no evidence for it. But it's one of those statements that has the ring of truth about it. And at its heart is something that may prove to be the big problem that faces microprocessor makers as they try to work out how many cores they should put on one piece of silicon.

To make use of tens or hundreds of cores, you need applications that are "embarrassingly parallel". But Flautner is not so sure such things exist. His rule? That there is no such thing as an algorithm that is embarrassingly parallel, just an algorithm that hasn't been optimised enough yet. He pointed to situations in graphics where operations that used to be parcelled up and split across many processors in a dumb way have been replaced by approaches that apply more local intelligence but which are much tougher to parallelise.

I can think of one or two algorithms that remain embarrassingly parallel - they just aren't that useful. Anyone who roved an electronics show in the late 1980s will remember people trying to push the Inmos transputer. If they weren't displaying scenes of shiny balls rendered by ray-tracing, they were processing the good old Mandelbrot set.

People have since found lots of other ways to render scenes that don't parallelise so well. It's hard to think of a smarter way to generate Mandelbrot set images than the way they were processed then. But why would you want to?
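The Mandelbrot set earns its "embarrassingly parallel" label because each pixel's escape count depends only on its own coordinate, so work can be split up with no communication between workers. A minimal sketch (the grid size, iteration cap and pool size here are arbitrary choices, not from the original transputer demos):

```python
from multiprocessing.dummy import Pool  # thread-backed Pool; a process pool works the same way

WIDTH, HEIGHT, MAX_ITER = 60, 40, 50

def escape_count(cx, cy):
    # Iterate z -> z^2 + c; return the step at which |z| first exceeds 2.
    zx = zy = 0.0
    for i in range(MAX_ITER):
        if zx * zx + zy * zy > 4.0:
            return i
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return MAX_ITER

def render_row(y):
    # Map the pixel grid onto the region [-2, 1] x [-1.5, 1.5].
    cy = -1.5 + 3.0 * y / HEIGHT
    return [escape_count(-2.0 + 3.0 * x / WIDTH, cy) for x in range(WIDTH)]

# Each row is independent: no worker ever needs another worker's result.
with Pool(4) as pool:
    image = pool.map(render_row, range(HEIGHT))

print(len(image), len(image[0]))  # 40 60
```

The absence of any inter-row data dependency is exactly what the transputer demos exploited - and exactly what Flautner argues tends to disappear once an algorithm is optimised.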

1 Comment

Like Gaul, the "parallel" or "concurrent" problem might be divided into three parts. What Krisztian was referring to is really a SIMD, multi-threaded kind of parallelism (where the "Instruction" might be a few instructions in a loop nest, for example, and you cut the data up into multiple portions that can be worked on by multiple processors in parallel, and then you string it all back together). Indeed, current applications with greater data dependencies and clever tricks may be much harder to parallelise in this simpler fashion.
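That cut-up-the-data pattern can be sketched in a few lines - one loop body, data divided into portions, results strung back together. The `loop_body` here is a deliberately trivial stand-in for the "few instructions in a loop nest":

```python
from concurrent.futures import ThreadPoolExecutor

def loop_body(chunk):
    # Stand-in for the "Instruction": the same few operations applied to every element.
    return [2 * x + 1 for x in chunk]

def parallel_map(data, workers=4):
    # Cut the data up into roughly equal portions, one per worker.
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(loop_body, chunks)
    # String it all back together.
    return [y for part in results for y in part]

print(parallel_map(list(range(10))))  # [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
```

The split/process/join steps only work this cleanly because no chunk depends on another - which is the property the comment says clever, optimised algorithms tend to lose.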

But there are other embarrassingly parallel applications if we don't consider just having one copy of the application running for a single user, but having multiple copies run by a service provider: a kind of SAMD (Single Application, Multiple Data). There are many applications in this domain going back to the first airline reservation systems and early banking systems - only the "cores" in the early days were potentially mainframes linked together in a primitive network. Today's neoclassical SAMD applications include web serving, where big-iron multicore processors from the likes of Intel, AMD and Sun (Niagara) might be serving multiple independent web queries at once. Another one which I know well is packet processing, where Cisco's CRS-1 router uses the SPP - silicon packet processor - with 192 processor cores on it, each potentially processing a different packet (I work for Tensilica and this uses Tensilica processors).
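The SAMD shape is easy to sketch: one application, many independent data items (web queries, packets), each handled start to finish by whichever core is free. The handler below is an illustrative placeholder, not any real server or router code:

```python
from concurrent.futures import ThreadPoolExecutor

def handle_request(packet):
    # Each request is self-contained: no state is shared between workers,
    # which is what makes this style embarrassingly parallel.
    return {"id": packet["id"], "status": "served"}

# Eight independent "packets" fanned out across four workers.
packets = [{"id": n} for n in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    replies = list(pool.map(handle_request, packets))

print(all(r["status"] == "served" for r in replies))  # True
```

The parallelism lives between requests, not inside any one of them - which is why throughput scales with core count even when each individual request is stubbornly sequential.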

The final class of concurrency (not really "parallelism" but definitely concurrent) is exposed in many signal and image processing applications, from baseband processing on a cell phone through to image processing on printers and video systems: asymmetric multiprocessing using pipelined dataflow concepts. Here one can take advantage of multiple heterogeneous processors, each optimised to one piece of an application, where each processes input tokens (frames, samples, etc.) in turn and passes them on to the next processor in the pipeline, which is optimised for the next part of the task. The first processor then starts working on the next token, frame or piece of data, so there can be a lot of concurrent processing going on. Pipelined dataflow applications keep popping up in all kinds of portable and non-portable consumer appliances as we want more multimedia. Here a very different programming style is needed: chop up the application into a sequence of tasks that each handle one part of the overall application, then devise the right processor for each stage.
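That pipelined-dataflow style can be sketched with one worker per stage, connected by queues: each stage consumes a token, hands its result downstream, and immediately starts on the next token. The two stage functions here are toy placeholders (think "decode" then "filter" on a media pipeline), not real codec code:

```python
import queue
import threading

def stage(func, inbox, outbox):
    # One heterogeneous "processor": apply this stage's work to each token
    # and pass the result downstream.
    while True:
        token = inbox.get()
        if token is None:        # sentinel: propagate shutdown down the pipeline
            outbox.put(None)
            return
        outbox.put(func(token))

q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=stage, args=(lambda t: t * 2, q0, q1)),  # "decode"
    threading.Thread(target=stage, args=(lambda t: t + 1, q1, q2)),  # "filter"
]
for t in threads:
    t.start()

for frame in range(5):           # feed tokens (frames) into the pipeline
    q0.put(frame)
q0.put(None)

results = []
while (out := q2.get()) is not None:
    results.append(out)
for t in threads:
    t.join()

print(results)  # [1, 3, 5, 7, 9]
```

While frame 1 is in the second stage, frame 2 is already in the first - the overlap between stages is where the concurrency comes from, even though each token flows through strictly in order.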

One notes that many of the new wave of multiprocessor or multiengine providers are looking to multimedia, especially video, for their best application space - often multiple streams (as in video security uses), which is a kind of SAMD as above, where each stream can use multiple pipelined processors as in the dataflow style. This is not exactly embarrassingly parallel, but it certainly has a ton of potential concurrency.