EDA's acceleration option

12 April 2008

John Busco at John's Semi-Blog has pointed to the launch by Nascentric of an analogue-circuit simulator accelerated by nVidia's graphics processors, and wondered: "Will general-purpose GPU computing become the acceleration platform for EDA?"

I was sitting at the Many-core and Reconfigurable Supercomputing (MRSC) conference in Belfast the other week wondering the same thing. In recent years, hardware-specific EDA has been a dirty word. Mentor Graphics, which made its name selling proprietary workstations before it became a software-only company, made a foray back into hardware in a deal with Mercury Computer Systems in late 2006. Mercury used the IBM Cell processor – the same one used in the Sony PlayStation 3 – to speed up the job of checking chip designs before they go to fab. Mercury sells the hardware and Mentor provides a special version of Calibre.

It's not clear how well hardware acceleration has gone for Mentor and Mercury. However, in its 2007 annual report, Mercury declared that it saw a "slight rebound" in its semiconductor business, partly due to the sale of one accelerator for chip-mask inspection – which is not related to Calibre – and its deal with Mentor. The number-three EDA company has been busy showing off the hardware at events like the SPIE lithography conference, so the company must have some faith in the idea of speciality accelerators.

The algorithm in Calibre is probably a good candidate for acceleration by GPUs as well as the Cell. One thing that was noticeable from MRSC was that users in the academic environment there were not making that much use of Cell, but they were very keen to look at GPUs as well as field-programmable gate arrays (FPGAs) – the latter just happening to be the EDA acceleration technology that nobody really notices.

People have been using either emulators made from hundreds of FPGAs – OK, not that many people – or FPGA breadboards to simulate digital chips for years. Synplicity made such a good business out of doing tools for FPGA-based prototyping that Synopsys set aside the fact that it had killed off a tool to do the same thing a couple of years earlier and bought the company. (Actually, Synopsys has gone through three different FPGA synthesis tools in recent years – FPGA Compiler, FPGA Compiler II and DC FPGA – so we will wait and see how it does with the stuff it buys in.)

When Mentor unveiled its deal with Mercury, Joe Sawicki, the head of Mentor's Calibre operation, said they had changed the way the tool worked in such a way that it would better suit an accelerator like Mercury's. In the past, tools like Calibre used a sparse representation of the chip's surface to perform their analysis. Mentor's argument was that, at 45nm, the features on the surface of a chip are so densely packed that you might as well just chop it up into a regular grid and have at it with fast Fourier transform operations.

If there are two things that run well on accelerators, they are regular grids and FFTs. And I can't see a reason why a GPU would not be a potential candidate for the nmDRC software. I'd be surprised if Mentor wasn't looking at a GPU option.
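
To make that concrete – and this is my own toy sketch in Python, nothing to do with Calibre nmDRC's internals, with the grid and the Gaussian "proximity" kernel invented for the purpose – once a layout window is rasterised onto a uniform grid, something like a smoothing or proximity calculation collapses into one big element-wise job in the frequency domain, which is exactly the kind of dense, regular arithmetic that GPUs and Cell chew through:

```python
import numpy as np

# Toy example: rasterise a layout window onto a uniform grid and convolve it
# with a smoothing kernel via the FFT.
grid = np.zeros((1024, 1024), dtype=np.float32)
grid[100:200, 300:340] = 1.0          # a stand-in for one rectangle of polygon

# A made-up Gaussian "proximity" kernel, centred on the grid.
y, x = np.mgrid[-512:512, -512:512]
kernel = np.exp(-(x**2 + y**2) / (2 * 25.0**2)).astype(np.float32)
kernel /= kernel.sum()

# Convolution theorem: multiply in the frequency domain, then transform back.
blurred = np.fft.irfft2(np.fft.rfft2(grid) * np.fft.rfft2(np.fft.ifftshift(kernel)),
                        s=grid.shape)
print(blurred.max())
```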

But, not everyone is convinced that accelerators are the future. Srini Raghvendra of Synopsys made the point at the time that, with general-purpose multicore processors on the way from AMD and Intel, optimising for a dedicated accelerator from a single hardware vendor was unlikely to be a long-term option: "We believe we can be comfortable riding the general-purpose processor horse."

One thing that hits a lot of EDA software is bandwidth: between processor and memory and from memory to disk. It can take hours just to read a design in and hours to write it back out again. Your best bet might not be a funky accelerator but a half-decent storage area network with a bunch of fat pipes into the back of your server farm.

Then there is the issue of how much EDA software has actually been multi-threaded to run across multiple processors. With the kind of job that a tool like Calibre does, multi-threading is commonplace. If design teams don't want the Mercury accelerator they can just buy a bunch of Calibre licences and run them across a server farm. With layout checks, the shape of one logic gate does not affect another one just micrometres away. You can, with some limitations, chop up the grid into little chunks and distribute them to many processors without worrying too much.
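
Here is a toy sketch of that kind of partitioning in Python – my own illustration, with a made-up "rule", and nothing to do with how Calibre actually works: split the rasterised layout into tiles with a small halo of overlap so a check that looks a few cells sideways still sees its neighbours, farm the tiles out to a process pool and merge the results.

```python
import numpy as np
from multiprocessing import Pool

TILE, HALO = 512, 4    # tile edge and overlap, in grid cells

def check_tile(args):
    """Toy stand-in for a spacing rule: flag filled cells whose immediate
    right-hand neighbour is also filled."""
    (r0, c0), tile = args
    filled = tile > 0.5
    hits = filled[:, :-1] & filled[:, 1:]
    rows, cols = np.nonzero(hits)
    # Return global coordinates so results from overlapping tiles de-duplicate.
    return {(r0 + r, c0 + c) for r, c in zip(rows.tolist(), cols.tolist())}

def tiles(layout):
    n = layout.shape[0]
    for r in range(0, n, TILE):
        for c in range(0, n, TILE):
            r0, c0 = max(r - HALO, 0), max(c - HALO, 0)
            yield (r0, c0), layout[r0:r + TILE + HALO, c0:c + TILE + HALO]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layout = (rng.random((2048, 2048)) > 0.9).astype(np.float32)
    with Pool() as pool:
        flagged = set().union(*pool.map(check_tile, tiles(layout)))
    print(len(flagged), "cells flagged by the toy rule")
```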

A lot of EDA software is not so lucky. It is only recently that Spice simulators, such as those from Nascentric, have gone multi-threaded. As recently as last year, people were arguing over how useful multi-threading would be in that environment. Regular Spice is all about solving big sparse matrices. Fast-Spice simulators typically use sneaky mathematical tricks to avoid having to crunch through those massive matrices. The name of the game, as with a lot of EDA, is to convert a problem that scales with the square or cube of the number of elements into something much more linear, or even logarithmic.
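
You can see the scaling argument with a back-of-the-envelope experiment – purely illustrative, far simpler than anything in a commercial simulator: build a nodal-analysis-style conductance matrix for a plain resistor chain and compare a dense solve against one that exploits the sparsity, as the node count grows.

```python
import time
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

def tridiagonal_system(n):
    """Nodal-style conductance matrix for a unit-resistor chain with grounded
    ends: n nodes, but only three non-zero diagonals."""
    g = diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")
    b = np.zeros(n)
    b[0] = 1.0                     # 1 A injected at the first node
    return g, b

for n in (500, 2_000, 6_000):
    g, b = tridiagonal_system(n)
    t0 = time.perf_counter(); spsolve(g, b); t_sparse = time.perf_counter() - t0
    t0 = time.perf_counter(); np.linalg.solve(g.toarray(), b); t_dense = time.perf_counter() - t0
    print(f"n={n:6d}  sparse {t_sparse*1e3:7.2f} ms   dense {t_dense*1e3:9.1f} ms")
```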

Unfortunately, these more optimised algorithms don't necessarily divide well. If you're not careful, you can spend so much time sifting and sorting data that the speedup you get from multiprocessing gets almost wiped out. So, something like Spice, which looks like a great candidate for multiprocessing, doesn't fare so well.
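
You can put rough numbers on that intuition with Amdahl's law plus a term for the time spent partitioning and merging the data – the fractions below are invented purely for illustration.

```python
def speedup(parallel_fraction, workers, overhead_fraction=0.0):
    """Amdahl's law with an extra term for the time spent splitting the
    problem up and merging the results, as a fraction of the original runtime."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers + overhead_fraction)

# A Calibre-style job: almost entirely parallel, negligible partitioning cost.
print(speedup(0.99, 64))                            # ~39x on 64 workers

# A Fast-Spice-style job: clever serial analysis, plus a real cost to sift
# and sort the data before it can be distributed.
print(speedup(0.70, 64, overhead_fraction=0.10))    # ~2.4x, and more workers barely help
```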

The results from academia on sparse-matrix acceleration using GPUs are good but not spectacular so far. People tend to report the same problems: memory bandwidth issues, limited cache memory on the GPUs themselves and the need to run thousands of threads in parallel to get any meaningful acceleration. People have reported speedups of maybe 2x, sometimes 10x, but not the 100x you might expect from taking something that runs on one processor to a graphics chip with 128 processors inside.
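
The bandwidth complaint is easy to quantify with a rough roofline-style estimate. The hardware figures below are ball-park numbers I have plugged in for a 2008-era GPU board, not measurements of any particular part.

```python
# Back-of-the-envelope bound on sparse matrix-vector multiply (CSR format).
# Per non-zero you do roughly one multiply-add (2 flops) but must fetch the
# value (4 bytes single precision), a column index (4 bytes) and a piece of
# the input vector (4 bytes, ignoring any cache reuse): ~12 bytes per 2 flops.
bytes_per_nonzero = 12.0
flops_per_nonzero = 2.0

peak_bandwidth_gb_s = 80.0     # ball-park memory bandwidth for a 2008-era GPU
peak_gflops = 350.0            # ball-park single-precision peak for the same part

bandwidth_bound_gflops = peak_bandwidth_gb_s / bytes_per_nonzero * flops_per_nonzero
print(f"bandwidth-limited SpMV: ~{bandwidth_bound_gflops:.0f} GFLOPS "
      f"out of a ~{peak_gflops:.0f} GFLOPS peak "
      f"({100 * bandwidth_bound_gflops / peak_gflops:.0f}% utilisation)")
```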

The release from Nascentric is a masterpiece of legerdemain in that respect - leading you to assume that you will see a 100:1 or even 500:1 level of acceleration, based on the number of SIMD processors you get in the hardware. And then there's the claim from John Croix, Nascentric founder and CTO: "Using nVidia's Tesla platform we can perform circuit simulations in minutes to hours that would previously have taken hours, days and weeks."

However, there are no numbers in the release to back this up. You may have a situation where regular Spice code gets a big speed boost, but does the same happen for the Fast-Spice algorithms? Bear in mind there is a lot of tweaking inside Fast-Spice: designers are invariably turning things off in the hope that their circuit doesn't depend on those elements. The detail on the numbers for each of those cases would be pretty revealing.

A second issue is one of numerical precision. I'm not sure how much codes like Spice depend on double-precision floating point maths but, on GPUs, you really only get single-precision. You are more vulnerable to underflow and overflow issues in code that iterates a lot. Scientists tend to worry about this a lot, although numerical analysis work by computer scientists is doing a lot to assuage those concerns. However, accuracy could be an issue with GPU acceleration in EDA.
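
Here is a trivial, contrived demonstration of a related problem – small updates being absorbed outright by a single-precision running total. It has nothing to do with Spice internals, but it shows why people get nervous about long iterative runs in 32-bit arithmetic.

```python
import numpy as np

# A long-running accumulation where each update is small relative to the total:
# in float32 the updates are simply absorbed once the total is large enough.
small_update = 1.0
for dtype in (np.float32, np.float64):
    total = dtype(1e8)
    for _ in range(1000):
        total = dtype(total + dtype(small_update))
    print(dtype.__name__, float(total))
# float32 prints 100000000.0, unchanged; float64 prints 100001000.0
```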

On the other hand, any company that looks at GPU acceleration is set up for the future. With OmegaSim GX, today you have to buy a separate accelerator. A few years from now that same code may simply harness the integrated GPU sitting on the AMD or Intel processor. It gives floating-point-intensive software a source of gigaflops in addition to extensions such as SSE2. Why not make use of it?

Personally, I reckon that hardware acceleration in EDA will rise to the surface for a while then disappear again as the general-purpose processor and PC-blade makers absorb those technologies - who knows, they might be a lot better at running technical codes than games. EDA companies would be wise to explore the FPGA and GPU options if only because elements of those products are likely to wind up inside the workstation and the blade. But take the speedup claims with a large dose of salt.

5 Comments

Chris - Your knowledge of EDA is insightful, and your take on the efficacy of hardware acceleration in EDA is right on! I can understand why you were not satisfied with the details in the press release (describing it as “legerdemain”). Press releases are hard to compose - they need to be generic enough to address the masses, and often the details are left out. Let me try to briefly explain what OmegaSim GX has accomplished and why we think there is a fundamental reason to believe this is decisively a huge accomplishment, thanks to NVIDIA’s GPU platform (admittedly, I am far from being impartial here).

First, you are right on with respect to Fast-SPICE techniques which, with mathematical tricks, have been able to run faster on larger designs, albeit with accuracy loss within 5% to 10% of SPICE. Our software-only solution in OmegaSim and OmegaSim MT has taken a different route: we address the triumvirate of performance, capacity and accuracy with our patented “army of ants” approach. We run much faster than competitive Fast-SPICE simulators; however, we still have inaccuracy within the range of 2% to 5% of SPICE. When we polled our customers last year, their feeling was that, for sub-65nm process nodes, even 2% to 5% inaccuracy was not good enough. They required a Fast-SPICE simulator performance/capacity profile and accuracy of within +/- 0.5% of SPICE. The result is our GPU-enabled OmegaSim GX.

In the first release of OmegaSim GX, we are not accelerating sparse-matrix calculations on the GPU as you conjectured. We profiled sub-65nm circuit simulations, and we found that when you run SPICE at its most accurate mode, transistor evaluations consume the bulk of total simulation time – anywhere from 80% to 90% of total run time. We ported this ‘low hanging fruit’ onto the GPU using CUDA. So, instead of deploying the “Fast-SPICE trick” of a simplified transistor model, we can now run the detailed BSIM transistor models in our GX option as-is and remove this huge transistor evaluation burden from the CPU. Our secret sauce is how we call and compute this detailed transistor evaluation without the communication overhead between the CPU and GPU. So, in effect, we get near-SPICE accuracy of within +/- 0.5% and an overall speedup of anywhere from 5X to 8X – limited by Amdahl’s law. We are looking to off-load other simulation tasks onto the GPU in the future, working closely with NVIDIA.

Second, with regard to your concern about numerical precision - we were also concerned about single-precision accuracy when we first embarked on this project. However, our BSIM implementation is different from that found in traditional SPICE (again, part of our "secret sauce"). Our equations are the standard BSIM equations, but our GPU implementation, even in single-precision, yields results that are within 5 significant digits of the double-precision results you achieve on the CPU. Having said that, NVIDIA has a roadmap that will address double-precision computation in their GPU in the not-too-distant future.

Finally, with regard to your thoughts on hardware acceleration having delivered marginal success in the past and your question concerning the long-term viability of a GPU implementation: I agree that historically, hardware-assisted verification tools in the EDA industry have had limited business success. I have personally had prior experience in hardware acceleration for another class of EDA simulators, Verilog simulators. At Tharas Systems, we did just that. The issue there was not that we couldn’t offer a compelling value to our customers; the issue was simply that the EDA industry lacks the economies of scale to amortize expensive, custom hardware development. With the GPU, Nascentric doesn’t have to develop, maintain and upgrade hardware every 18 to 24 months just to keep up with Moore’s law. NVIDIA will work on that and amortize the costs over the hundreds of millions of GPUs they ship each year. Their Tesla product line addresses various high-performance, server-class computing challenges (via a 1U form-factor) - EDA is just one of the industries they are targeting. We strongly believe that this is a new paradigm which EDA companies should take a serious look at. Tell me, when was the last time you saw EDA hardware for a 128-way parallel, floating-point engine that was listed at $1,299 and backed by the might of the largest fabless semiconductor company? I will guarantee you that even the BOM for any of the current EDA accelerator/emulator tools from Cadence, Mentor and others costs $100K or more! That doesn’t even account for the amortization of expensive mask costs every 18 to 24 months and the software infrastructure needed to port application software onto it. So, one can look at leveraging the GPU as revolutionary. This is way too cool!

hi Chris,
This is a fantastic post

Nik

@Nik, Thanks

@Rahm,

Thanks for the detailed comment. I was going to post an update after ploughing through the recording of the nVidia analyst meeting, where you explained using the GPU to accelerate BSIM, which all makes sense. But you beat me to it.

Rahm wrote:
|> So, instead of deploying the “Fast-SPICE trick”
|> of a simplified transistor model

This is far from a universal approach. HSIM, for instance, uses the same analytical equations as SPICE (depending on either the built-in analog circuit recognition engine, or a user-defined instance-specific config setting).

In addition, on what evidence is FastSPICE 5% to 10% off SPICE? There are even circumstances under which FastSPICE can resolve to a higher accuracy than SPICE (Mike Demler writes on this in his Analog Insights blog, so I won't repeat it here).

This is an excellent article!

Hello Chris,

FYI - I have posted the comments that I submitted here on my blog at http://synopsysoc.org/analoginsights/?p=61, and also provided a link back to your original post.

Regards,
Mike