Sapphyre is our in-house ecosystem for creating custom DSP cores. It builds on years of practical signal processing experience and proven silicon deployed in millions of devices.

The hardware implementation of high-performance, real-time DSP algorithms typically takes one of two forms: either an off-the-shelf Digital Signal Processor (DSP) running software, or a custom hardware pipeline synthesised on a Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC). The former is the more common method, and the workflow for these software DSP developments generally follows a three-step sequential process:

1) Modelling the algorithm to assess performance

2) Implementing the algorithm in software on the device

3) Optimising the software code using intrinsics or assembly

The real value for a project in terms of power and performance comes down to the last step. Here, the application can exploit the raw power of the specific architecture. The scope for optimising a given algorithm is therefore ultimately limited by the capability of the processing architecture. By the time the application software is being written, the hardware architecture is already frozen harder than a coffin nail. Need that one extra instruction that could save precious cycles in the middle of a tight loop? Tough!
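To put a rough number on what "one extra instruction" can mean, here is a minimal sketch of the cycle arithmetic for an FIR filter inner loop. The cycle costs, loop overhead and filter size are all illustrative assumptions for a toy single-issue processor, not figures for Sapphyre or any real device.

```python
# Hypothetical cycle costs for a toy single-issue DSP (assumed, not measured).
def fir_cycles(taps, samples, cycles_per_tap, loop_overhead=2):
    """Cycles needed to produce `samples` outputs from a `taps`-tap FIR filter."""
    return samples * (taps * cycles_per_tap + loop_overhead)

TAPS, SAMPLES = 64, 48_000  # assumed workload: one second of 48 kHz audio

# Without a MAC instruction each tap costs a multiply followed by an add.
without_mac = fir_cycles(TAPS, SAMPLES, cycles_per_tap=2)
# With a fused multiply-accumulate each tap costs a single cycle.
with_mac = fir_cycles(TAPS, SAMPLES, cycles_per_tap=1)

saving = 1 - with_mac / without_mac
print(f"without MAC: {without_mac} cycles, with MAC: {with_mac} cycles")
print(f"saving: {saving:.1%}")
```

Under these assumptions a single added instruction almost halves the cycle count of the loop, which is exactly the kind of gain that is out of reach once the architecture is fixed.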

So when the software DSP approach fails to meet real-time or low-power targets, product developers turn to the second approach: the hardware pipeline. This can often remove the constraints of any particular processor architecture. By building the algorithms directly into a digital hardware design, the exact memory, data routing, arithmetic and I/O can be customised. The big downside with this approach lies in the final implementation. An FPGA-targeted application is relatively power hungry and costly, and an ASIC can be expensive and slow to develop and verify. Furthermore, once an ASIC is developed, very little can be done to modify it with feature upgrades, variants or bug fixes.

Sapphyre bridges the gap between fast development and performance. At its heart, Sapphyre is a Very Long Instruction Word (VLIW) DSP with numerous execution units (arithmetic, MAC, memory, etc.) and routing connections. This approach enables extremely fast execution, even for deeply pipelined operations. The number and types of execution units can be easily modified by the software developer, with the rest of the toolchain reflecting these changes automatically. This has the additional benefit of allowing fast re-use of execution units (which have already been implemented and verified) in new core designs.

The entire toolchain is designed to be extraordinarily flexible, all the way from the cycle-accurate simulator and the non-intrusive debugging (more on this to follow in future blogs) to the assembler and the resulting hardware design. This tight coupling facilitates rapid algorithm prototyping and allows new instructions, data routing and multi-core ideas to be developed in tandem.

This flips the traditional DSP design workflow on its head, allowing software teams to lead the hardware design. No longer is the software DSP developer constrained by processor architecture. If the software developer sees that an extra ALU would make the code more efficient, they can update the design and try it straight away.
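The effect of adding an execution unit can be illustrated with a toy model. The sketch below is not the Sapphyre toolchain: the `schedule` function, the unit names and the operation list are all hypothetical, and every operation is assumed to take one cycle. It greedily packs independent operations into one VLIW bundle per cycle, limited by how many units of each type the core configuration provides.

```python
from collections import Counter

def schedule(ops, units):
    """Greedy list scheduler: pack ready ops into one VLIW bundle per cycle.

    ops:   {name: (unit_type, [names of ops it depends on])}
    units: {unit_type: number of such units in the core}
    Assumes every op completes in a single cycle.
    """
    done, bundles = set(), []
    pending = dict(ops)
    while pending:
        free = Counter(units)  # units available in this cycle
        bundle = []
        for name, (unit, deps) in pending.items():
            # Issue only if a unit is free and all inputs come from earlier cycles.
            if free[unit] > 0 and all(d in done for d in deps):
                free[unit] -= 1
                bundle.append(name)
        if not bundle:
            raise ValueError("unschedulable: missing unit or cyclic dependency")
        for name in bundle:
            del pending[name]
        done.update(bundle)
        bundles.append(bundle)
    return bundles

# Hypothetical kernel: (a + b) * (a - b), loaded from and stored to memory.
OPS = {
    "ld_a": ("mem", []),
    "ld_b": ("mem", []),
    "sum":  ("alu", ["ld_a", "ld_b"]),
    "diff": ("alu", ["ld_a", "ld_b"]),
    "prod": ("mac", ["sum", "diff"]),
    "st":   ("mem", ["prod"]),
}

BASE = {"mem": 2, "alu": 1, "mac": 1}
WIDER = {"mem": 2, "alu": 2, "mac": 1}  # the same core with one extra ALU

for label, cfg in (("base", BASE), ("extra ALU", WIDER)):
    bundles = schedule(OPS, cfg)
    print(f"{label}: {len(bundles)} cycles -> {bundles}")
```

In this toy case the second ALU lets `sum` and `diff` issue in the same bundle, cutting the kernel from five cycles to four. A real toolchain also has to account for operation latency, register pressure and routing, none of which are modelled here.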

With Sapphyre, the time required to prototype a new algorithm (or hardware acceleration of an existing one) is reduced to mere hours, and a complete chip design to mere months. Furthermore, the toolchain has been designed to make use of continuous integration systems to automatically verify code and hardware together. By running all of the code for the target application on both the software simulator and the hardware RTL simulator, the chance of an edge-case bug slipping into production silicon is significantly reduced. With the cost of a tape-out on an advanced silicon node reaching into the millions of dollars, getting the chip right first time is more important than ever.
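The idea of checking the software simulator against the RTL simulator can be sketched as a simple comparison harness. Everything here is a stand-in: `sim_add` and `rtl_add` are two hypothetical, independently written models of a saturating 16-bit add, playing the roles of the two simulators. A real flow would run the actual application code on both and compare results in a similar way.

```python
import random

def sat16(x):
    """Saturate an integer to the signed 16-bit range."""
    return max(-32768, min(32767, x))

def sim_add(a, b):
    """Stand-in for the software simulator: saturating 16-bit add."""
    return sat16(a + b)

def rtl_add(a, b):
    """Stand-in for the RTL simulator: the same add, written the way
    hardware would do it, with a 17-bit adder and overflow detection."""
    raw = (a + b) & 0x1FFFF                  # 17-bit two's-complement sum
    sign, msb = (raw >> 16) & 1, (raw >> 15) & 1
    if sign != msb:                          # top two bits differ: overflow
        return -32768 if sign else 32767
    value = raw & 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def cosim(vectors):
    """Return every stimulus on which the two models disagree."""
    return [(a, b) for a, b in vectors if sim_add(a, b) != rtl_add(a, b)]

# Directed edge cases plus seeded random stimulus, so runs are repeatable.
EDGES = [-32768, -1, 0, 1, 32767]
rng = random.Random(0)
vectors = [(a, b) for a in EDGES for b in EDGES]
vectors += [(rng.randint(-32768, 32767), rng.randint(-32768, 32767))
            for _ in range(1000)]

mismatches = cosim(vectors)
print(f"{len(vectors)} vectors, {len(mismatches)} mismatches")
```

In a continuous integration job, a non-empty `mismatches` list would fail the build, flagging the disagreement long before tape-out.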

Because the Sapphyre design is still a programmable device, the inflexibility of a traditional ASIC design is alleviated. The integrated approach also means that the final code base is more closely matched to the platform than with traditional DSPs, and the final chip is more flexible than a fixed hardware pipeline. The result is a system with an accurately tailored instruction set, but with enough room for the software running on it to grow. Furthermore, with the Sapphyre approach the best results come from developing software and hardware in parallel, which also reduces risks and development timescales.

The next blog in this series will explain how the resultant cores also achieve remarkably low power.

Brendan Gillatt
Principal AI & DSP Engineer, Edge AI Technology Lead

Brendan is a Principal Engineer at Cambridge Consultants, with a strong focus on digital signal processing and Edge AI. He has been a team lead for a number of communications, audio and Edge AI projects. Part of this work has been in partnership with Arm to evaluate and utilise the new U55 hardware and tools.

Additional areas of research interest include remote sensing, computer vision, and computer architecture. Brendan has a demonstrated record of developing low power, high performance embedded systems for clients ranging from start-ups to multinationals.