Can a slow-running platform be quick?

Most people have heard of Moore’s Law. Strictly, it says that the number of transistors on an integrated circuit doubles roughly every two years; in popular usage it has come to mean that the density and processing speed of integrated circuits will keep increasing as semiconductor technology advances. This is appealing for low-power design, since higher speed means that computation completes quickly and systems can spend more time idle. Moore’s Law surely results in more computation for lower power, so everybody wins. Well, not quite.

The problem with this chain of reasoning is that achieving the fast clock rates associated with Moore’s Law requires deep pipelines. This gives the processor architecture a much higher ratio of control logic to mathematical computation units. Also, whilst specific calculations may be able to keep the pipeline full, real-world algorithms cause pipeline stalls, either because the memory can’t keep up or because there are conditional dependencies in the code flow. This is a double hit for low-power designs: there is a lot of logic that doesn’t contribute to the mathematical computation, and it is stalled for large periods of time. As a result, using a high-speed processor often costs more than 10x the power consumption of optimised hardware.
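The effect of stalls can be estimated with the standard textbook cycles-per-instruction model. The sketch below is illustrative only: the function name and every fraction, rate and penalty are hypothetical values chosen for the example, not measurements of any real processor.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, branch_penalty,
                  memory_fraction, miss_rate, miss_penalty):
    """Average cycles per instruction once stall cycles are included."""
    branch_stalls = branch_fraction * mispredict_rate * branch_penalty
    memory_stalls = memory_fraction * miss_rate * miss_penalty
    return base_cpi + branch_stalls + memory_stalls


# Hypothetical workload: 20% branches, 30% memory accesses.
# The deeper, faster-clocked pipeline is modelled as a larger branch penalty
# and a larger memory penalty when both are measured in processor cycles.
shallow = effective_cpi(1.0, 0.2, 0.1, 3, 0.3, 0.05, 20)
deep = effective_cpi(1.0, 0.2, 0.1, 15, 0.3, 0.05, 100)

for name, cpi in (("shallow", shallow), ("deep", deep)):
    # 1 / CPI is the fraction of cycles in which an instruction completes.
    print(f"{name}: CPI = {cpi:.2f}, useful cycles = {1.0 / cpi:.0%}")
```

Under these assumed numbers the deeper pipeline spends well over half its cycles stalled, which is the "double hit" described above: more control logic, doing useful work less of the time.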

In many architectures the constraints on data routing and control logic make it unlikely that the full computing capability of the processor can be realised. For example, a processor may allow a multiply unit calculating A x B = C to clock fast, but if you can’t get A and B to the multiply unit, and C back out of it, quickly enough, then the data flow becomes the bottleneck and your application can’t take advantage of the increased clock speed. Even Digital Signal Processors (DSPs), which are designed for computational throughput, suffer from this imbalance. The result is a DSP that is capable of performing 100 GMACs in certain circumstances, but that delivers only 10 GMACs when you try to use it in your particular application. It must be said that this is not true of all processors, but it’s a problem we regularly encounter.
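A rough way to reason about this imbalance is to cap the sustained rate at whichever is slower, the arithmetic or the data supply. The following is a minimal sketch under stated assumptions: the 100 GMAC peak is the figure quoted above, while the memory bandwidth and the count of three word transfers per operation (read A, read B, write C) are hypothetical values picked to illustrate the point.

```python
def sustained_ops_per_second(peak_ops_per_s, words_per_s, words_per_op):
    """Sustained rate is limited by the slower of compute and data supply."""
    data_limited_rate = words_per_s / words_per_op
    return min(peak_ops_per_s, data_limited_rate)


PEAK = 100e9             # peak rate quoted in the text (100 GMACs)
MEMORY_WORDS = 30e9      # hypothetical: words the memory system moves per second
WORDS_PER_OP = 3         # assumption: read A, read B, write C

sustained = sustained_ops_per_second(PEAK, MEMORY_WORDS, WORDS_PER_OP)
utilisation = sustained / PEAK
print(f"sustained = {sustained / 1e9:.0f} G ops/s, utilisation = {utilisation:.0%}")
```

With these assumed figures the arithmetic units are idle for most of the time, and clocking them faster would not change the sustained rate at all.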

Here at Cambridge Consultants we have developed the Sapphyre technology for creating DSP cores that can be used for silicon IP blocks, FPGA solutions and custom ASICs. The Sapphyre DSP cores have an architecture which tackles the problem of bottlenecks by taking an alternative approach to balancing the DSP. Instead of clocking the processing elements as fast as possible, Sapphyre matches the processor execution speed to that of the data flow (RAM and I/O speed). For example, a traditional processor might have multiply hardware that can perform the operation A x B = C in a single cycle. A balanced core design would also include the capability to read A, read B, write C and calculate the next addresses for all three in that same cycle. As a result the Sapphyre solution can genuinely perform A x B = C efficiently on every single cycle.
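The difference that balancing makes can be illustrated with a simple steady-state cycle count. This is a hypothetical model, not a description of the Sapphyre hardware: it assumes a pipelined loop, one access per memory port per cycle, and address updates that happen in parallel at no extra cost.

```python
def cycles_per_result(memory_accesses, memory_ports, multiply_cycles=1):
    """Steady-state cycles per C = A x B result in a pipelined loop."""
    memory_cycles = -(-memory_accesses // memory_ports)  # ceiling division
    # The slower of the two resources sets the loop rate.
    return max(memory_cycles, multiply_cycles)


ACCESSES = 3  # read A, read B, write C

single_port = cycles_per_result(ACCESSES, memory_ports=1)
balanced = cycles_per_result(ACCESSES, memory_ports=3)

for name, cycles in (("single port", single_port), ("balanced", balanced)):
    # A single-cycle multiplier is busy for one cycle out of every 'cycles'.
    print(f"{name}: {cycles} cycle(s) per result, multiplier busy {1 / cycles:.0%}")
```

In this model the single-cycle multiplier only produces a result every cycle when all three accesses can also be served in that cycle, which is the balance described above.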

Instead of the traditional approach of clocking the processor at multiple times the access speed of the memory, Sapphyre cores run at or below the memory’s clock speed. This results in uncontested, low latency RAM access, where reads and writes always complete in a single processor cycle; there are no wait states and no caching.
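The link between clock ratio and wait states can be sketched as follows. The model is an assumption made for illustration: it treats a memory access as taking exactly one memory clock period, uses whole-number clock rates in MHz, and the example frequencies are hypothetical.

```python
def wait_states(core_mhz, memory_mhz):
    """Extra core cycles spent waiting for one memory access to complete."""
    cycles_per_access = -(-core_mhz // memory_mhz)  # ceiling division
    return cycles_per_access - 1


# Hypothetical clock rates in MHz.
print(wait_states(1000, 250))  # core clocked at 4x the memory
print(wait_states(250, 250))   # core clocked at the memory speed
print(wait_states(200, 250))   # core clocked below the memory speed
```

Only the first case incurs wait states in this model; once the core runs at or below the memory clock, every access fits inside a single processor cycle.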

Having got data to the execution units without any delay, those units may have to perform complex or multiple operations to achieve the required processing in real time. Sapphyre DSPs are VLIW cores which can instruct many execution units in parallel, and the lower clock speed makes it more realistic to complete complex operations in a single clock cycle. As a result the output of any Sapphyre DSP module can be used as the input to another module on the next cycle, giving a flexible architecture with no fixed data pipelines. Sapphyre DSP cores can therefore be programmed to perform data processing in incredibly tight loops, with high utilisation of both the modules in the design and the data they are processing.

This approach also avoids the deep processor pipelines of high speed processors that make high speed conditional code so difficult. The flow of data is therefore easier to work with and developers can genuinely use the computational capabilities of the DSP cores.

Together, these architectural choices create a DSP which is low power, highly efficient and flexible, and whose core design can be customised according to the task it is performing. This enables PC peripherals to run off coin cell batteries for years and complex satellite modems to be reduced to the size of a box of matches.

The next blog in this series will discuss how to debug all those parallel modules and know that they are correct in a real time system.

Author
Matthew Taylor
Senior DSP engineer