Overview

Embedded systems demand maximum processor performance under tight constraints on power consumption and chip cost. Using software pipelining, high-performance digital signal processors (DSPs) can often exploit considerable instruction-level parallelism (ILP) and thus significantly improve performance. However, software pipelining sometimes fails to utilize the processor efficiently and, in some instances, hinders the goals of low power consumption and low chip cost. Specifically, the following two problems exist:

  1. The parallelism available in the innermost loops of many applications is not enough to fully utilize the parallelism provided by the processor.

  2. The registers required by a software pipelined loop may exceed the size of the physical register set.

The first problem wastes processor resources and increases the energy a loop requires, because idle functional units still draw leakage power. The second problem makes it difficult to build a high-performance embedded processor with a single, multi-ported register file that has enough registers to support high levels of ILP while maintaining clock speed and limiting power consumption. The large number of ports required to support a single register file severely hampers access time. The port requirement can be reduced in hardware by partitioning the register file into multiple banks, each connected to a disjoint subset of the functional units; these subsets are called clusters. Because a functional unit is not directly connected to every register bank, accessing a "non-local" register incurs a delay, which can waste energy and resources.

The above problems can be ameliorated by using advanced compiler loop optimization techniques. The utilization of the available functional units can be increased by using high-level loop transformations such as unroll-and-jam to increase inner-loop parallelism. High-level loop optimizations can also be used to spread across clusters data-independent parallelism that requires no "non-local" register accesses, and to provide work that hides the latency of any "non-local" register accesses that are still needed.
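As a minimal sketch of what unroll-and-jam does, consider a matrix-vector product in C. The kernel, the function names matvec and matvec_uj2, and the unroll factor of 2 are illustrative assumptions, not taken from this project; the sketch also assumes that y does not alias a or x.

```c
#include <stddef.h>

/* Original nest: each inner iteration performs one multiply-add that
   depends on the previous value of y[i], so the inner loop carries a
   single serial recurrence. */
void matvec(size_t n, size_t m, const double *a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            y[i] += a[i * m + j] * x[j];
}

/* Unroll-and-jam by 2: the outer loop is unrolled twice and the two
   copies of the inner loop are fused ("jammed") into one body. */
void matvec_uj2(size_t n, size_t m, const double *a, const double *x, double *y)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        for (size_t j = 0; j < m; j++) {
            /* Two independent multiply-adds per inner iteration;
               both use the same x[j]. */
            y[i]     += a[i * m + j]       * x[j];
            y[i + 1] += a[(i + 1) * m + j] * x[j];
        }
    }
    /* Remainder loop for odd n. */
    for (; i < n; i++)
        for (size_t j = 0; j < m; j++)
            y[i] += a[i * m + j] * x[j];
}
```

The jammed inner loop contains two recurrences that do not depend on each other, so a software pipeliner has more independent work to schedule per iteration, and on a clustered machine each recurrence could in principle be placed on a different cluster. The price is that more values are live at once, which is the register-pressure concern raised in the second problem.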

Current methods for applying loop transformations to solve the above problems are lacking. Metrics for applying loop transformations do not model high-performance DSP architectures and the effects of software pipelining effectively. In addition, optimization strategies for partitioned register banks are ad hoc. To date, no comprehensive loop optimization model has been developed for architectures with partitioned register banks.

This research will address the above problems by developing and experimentally validating the following:

  1. A performance metric that accurately models software-pipelined loop performance on high-performance DSP architectures. This includes accurate modeling of vector (SIMD) operations and the effects of the copies introduced by partitioned register banks.

  2. A prediction of the register pressure of a software-pipelined loop before high-level loop transformations are applied. This includes predicting, for both partitioned and traditional register files, the effects of loop fusion, scalar replacement, and unroll-and-jam on register pressure before these transformations are applied.
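To illustrate why such transformations affect register pressure, here is a minimal sketch of scalar replacement on a three-point sum in C. The kernel and the names smooth and smooth_sr are hypothetical, chosen only for illustration, and the sketch assumes that a and b do not overlap.

```c
#include <stddef.h>

/* Before: every iteration loads a[i-1], a[i], and a[i+1] from memory,
   although two of those values were already loaded one iteration ago. */
void smooth(size_t n, const double *a, double *b)
{
    for (size_t i = 1; i + 1 < n; i++)
        b[i] = a[i - 1] + a[i] + a[i + 1];
}

/* After scalar replacement: reused array elements are kept in scalars,
   leaving one load per iteration. */
void smooth_sr(size_t n, const double *a, double *b)
{
    if (n < 3)
        return;                       /* no interior points to compute */
    double prev = a[0], cur = a[1];   /* values carried across iterations */
    for (size_t i = 1; i + 1 < n; i++) {
        double next = a[i + 1];       /* the only load in the loop body */
        b[i] = prev + cur + next;
        prev = cur;                   /* rotate the scalars */
        cur = next;
    }
}
```

The transformed loop trades two loads per iteration for two scalars (prev and cur) that stay live across iterations and would normally occupy registers. A compiler that can estimate this extra demand before transforming the loop can avoid choices that would overflow the register file.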

As a result of this research, more ILP will be exploited in DSP applications, increasing performance and reducing the overall energy required to execute an application. Improvements in performance and energy usage will, in turn, allow better and more computationally expensive algorithms to be used in embedded systems.

This project is supported by the National Science Foundation under grant numbers CCR-9870871 and CCR-0209036.
