**System Level Power-Performance Trade-Offs in Embedded Systems Using Voltage and Frequency Scaling of Off-Chip Buses and Memory**

Kiran Puttaswamy<sup>1</sup>, Kyu-Won Choi<sup>1</sup>, Jun Cheol Park<sup>1</sup>, Vincent J. Mooney III<sup>1,2</sup>, Abhijit Chatterjee<sup>1,3</sup> and Peeter Ellervee<sup>4</sup>

{kiranp, kwchoi, jcpark, chat, mooney}@ece.gatech.edu, Irv@cc.ttu.ee

<sup>1</sup>Center for Research on Embedded Systems and Technology (CREST), http://crest.ece.gatech.edu, Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA

<sup>2</sup>Hardware/Software Codesign Group, http://codesign.ece.gatech.edu; Assistant Professor, Electrical and Computer Engineering; Adjunct Assistant Professor, College of Computing

<sup>3</sup>Professor, Electrical and Computer Engineering

<sup>4</sup>Tallinn Technical University, Tallinn, Estonia

October 2002






### Overview

- Introduction
- Motivation
- Contribution
- Framework
- Methodology
- Results







## Introduction



• Embedded systems: essential components of living

• Constraining Factor: Power









# Motivation

• Limited Battery Capacity



• Battery Energy Supplying Characteristic: 10 mA at 1.5 V lasts 1000 hours; 100 mA at 1.5 V lasts 80 hours
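The two operating points imply that the battery delivers less total energy at the higher current draw, which is why reducing power (and not just energy per task) matters. A minimal sketch of that arithmetic follows; the function and variable names are illustrative and do not come from the slides.

```python
def delivered_energy_wh(current_ma: float, voltage_v: float, lifetime_h: float) -> float:
    """Total energy delivered before the battery is exhausted, in watt-hours."""
    return (current_ma / 1000.0) * voltage_v * lifetime_h


# Operating points quoted on the slide.
light_load_wh = delivered_energy_wh(10, 1.5, 1000)
heavy_load_wh = delivered_energy_wh(100, 1.5, 80)

# Fraction of the light-load energy that is lost at the heavier load.
lost_pct = 100.0 * (light_load_wh - heavy_load_wh) / light_load_wh

print(f"light load: {light_load_wh:.1f} Wh")
print(f"heavy load: {heavy_load_wh:.1f} Wh")
print(f"energy lost at heavy load: {lost_pct:.1f} %")
```

With these figures, a tenfold increase in current shortens battery life by a factor of 12.5, and the delivered energy drops from about 15 Wh to about 12 Wh.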





## **Previous Work**

- Three broad approaches to memory optimization for power/energy reduction
  - Cache optimizations
  - Memory access reduction (especially of off-chip memory)
  - Memory sizing/structuring and memory intensive voltage scaling





# **Our Contribution**

- Combination of an architectural technique (store buffer) and a circuit-level technique (voltage and frequency scaling) to realize savings in both power and energy in an embedded system composed of an ARM-like processor chip plus a separate memory chip
- System savings in power from 28% to 36%
- System savings in energy from 13% to 35%





## **Computation Part of an Embedded System**



ISSS





## **Power Models**

- Verilog RTL model for processor (excluding caches)
- Compaq Personal Server PCB called "Skiff"
- Analytical memory model for caches and off-chip memory



## Framework







# Whither the power?

- Computation in system
  - MARS processor (U. Michigan, www.eecs.umich.edu/~jringenb/power)
    - ~30K lines Verilog
      - synthesized using TSMC 0.25 µm standard cell library from LEDA Systems
    - 4KB Icache, 4KB Dcache
  - 0.5MB SRAM memory chip (L2)
- Approximately 50% of the power is consumed by the processor chip (excluding I/O pads and drivers)
- The other 50% of the power is consumed to drive L2 memory: the 0.5MB memory chip + PCB bus + I/O pads/drivers
- Therefore, reducing the power to drive L2 memory by 60% reduces system power by 30%
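The 30% figure follows from weighting the L2 memory saving by its share of system power. A small sketch of that calculation, assuming the processor chip's power is unchanged (the function name is hypothetical):

```python
def system_power_reduction(memory_share: float, memory_reduction: float) -> float:
    """Fractional system power saving when only the L2 memory path is improved.

    memory_share:     fraction of system power spent driving L2 memory
    memory_reduction: fractional power reduction achieved on that path
    Assumes the remaining (processor) power is unchanged.
    """
    return memory_share * memory_reduction


saving = system_power_reduction(memory_share=0.5, memory_reduction=0.6)
print(f"system power saving: {saving * 100:.0f} %")
```

The same relation shows the limit of the approach: with a 50% share, even eliminating the L2 memory path power entirely could save at most half of the system power.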









## **Embedded System**








## **Embedded System (with Store Buffer)**








# Methodology

- Voltage/frequency scaling of L2 memory accesses
- Store buffer technique
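As a first-order illustration of why voltage and frequency are scaled together, the standard CMOS dynamic power relation P = a·C·V²·f can be evaluated for the two operating points used in the results (100 MHz at 3.3 V and 50 MHz at 2 V). This is a textbook approximation with activity factor and capacitance assumed unchanged, not the power model used by the authors, and the names below are illustrative.

```python
def dynamic_power_ratio(v_new: float, f_new: float, v_old: float, f_old: float) -> float:
    """Ratio P_new / P_old under P = a * C * V^2 * f, with a and C held constant."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Off-chip bus and memory operating points from the results tables.
power_ratio = dynamic_power_ratio(v_new=2.0, f_new=50e6, v_old=3.3, f_old=100e6)

# Energy per switching event depends only on the voltage term.
energy_per_transition_ratio = (2.0 / 3.3) ** 2

print(f"dynamic power ratio:          {power_ratio:.3f}")
print(f"energy per transition ratio:  {energy_per_transition_ratio:.3f}")
```

Under this model, halving the frequency alone would halve power but leave the energy per access unchanged; the quadratic voltage term is what lowers energy as well.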







## **Voltage/Frequency Scaling**







| benchmark | Executable size (kB) | Dynamic instruction count | Input data size   | Data cache accesses | Data cache misses | Data cache miss % |
|-----------|----------------------|---------------------------|-------------------|---------------------|-------------------|-------------------|
| bubble    | 34.852               | 7503                      | 50 integers array | 1675                | 107               | 6.39              |
| factorial | 34.634               | 6033                      | 1 integer         | 2006                | 250               | 12.46             |
| fib       | 34.651               | 30602                     | 1 integer         | 11840               | 262               | 2.21              |
| matmul    | 34.857               | 21642                     | 0.5 kB            | 7358                | 4916              | 66.81             |
| sort_int  | 34.763               | 23171                     | 0.5 kB            | 7808                | 104               | 1.33              |

**Table 2: Execution Statistics for Various Benchmarks** 

Operating points (100 MHz, 3.3 V and 50 MHz, 2 V) refer to the off-chip bus and memory.

| Benchmark | CPU+cache (W), 100 MHz, 3.3 V | Bus (mW), 100 MHz, 3.3 V | L2 memory (mW), 100 MHz, 3.3 V | Total (W), 100 MHz, 3.3 V | CPU+cache (W), 50 MHz, 2 V | Bus (mW), 50 MHz, 2 V | L2 memory (mW), 50 MHz, 2 V | Total (W), 50 MHz, 2 V | Improvement in total power (%) |
|-----------|-------------------------------|--------------------------|--------------------------------|---------------------------|----------------------------|-----------------------|-----------------------------|------------------------|--------------------------------|
| bubble    | 1.24                          | 301.64                   | 1276.49                        | 2.817                     | 1.22                       | 96.14                 | 541.08                      | 1.857                  | 34.07                          |
| factorial | 1.18                          | 444.35                   | 1236.96                        | 2.861                     | 1.15                       | 93.16                 | 797.08                      | 2.040                  | 28.69                          |
| fib       | 1.25                          | 287.68                   | 1228.23                        | 2.766                     | 1.25                       | 92.50                 | 516.06                      | 1.859                  | 32.79                          |
| matmul    | 1.07                          | 637.48                   | 1713.34                        | 3.421                     | 1.04                       | 129.04                | 1143.51                     | 2.313                  | 32.39                          |
| sort_int  | 1.27                          | 336.78                   | 1485.92                        | 3.093                     | 1.27                       | 111.91                | 604.11                      | 1.986                  | 35.79                          |

**Table 3: System Level Power Estimates** 
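The totals and improvement percentages in Table 3 can be reproduced from the component columns. The sketch below does so for the bubble row; the helper names are hypothetical, and the results agree with the table to within rounding.

```python
def total_power_w(cpu_cache_w: float, bus_mw: float, l2_mw: float) -> float:
    """System power in watts; bus and L2 memory figures are given in milliwatts."""
    return cpu_cache_w + (bus_mw + l2_mw) / 1000.0


def improvement_pct(before_w: float, after_w: float) -> float:
    """Percentage reduction in power relative to the baseline."""
    return 100.0 * (before_w - after_w) / before_w


# bubble row of Table 3.
bubble_before_w = total_power_w(1.24, 301.64, 1276.49)  # bus/memory at 100 MHz, 3.3 V
bubble_after_w = total_power_w(1.22, 96.14, 541.08)     # bus/memory at 50 MHz, 2 V
bubble_gain_pct = improvement_pct(bubble_before_w, bubble_after_w)

print(f"before: {bubble_before_w:.3f} W")
print(f"after:  {bubble_after_w:.3f} W")
print(f"improvement: {bubble_gain_pct:.1f} %")
```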





Operating points (100 MHz, 3.3 V and 50 MHz, 2 V) refer to the off-chip bus and memory.

| Benchmark | Execution time (µs), 100 MHz, 3.3 V | Power (W), 100 MHz, 3.3 V | Energy (mJ), 100 MHz, 3.3 V | Execution time (µs), 50 MHz, 2 V | Power (W), 50 MHz, 2 V | Energy (mJ), 50 MHz, 2 V | Execution time increase (%) | Energy decrease (%) |
|-----------|-------------------------------------|---------------------------|-----------------------------|----------------------------------|------------------------|--------------------------|-----------------------------|---------------------|
| bubble    | 113.945                             | 2.817                     | 0.321                       | 122.265                          | 1.857                  | 0.227                    | 7.3                         | 29.3                |
| factorial | 116.115                             | 2.861                     | 0.332                       | 129.325                          | 2.040                  | 0.264                    | 11.37                       | 20.48               |
| fib       | 456.795                             | 2.766                     | 1.263                       | 463.245                          | 1.859                  | 0.861                    | 1.4                         | 31.83               |
| matmul    | 924.735                             | 3.421                     | 3.164                       | 1192.98                          | 2.313                  | 2.759                    | 29.0                        | 12.8                |
| sort_int  | 296.425                             | 3.093                     | 0.917                       | 300.265                          | 1.986                  | 0.596                    | 1.29                        | 35.0                |

**Table 4: System Level Design Space Exploration**
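Energy in Table 4 is the product of average power and execution time, so a large slowdown can offset much of the power saving. The sketch below illustrates this for the matmul row, which has the highest data cache miss rate; the helper names are hypothetical, and the results match the table to within rounding.

```python
def energy_mj(time_us: float, power_w: float) -> float:
    """Energy in millijoules: microseconds * watts = microjoules, then / 1000."""
    return time_us * power_w / 1000.0


# matmul row of Table 4.
base_time_us, base_power_w = 924.735, 3.421      # bus/memory at 100 MHz, 3.3 V
scaled_time_us, scaled_power_w = 1192.98, 2.313  # bus/memory at 50 MHz, 2 V

base_energy_mj = energy_mj(base_time_us, base_power_w)
scaled_energy_mj = energy_mj(scaled_time_us, scaled_power_w)

time_increase_pct = 100.0 * (scaled_time_us - base_time_us) / base_time_us
energy_decrease_pct = 100.0 * (base_energy_mj - scaled_energy_mj) / base_energy_mj

print(f"energy: {base_energy_mj:.3f} mJ -> {scaled_energy_mj:.3f} mJ")
print(f"execution time increase: {time_increase_pct:.1f} %")
print(f"energy decrease: {energy_decrease_pct:.1f} %")
```

For matmul, power falls by about 32% but execution time rises by about 29%, leaving an energy saving of only about 13%. Benchmarks with few misses, such as sort_int, keep nearly all of the power saving as an energy saving.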











# Conclusion

- Reduction in both power and energy for an ARM-like processor chip plus a separate memory chip
  - System savings in power from 28% to 36%
  - System savings in energy from 13% to 35%
- Cost: increase in execution time from 1% to 29%
- Possible technique for power modulation by user/application