Exploring the performance and power management of a real-time video application

Problem statement

The requirement was to select a hardware platform for an MPEG video application that optimizes the application's performance and power usage. The application receives data from an input sensor, for example a light sensor. After processing, the output is displayed on the camera itself or on a connected TV.

The hardware platform was expected to let the application process 13000 matrices in 20 milliseconds while consuming less than 1 watt of power.

Model construction

We proposed to select the hardware platform by constructing a model from the VisualSim building blocks. The model consists of two parts. The first is the architecture, the top portion of the model, which contains the processors, the buses, the internal memory, the external memory, and a hardware accelerator.

The lower part represents the behavior. It describes how the data received from the light sensor is processed and the sequence of operations involved. This sequence of operations is mapped to the hardware platform. Note that the lower portion does not execute; it only provides the details of the operations. In addition, a power table maintains the power consumed by each device in the hardware platform.

Initially, when we run the model, all the functions shown in the green blocks at the bottom are mapped to the processor. In VisualSim, we do not run the real C code; instead, we emulate it by generating a sequence of instructions. Each green block contains an instruction generator. Based on a profile for that particular function, the instruction generator produces a sequence of instructions that emulates what the real application would be doing, for example image rotate, pre-processing, or post-processing. The instructions are then sent to the processor block in the hardware architecture for processing.
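The idea of a profile-driven instruction generator can be illustrated with a minimal Python sketch. This is not VisualSim code or its API: the function name generate_instructions, the instruction categories, and the mix fractions are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

# Hypothetical instruction-mix profile for one function.
# The fractions are illustrative only and are not taken from the model.
IMAGE_ROTATE_PROFILE = {"int": 0.50, "load": 0.25, "store": 0.15, "branch": 0.10}


def generate_instructions(profile, count, seed=0):
    """Return `count` instruction types drawn at random according to `profile`."""
    rng = random.Random(seed)  # seeded so repeated runs give the same trace
    kinds = list(profile)
    weights = [profile[kind] for kind in kinds]
    return rng.choices(kinds, weights=weights, k=count)


# Emulate 1000 instructions of the function and summarize the resulting mix.
trace = generate_instructions(IMAGE_ROTATE_PROFILE, 1000)
mix = Counter(trace)
print(mix)
```

A trace like this stands in for the real code: the processor model only needs to know what kind of instruction arrives next, not what the instruction computes.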

Types of analysis

The goal of this exercise is to ensure that the application processes 13000 matrices per second and consumes less than 1 watt of power. To start with, we ran all the functions in software. All seven or eight functions, from the sensor to the display, were implemented in software. We found that although the power usage was below 1 watt, the performance did not meet the criterion of 13000 matrices per second.

After further analysis, we found that one of the functions, image rotate, was consuming most of the CPU cycles. Hence, we performed a hardware/software partitioning and moved the image rotate function to a hardware accelerator. All other functions continued to execute in software. When we ran this simulation, the performance criterion of 13000 matrices was met, but the power usage was 1.5 watts.
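The step of picking a candidate for acceleration can be sketched as follows. The function names and cycle counts below are hypothetical placeholders, not measurements from the model; only the fact that image rotate dominates comes from the analysis above.

```python
# Hypothetical per-function CPU cycle counts from an all-software run.
# These values are placeholders, not results from the model.
cycles = {
    "pre_processing": 120_000,
    "image_rotate": 900_000,
    "post_processing": 150_000,
    "display": 80_000,
}


def acceleration_candidate(cycles_by_function):
    """Return (name, share) for the function that uses the most CPU cycles."""
    total = sum(cycles_by_function.values())
    name = max(cycles_by_function, key=cycles_by_function.get)
    return name, cycles_by_function[name] / total


name, share = acceleration_candidate(cycles)
print(f"{name} uses {share:.0%} of the CPU cycles")
```

The function with the largest share of cycles is the natural first candidate to move to hardware, since offloading it frees the most processor time.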

Since the power was below 1 watt when all the functions ran in software, the increase in power was attributed to the hardware accelerator. We opened the power management block and changed a parameter so that the hardware accelerator is turned off when it is not in use. The performance criterion was still met, because the hardware accelerator continued to process the image rotate function. In addition, because of the change in the power management table, the power used dropped to less than 1 watt.
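The effect of turning the accelerator off when idle can be approximated with a simple time-averaged power calculation. This is a rough sketch under stated assumptions: the base power (0.8 W), accelerator power (0.7 W), and duty cycle (20%) are invented values, picked only so that the always-on case lands at the 1.5 watts reported above; they are not taken from the model's power table.

```python
def average_power(base_w, accel_active_w, accel_idle_w, duty_cycle):
    """Time-averaged platform power in watts.

    duty_cycle is the fraction of time the accelerator is actively working.
    """
    accel_w = duty_cycle * accel_active_w + (1.0 - duty_cycle) * accel_idle_w
    return base_w + accel_w


# Assumed values for illustration only.
BASE_W = 0.8          # processor, memories, and buses
ACCEL_ACTIVE_W = 0.7  # accelerator while processing image rotate
DUTY_CYCLE = 0.2      # accelerator busy 20% of the time

# Accelerator left powered while idle: it draws full power all the time.
always_on = average_power(BASE_W, ACCEL_ACTIVE_W, ACCEL_ACTIVE_W, DUTY_CYCLE)

# Accelerator switched off while idle: it draws power only when busy.
gated = average_power(BASE_W, ACCEL_ACTIVE_W, 0.0, DUTY_CYCLE)

print(f"always on: {always_on:.2f} W, gated: {gated:.2f} W")
```

Under these assumptions the gated configuration stays below the 1 watt budget while the accelerator is still available whenever image rotate needs it.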

Solution

As part of the solution, we ran three experiments. First, we ran all the functions in software. Next, we identified a candidate for acceleration and moved it to hardware. Third, we applied power management. From an analysis or methodology standpoint, we also did three things. First, we analysed power versus performance. Second, we partitioned the application between hardware and software. Finally, we incorporated power management.

Using these methodologies, you can define any processor or other semiconductor component and describe a combination of processing units, memories, accelerators, DMAs, buses, and other topologies. You can also map different applications to the platform and check whether they meet the required criteria, which is the goal of this kind of design.