Dual Processor Server
Dual
Processor Server Optimization
Figure 1 VisualSim Simulation Model
of the Dual Processor Server
Figure 2 Analysis using VisualSim
Dual Processor System
Introduction
Company X is developing a state-of-the-art dual processor server appliance to support a variety of applications in networking and security. The company has limited information on the software application but knows the arrival rate of the data and instructions to be processed. The project will deliver the lowest-cost server appliance for application rates of up to 40 Gbps.
Purpose of Analysis
The systems engineers have developed a proposed system. This system must be analyzed at different application rates to find the break-off point between the single-CPU and dual-CPU options. Because the product is cost-sensitive, it is important to design the components for the worst case without over-designing the product.
Evaluation Criteria
There are two unknown criteria that must be evaluated:
- Rate of arrival of instructions from the application
- Data requirements for each instruction to be executed
This modeling effort looks at the impact of these two criteria against the number of CPUs and the bus speed.
System Block Diagram
The architect has developed a basic
block diagram with the physical
elements and knowledge of the various
instructions and request flows through
the system. The block diagram of
the system is shown in Figure 3.

Figure 3 Block Diagram
of Dual Processor Server
The architecture of this server blade consists of a shared bus with two CPUs, cache and SDRAM. There are four communication scenarios for the applications through the system.
- CPU = (CPU_1, CPU_2)
- Hit = (CPU_1, CPU_2) -> BUS -> Cache
- Miss = (CPU_1, CPU_2) -> BUS -> Cache -> BUS -> RAM
The starting assumptions for the distribution of data requirements across instructions and for the processing time on each architectural component are shown in Figure 4. The CPU processes 60% of all instructions without requesting external data, and the other 40% request data from the Cache. Measured against all instructions, 36% find a match (Hit) in the Cache, which responds to the CPU with the data, while 4% must take the additional step to the external SDRAM to get the data.

Figure 4 Starting
Assumptions
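The path mix in the starting assumptions can be checked with a few lines of arithmetic. The sketch below is illustrative only: the 60/36/4 split comes from the text above, but the per-component cycle counts in `CYCLES` are hypothetical placeholders, not the values from Figure 4, and the function names are invented for this example.

```python
# Fractions of all instructions on each path (from the starting assumptions).
PATH_MIX = {"cpu_only": 0.60, "hit": 0.36, "miss": 0.04}

# HYPOTHETICAL processing times in cycles; replace with the Figure 4 values.
CYCLES = {"cpu": 10, "bus": 2, "cache": 4, "sdram": 20}

# Components visited on each path, following the communication scenarios.
PATH_HOPS = {
    "cpu_only": ["cpu"],
    "hit": ["cpu", "bus", "cache"],
    "miss": ["cpu", "bus", "cache", "bus", "sdram"],
}


def cache_hit_rate(mix):
    """Share of cache requests (hit + miss) that are served by the cache."""
    return mix["hit"] / (mix["hit"] + mix["miss"])


def path_cycles(path):
    """Total cycles for one instruction on a path, ignoring queuing delay."""
    return sum(CYCLES[component] for component in PATH_HOPS[path])


def mean_cycles(mix):
    """Average cycles per instruction, weighted by the path mix."""
    return sum(share * path_cycles(path) for path, share in mix.items())


if __name__ == "__main__":
    print("cache hit rate:", round(cache_hit_rate(PATH_MIX), 3))
    print("mean cycles per instruction:", round(mean_cycles(PATH_MIX), 2))
```

With the stated mix, 36 of every 40 cache requests are hits, so the implied cache hit rate is 90%. The mean cycle count depends entirely on the placeholder timings and ignores contention, which is what the simulation model is there to capture.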
Model in VisualSim
The block diagram and starting assumptions are modeled and simulated in VisualSim. The VisualSim model and the analysis windows are shown in Figures 1 and 2 (Top).
Methodology:
The VisualSim modeling methodology separates the workload, the physical topology and the communications of the instruction. The VisualSim model consists of the following parts:
- Topology: The top left part of the diagram describes the physical elements that form the block diagram. The queue occupancy, latency, context switching, scheduling and preemption are determined at the physical elements.
- Communication: The top right part of the diagram describes the flow of the instructions and requests through the system. It starts at the IN3 and IN4 elements.
- Application: The bottom part is the abstraction of the application using a stochastic process, a data structure as a traveling token, and a multiplexer to determine the CPU to which the instructions need to be delivered.
- Based on the incoming data structure, the CPU determines whether a data request needs to be sent to the Cache or the instruction can be processed internally.
- The Bus and CPUs are described using the hardware schedulers, while the Cache and RAM use the software scheduler.
- The Bus, Cache and SDRAM allow for preemption, while the CPUs are scheduled as first-come, first-served without preemption.
- The flow in the model starts at the workload generator at the bottom of Figure 1.
- Based on the decision at Decision_Point, the instructions are executed in CPU1 or CPU2.
- The data requirements for the instruction execution are randomly generated and are included as fields in the data structure. Based on the data requirements, the request can be processed entirely within the CPU or routed through to the hit or miss paths on the right.
- The blocks in the communication flow on the right communicate in a connectionless mode with the physical elements on the left to execute the incoming instructions.
- Statistics generators collect analysis data from the physical elements and display the results.
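The application part of the model (stochastic workload, traveling data structure, and a decision point that picks the CPU) can be sketched outside VisualSim as follows. This is not VisualSim code: the `Token` class, its field names, and the round-robin dispatch rule are assumptions made for illustration, since the source does not state how Decision_Point chooses a CPU. Only the 60/36/4 path mix is taken from the starting assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for the data structure that travels the model."""
    seq: int   # arrival order
    cpu: str   # target CPU chosen at the decision point
    path: str  # "cpu_only", "hit" or "miss"


def generate_tokens(n, cpus=("CPU_1", "CPU_2"), seed=0):
    """Generate n tokens with randomly drawn data requirements."""
    rng = random.Random(seed)
    tokens = []
    for seq in range(n):
        # ASSUMPTION: round-robin dispatch; the real Decision_Point may differ.
        cpu = cpus[seq % len(cpus)]
        # Path mix from the starting assumptions: 60% / 36% / 4%.
        path = rng.choices(["cpu_only", "hit", "miss"], weights=[60, 36, 4])[0]
        tokens.append(Token(seq, cpu, path))
    return tokens


if __name__ == "__main__":
    for token in generate_tokens(5):
        print(token)
```

Keeping the data requirement as a field on the token mirrors the separation described above: the workload decides what each instruction needs, and the communication flow only reads that field to choose the CPU-only, hit, or miss path.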
Scenarios simulated
All of the blocks in the model are parameterized for exploring different scenarios. The model also has some global parameters that can change the operation of the simulation. For the purpose of this analysis, four scenarios have been created. In Figure 2 (Top), click "GO" to execute the simulation with the preset parameters, then experiment with the following:
- Start with the Base configuration of parameters provided (Case 1). The top results window shows the utilization of the various devices such as the CPU, Cache, Memory and Bus.
- Modify Task_Rate to Task_Time*0.2 and click "GO" (Case 2). The simulation executes about 20% of the way and then generates an exception. Such messages are expected and mark an important configuration. This message indicates that the buffer to CPU1 has overflowed and is dropping transactions. This forms the upper boundary of the data that can be transmitted through the system.
- Next, modify Task_Rate back to Task_Time*1.0, modify Number_of_CPUs to “1” and click "GO". The utilization of CPU2 will be significantly higher than in Case 1. The utilization of CPU1 will be zero, indicating that this CPU has not been turned on.
- Finally, keeping Number_of_CPUs at 1, modify Task_Rate to Task_Time*0.43 and click "GO". This produces an error message similar to the one in Case 2, indicating that this is the upper limit of the transaction rate that can be supported with a single CPU.
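The buffer overflow seen in these scenarios is what happens to any first-come, first-served queue when tasks arrive faster than they can be served. The following sketch reproduces the effect qualitatively with a deterministic toy model. It is not the VisualSim model: the function name, the fixed service time, the buffer size and the round-robin split across CPUs are all assumptions, so the overflow thresholds it produces will not match the 0.2 and 0.43 multipliers above.

```python
from collections import deque


def first_overflow(task_time, scale, service_time,
                   num_cpus=2, buffer_size=8, n_tasks=10_000):
    """Return the index of the first dropped task, or None if none is dropped.

    Tasks arrive every task_time * scale and are dealt round-robin to
    num_cpus FCFS, non-preemptive CPUs (an ASSUMED dispatch rule). Each CPU
    holds at most buffer_size tasks, counting the one in service.
    """
    interval = task_time * scale
    # Per CPU: completion times of the tasks still in that CPU's buffer.
    queues = [deque() for _ in range(num_cpus)]
    for i in range(n_tasks):
        now = i * interval
        queue = queues[i % num_cpus]
        # Retire every task that has finished by the current arrival time.
        while queue and queue[0] <= now:
            queue.popleft()
        if len(queue) >= buffer_size:
            return i  # buffer full: this task would be dropped
        # FCFS: service starts when the previously queued task completes.
        start = max(now, queue[-1]) if queue else now
        queue.append(start + service_time)
    return None


if __name__ == "__main__":
    # HYPOTHETICAL timing: service takes half of the base Task_Time.
    for cpus, scale in [(2, 1.0), (2, 0.2), (1, 1.0), (1, 0.43)]:
        dropped_at = first_overflow(1.0, scale, 0.5, num_cpus=cpus)
        print(f"CPUs={cpus} scale={scale}: first dropped task = {dropped_at}")
```

In this toy model a CPU overflows once its share of the arrivals comes in faster than one task per service time, so removing a CPU doubles the arrival interval at which the buffer starts to overflow. That is the same direction as the observed break-off points, where the single-CPU configuration overflows at a larger Task_Time multiplier than the dual-CPU one.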
Results
The Analysis window in Figure 2 consists of three sections:
- The left side has all the run controls associated with the simulation. The performance engineer has provided these as the simulation attributes that can be modified.
- The top display window displays the value of the data structure as it completes processing at the CPU, Cache, Bus and SDRAM. Once the simulation has completed, the aggregated statistics are presented for each physical element: CPU, Bus, Cache and SDRAM.
- The second graph is a timeline that displays the number of cycles consumed by each arriving instruction at the different physical elements.
As the various run control parameters are modified, you will notice that the utilization of the CPU changes dramatically, while the utilization of the Cache, SDRAM and Bus does not change considerably. This indicates that the Cache, SDRAM and Bus have been over-designed and can be optimized further. Also, the bottleneck does not occur at the Bus, as originally conceived, but at the queuing stage feeding the two CPUs.
Summary
This experiment can be used to optimize the design for performance and cost. As more detail is added to the model, other explorations can also be performed. These include optimizing the micro-code execution order, estimating power and identifying functional discrepancies. This model was created in about 2 hours, and about 2 days were spent on analysis and further refinement. This documentation took ½ day to complete.