UsingDMA

Parent Previous Next

VisualSim                                                                                                                              


Creating a System Model


Tutorial Goals

Model Location

Open this model in VisualSim from the following location:   

        $VS/doc/Training_Material/Architecture/Tutorial/Architecture_Exploration/Using_DMA/


Model Introduction

This model demonstrates the use of the Hardware Architecture Library in VisualSim. The session combines the performance and architecture libraries in assembling a detailed system model. This session helps you understand how to connect and configure the blocks into a model.

Block??Diagram of the System Architecture

Figure 1: Block Diagram of the System Architecture Model

VisualSim model??of the System Architecture Model

Figure 2: VisualSim model of the System Architecture

List of Block Used

Sl No

Library Block

Description

1

Digital Simulator


ModelSetup > Digital

This Simulator is used to model protocols, hardware, and mapping of behavior to architecture. This simulator is used when the model is being triggered as an event or based on time. The Digital Simulator implements the discrete-event Model of Computation (MoC). This Simulator maintains a notion of current time, and processes events chronologically in this time.


2

Traffic


Traffic > Traffic

Traffic block outputs a new Data Structure (DS) at time intervals specified by the "Time_Distribution" setting. A Data Structure is a transaction containing a list of Field Names + Values.


3

Expression_LIst


Behavior > Expression List

The Decision/Expression_List blocks execute a sequence of expressions in order. The default block contains one input and one output. The user can add multiple input and output ports.


4

Text Display


Result > Text > Text_Display

Displays the values arriving on the input port in a text display dialog. This block buffers the display data and updates the screen after the buffer is full.


5

TimeDataPlotter


Results > TimeDataPlotter

This block plots the incoming data on the Y-Axis against the current simulation time on the X-axis. Every wire connected to this block input is considered a separate dataset and plotted separately.


6

Architecture Setup


Hardware_Setup > Architecture_Setup

This block handles all the address mapping, routing, plotting, statistics and debugging for the Hardware Modeling components.



7

DMA


Hardware_Devices > DMADatabase

DMA block that represents a memory controller that sits between the Processor or bus or I_O Block and the Memory bank.



8

RAM


Memory > RAM

This block combines the operation of a basic memory controller (delay function) and the memory array. The block handles pre-fetch, read, write, refresh, and controller operations. The block can be interfaced to any Bus or Memory Controller.



9

Delay


Traffic > Delay

This block delays the incoming Data Structure by the value specified in the Delay Parameter.


10

Parameter


Model > Parameter >Parameter=

Parameter is a variable containing a constant that can be any VisualSim data type, function and/or a RegEx expression.


11

Initialize


Full Library > Source > Event > Initialize

This block can be used to generate a value at the start of a simulation at a zero delay. User can set an Initial Order, 0 being the lowest priority and higher values fire before lower priority Initialize blocks.



12

Statement


Full Library > Defining Flow > Basic Processing > Statement

This is a block that executes a mathematical expression on each of the four Field_Statement lines based on the Data Structure Expression language.



13

If_Else


Full Library > Defining Flow > Basic Processing > If_Else

This is a programming if-else operation. If the condition in the "If_Statement" equates to a true or 1, then the line numbers listed in the "if_Execute" are executed.


14

Variable List


ModelSetup > Variable List

The block is used to define memory locations that can be used in Expression, Decision, Basic Processing (Statement etc.), Smart_Machine, and Virtual_Machine blocks.


15

BusArbiter


Hardware Devices > BusArbiter

The Linear_Bus or Bus Arbiter block is the Arbiter for a Linear Bus or Bus Interface. The Linear bus is a shared bus topology with an arbiter.




16

BusInterface


Hardware Device > Bus_Interface

The Linear_Port or Bus Interface block is used to connect the devices to the Linear_Controller Bus.  The block has a queue for each port. The incoming transactions are queued and the head of the queue is sent to the Controller Queue.



17

IN


Behavior > IN

This block accepts incoming Data Structures or tokens from any OUT/MUX/uEngine/Virtual_Machine blocks and sends a value on the output port. The single parameter called Destination_Name is composed of two parts - the name and the value to be output, separated by ".".



18

OUT


Behavior > OUT

The OUT block accepts Data Structures or tokens arriving in the input port and transmits them as a virtual connections to IN, MUX, NODE, Virtual Machine, and uEngine.




Model Details

The system uses the following blocks: DMA engine, serial port, internal memory, external memory, fixed priority arbiter, and a processor core (traffic generator). See the following figure for more details.

System??Architecture Model

Figure 3. System Architecture Model

There are three data flows:

  1. One DMA channel moving from external to internal memory.
  2. One DMA channel moving from the serial port to internal memory.
  3. Processor reading from external memory.

DMA


Device Interface

Serial Port

Memory

Processor

Bus

Other Details:

Step-by-Step Instructions

  1. Instantiate the Digital Simulator block. Select the digitalDomainOnly option to use the high-speed version.
  2. Add parameters. In this case, the parameters are the Simulation Time; Serial port frequency and FIFO depth; SRAM, DRAM and Bus speed; and Serial Data Bytes size.

Sim_Time                          :  100.0E-6

Serial_FIFO_Bytes             :  32

S_Port_Rate                      :  1.0E6

Ext_DRAM_Speed_Mhz     :  133.0

Int_DRAM_Speed_Mhz       :  166.0

Bus_Speed_Mhz                :  166.0

S_Data_Bytes                    :   4

  1. Link the Simulation Time with the Digital Simulator stopTime.
  2. Instantiate the Architecture_Setup block.

arch_setup

Figure 4. Arch_Setup

  1. Instantiate a Linear Controller or BusArbiter and 3 Bus Interface / Linear Port blocks below them. Configure it as shown in the figure below:

Linear_bus

Figure 5. Linear_Bus

  1. Instantiate the External and Internal Memory (RAM blocks for both).

DRAM_Config

Figure 6. DRAM_Config

  1. Create an ExpressionList block and populate it as shown below. In this tutorial we are using field names to provide DMA parameters, these can be seen in the image below.

Processing

Figure 7. Parameters for Statement

  1. Use Device Interface to specify the source name as "Processor".

Figure 8. Parameters for Device Interface

  1. Connect them to the Bus Interface.

DMA

Figure 9. Before adding DMA

  1. Next instantiate and configure the DMA.

DMA_Ctrl

Figure 10. Parameters for DMA Controller

  1. Instantiate a DeviceInterface block. Specify the source name as DMA_In.


Figure 11. Parameters for Database Block


  1. Instantiate the Traffic and the ExpressionList block to define the attributes to generate the transactions for the Ext_to_Int transfer.

DMA_Next

Figure 12. DMA_Next

  1. Initialize a memory to store the current FIFO depth for the Serial Port.
  2. Attach a Text_Display block to the stats and util output of the Architecture_Setup block. Also, configure the Trigger input to the DMA for returning the acknowledge message.
  3. Connect the two DMA flows and the Processor flows to the Latency processing and TimeDataPlotter for displaying the Latency.

Full_SCR

Figure 13. Full SRC

  1. Use the OUT + IN block to connect the Processor output to the Latency plotter.

Execution



Experiments

Now that we have run the basic model we can start experimenting with it by changing various parameters and observing the differences in the latency plot. The following is a list of some variables we will be changing essentially one by one.

    1. Outstanding_Req_Count in DMA Controller
    2. Data Size from transaction field in ExpressionList2 & ExpressionList8
    3. Burst Size from transaction field in ExpressionList2 & ExpressionList8
    4. Width_Bytes from transaction field in ExpressionList2 & ExpressionList8
    5. Number of Channels and distribution of traffic in DMA Channel and ExpressionList2 & ExpressionList8

We will first establish a baseline for this we will use the following:

Outstanding_Req_Count: {1,1} 
Data Size:                       32
Burst Size:                      32
Width_Bytes:                    4
Number of Channels          1

With that we get the following baseline Latency and throughput numbers:

DMA_Req1_D32_B32_W4_Ch1DMA_txt_Req1_D32_B32_W4_Ch1



Experiment 1:
So we will startup with altering the Outstanding_Req_Count to 2. This will mean that the DMA will dispatch two requests before pausing and waiting for the resonse of the sent requests.
We will be using the following parameters:

Outstanding_Req_Count: {1,1} 
Data Size:                       32
Burst Size:                      32
Width_Bytes:                    4
Number of Channels          1

With that we get the following results

DMA_Req2_D32_B32_W4_Ch1DMA_txt_Req2_D32_B32_W4_Ch1

Note: We see that the transactions are a lot more grouped in the case of uP traffic.We can also see a slight uptake in the throughput. Another point to not is the overall slight increase in latency for uP.


Experiment 2:
Starting off with the baseline, we will now change the Burst Size, taking it down to 16 bytes. Since our Data Size is remaining the same at 32 bytes, setting the burst size to half that means that the transaction will get fragmented into two parts.
We will be using the following parameters:

Outstanding_Req_Count: {1,1} 
Data Size:                       32
Burst Size:                      16
Width_Bytes:                    4
Number of Channels          1

With that we get the following results

DMA_Req1_D32_B16_W4_Ch1DMA_txt_Req1_D32_B16_W4_Ch1

Note: Since the transactions are smaller they take less time to complete, and so we see only a single jump in Latency that we saw in the baseline. Our latency has gone up by a little bit however the more pressing matter is our throughput which has now dropped by about 25%. And since our transactions have not slowed down, this bottleneck is also reflected in the buffer occupancy which has increased compared to the baseline. Should this simulation go on long enought there is a possibility of the buffer getting completely filled up.



Experiment 3:
Now let us look at a bit of an odd configuration, to present an error condition, where Starting off with the baseline, we will now change the Word Size, taking it down to 16 bytes. We will also disconnect the relations Traffic2 and Traffic3 to the rest of the model, leaving only Traffic (Serial write). We will also limit the number of transactions produced by Traffic block to 8 (because each of these transactions are 4 bytes and 8 of these will make 32 bytes, the capacity of our Serial FIFO buffer), triggering a single transaction toward the DMA. This can be done inside by setting the setting the parameter Number_of_transactions to 8. We will also be applying listen-to-port on Req input port of the DMA and simultaneously listen to the output port Ack of the DMA.
We will be using the following parameters:

Outstanding_Req_Count: {1,1} 
Data Size:                       16
Burst Size:                      32
Width_Bytes:                    4
Number of Channels          1

In the Listen to port windows you will see that a single transaction in both of them. As might have assumed, we should have seen 2 transactions coming out of the DMA since our input was 32 bytes and, having set DMA Data Size to 16, the output port Ack is emitting 16 bytes. We can also look at the field variables, the A_bytes field for the input port Req says 32 but A_bytes at the output port Req says 16 (the same as DMA Data size). This clearly shows data lose. The extra 16 bytes were discarded by the DMA.



Experiment 4:
Now let us change the Width_Bytes. This is the width with which the DMA is connected to the interface block. We will set it to 8 bytes or 64-bit.

We will be using the following parameters:

Outstanding_Req_Count: {1,1} 
Data Size:                       32
Burst Size:                      32
Width_Bytes:                    8
Number of Channels          1

We get the following results:

DMA_Req1_D32_B32_W8_Ch1DMA_txt_Req1_D32_B32_W8_Ch1.jpg

We can see that the overall latency is lower, we can also see that the DMA_throughput is also higher. Although it is important to note that the width of the bus is usually an expensive paramerter to alter so is usually not prefered.


Experiment 5:
We will now change the number of channels. Since we currently have two devices using the DMA, we can assign each of them a single channel. We do that by changing the number of chnnels in the DMA controller to 2 and then in ExpressionList2, change input.A_DMA_Channel = {1}  to input.A_DMA_Channel = {2}.

We will be using the following parameters:

Outstanding_Req_Count: {1,1} 
Data Size:                       32
Burst Size:                      32
Width_Bytes:                    4
Number of Channels          2

Here we get the following results

DMA_Req1_D32_B32_W4_Ch2DMA_txt_Req1_D32_B32_W4_Ch2

Now as you can see we have not seen any meaningful positive change. You might thing that there is no reason to do this if we don't see any imporovement, however the important aspect of this is that since we have two separate channels we can configure them separately. This is illustrated in the next experiment.


Experiment 5:
We Continuing from the previous experiment, keeping the channel number at 2. We will now also change the Outstanding_Req_Count for channel 1 only.

We will be using the following parameters:

Outstanding_Req_Count: {2,1} 
Data Size:                       32
Burst Size:                      32
Width_Bytes:                    4
Number of Channels          2

Here we get the following results

DMA_Req2_D32_B32_W4_Ch2DMA_txt_Req2_D32_B32_W4_Ch2

We can see a significant drop in the latency and this is with almost the same throughput as Experiment 4. Here we can also see an importance of multiple variable at the same time.

Buffer occupancy Issue

Now that we have made ourselves familiar with some of the parameters that can affect our results in a bottlenecked situation. we will now address the biggest issure, the increasing latency graph. This clearly means as time goes on the latency will keep on increasing, as we are being limited by our Bus Speed. Since the traffic being generated does not change, we wil eventually fill up our buffer, which can be confirmed by running the simulation for much longer. We wil now try to increase our Bus Speed. You can experiment with a few configurations, a working configuration is 202 Mhz, set this in the Bus_Speed_Mhz global paramerter. Also we will be keeping the configuration from Experiment 5.

With that we get the following results:
DMA_202Mhz_Req2_D32_B32_W4_Ch2DMA_txt_202Mhz_Req2_D32_B32_W4_Ch2


As you can see the latency is stable and no longer increasing. This might make you think that all that experminatation we did was entirely useless. However that is not actually the case. We can demonstract that be using our baseline and only increasing the Bus Speed.

With that we get the following results

DMA_202Mhz_Req1_D32_B32_W4_Ch1DMA_txt_202Mhz_Req1_D32_B32_W4_Ch1.jpg

As you can see the latency is still rising. We can now implement the Outstanding_Req_Count, setting it to {2}.
With that we get the following results.

DMA_202Mhz_Req2_D32_B32_W4_Ch1DMA_txt_202Mhz_Req2_D32_B32_W4_Ch1

Is you can clearly see, without the Outstanding_Req_Count parameter change we would have had to push Bus Speed much higher. With some experimentation, we were able to mitigate that.

Read & Write Sequence

The model at this point has been setup to only conduct a read operation on the Ext_DRAM. We can change that by changing the action sequence used in ExpressionList8. Try changing the DMA related fields as shown below, such that the DMA Controller does a Read followed by a Write.

Full_SCR



Now run the model again and observe the difference in the the output latency.



Notes

                Hint: The bus reorders the transactions based on priority.