Large delay between DMA transactions

Altera_Forum
Honored Contributor II
2,195 Views

Hello,

Does anybody know why there is a large delay (about 50 us) between consecutive DMA transactions?

I use Quartus 14.0 and Eclipse ARM DS-5 v14.0.

Platform: EBV SoCrates board

I use the built-in DMA controller (the DMAC in the HPS) through the Altera HWLib (14.0) API, like this:

alt_dma_memory_to_register(Dma_Channel, &program, dst, src, size, 32, false, (ALT_DMA_EVENT_t) 0);

Data is transferred from memory through the H2F AXI bridge to a GPIO (mapped as a register).

Each transaction is 128 32-bit words long.

I would like to run as many back-to-back transactions as possible, but I cannot start DMA transfers more often than about once every 50 us.

Within a single transfer the throughput is about 307 MBytes/s (quite good).

The delay between transactions does not depend on the transaction length (it stays at about 50 us).

I ran a few tests, and the time from the call to alt_dma_memory_to_register() until the data becomes visible is very large (about 40 to 50 us). Why?

How can I solve this problem?
Altera_Forum

The likely reason is that a lot of error checking and validation goes into building each DMA program, such as:

- validating that the program buffer has enough space to hold the entire requested program

- checking that register and memory addresses are aligned

- checking that all loops are well formed

- checking that the program is well formed

- syncing the program to RAM

- (the list goes on).

 

Things you can do to improve performance:

- Set up the MMU page table and enable all caching. (For 14.0 this works correctly if you use a flat VM model, by which I mean that every virtual address matches its physical address.)

- You can try hand-editing out the error checking, but I don't know how much that will gain you.
Altera_Forum

Hello,

Thank you for the reply.

You are probably right about the possible reasons.

I added "alt_cache_system_enable(); //cache enable" to the init procedure, but without success (it has no effect on the large delay).

The question is still open: how can we speed up the built-in DMA?

At the moment I am testing different properties of the SoC system, and my test programs are very simple. The code that tests the built-in DMA looks like this (main section):

 

int main(int argc, char** argv)
{
    system_init();        // init: GPIO, bridge, DMA, timer, cache enable
    soc_int_setup();      // set up interrupts
    global_timer_init();  // start timer (e.g. 10 ms interval)
    uart0_init();         // start UART
    generate_test_data(); // fill Write_Buffer[]

    while (1)
    {
        if (GLOBAL_TIMER_SEMAPHORE == true) // set by the timer every 10 ms
        {
            GLOBAL_TIMER_SEMAPHORE = false;

            // Issue e.g. 128 DMA transactions; each one is 128 32-bit words long.
            for (int i = 0; i < 128; i++)
            {
                LED_GPIO_State ^= 0x10000000;
                alt_gpio_port_data_write(ALT_GPIO_PORTA, 0x10000000, LED_GPIO_State); // toggle LED

                // Here is a large delay (about 50 us) between the LED toggle
                // and the time when the DMA data becomes visible.

                // Start DMA; the transfer itself takes about 3.3 us (approx. 307 MBytes/s).
                dma_test_memory_to_register(&Write_Buffer[0],
                    (uint32_t *)(ALT_LWFPGA_BASE + ALT_LWFPGA_LED_OFFSET), 128);

                // Small delay (approx. 4.1 us) until ALT_DMA_CHANNEL_STATE_STOPPED is returned.
            } // for
        } // timer
    } // while

    return 0;
} // main

 

My questions:

1) Is it possible to change something in the code or in the DMA init to remove the delays?

2) I am now trying to implement the mSGDMA in the FPGA to compare it with the built-in DMA. I am having a lot of trouble connecting the mSGDMA to the HPS (not to a Nios!). Does anybody know how to do it, and especially how to write the driver code for a bare-metal application?

Regards

-jaro
Altera_Forum

Did you set up the page table and MMU before enabling caching? This is necessary for caching to work.

 

 

Altera_Forum

I would throw away the wrapper functions that set up the DMA and access the DMA registers through a minimal set of wrappers (that get inlined into your code).

You can speed up accesses to the DMA hardware registers (and all other 'small io' registers) by:

- Putting all the 'small io' below Avalon address 0x8000.

- Defining a single C structure that matches the layout of the registers.

- Using 'r0' as a global register variable that points to the structure.

The compiler should then access all the io locations using an 'offset from r0'.

(I've done this with gp, but not r0)
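
A minimal sketch of the "single C structure plus inlined wrappers" idea from the list above. The register names and offsets are hypothetical and do not describe any real DMA controller; on hardware the pointer would hold the mapped register base, while here a plain struct in RAM can stand in for it. The global register variable part is left out because it depends on the compiler and the ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block: three consecutive 32-bit registers. */
typedef struct {
    volatile uint32_t control; /* offset 0x0 */
    volatile uint32_t status;  /* offset 0x4 */
    volatile uint32_t data;    /* offset 0x8 */
} io_regs_t;

/* Minimal wrappers. static inline lets the compiler fold each one into the
 * caller as a single store or load at a fixed offset from the base pointer.
 * The asserts are debug guards and vanish when NDEBUG is defined. */
static inline void io_write_data(io_regs_t *io, uint32_t value)
{
    assert(io != NULL);
    io->data = value;
}

static inline uint32_t io_read_status(const io_regs_t *io)
{
    assert(io != NULL);
    return io->status;
}
```

Because every access goes through one base pointer, the compiler only needs that pointer in a register to reach any field with an immediate offset.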
Altera_Forum

I've successfully implemented the mSGDMA controller. It's quite simple once you know how to do it :)

If you need DMA to transfer data from the FPGA to HPS SDRAM or from HPS SDRAM to the FPGA, the mSGDMA module is much better than the built-in DMA controller.

 

System description:

1) mSGDMA module connected to F2S (the FPGA-to-SDRAM bridge), not to the F2H AXI bridge.

2) mSGDMA setup: Memory -> Stream or Stream -> Memory

Data width = 64 bits

Data FIFO depth = 64

Descriptor FIFO depth = 64

Transfer length = 1 kB or 16 kB

Burst enabled, max. burst count = 16

 

Measurement results:

mSGDMA, transfer length = 1 kByte:

1024 packets * 1 kB, Memory -> Stream, throughput = approx. 290 MBytes/s (good!)

1024 packets * 1 kB, Stream -> Memory, throughput = approx. 290 MBytes/s (good!)

mSGDMA, transfer length = 16 kByte:

64 packets * 16 kB, Memory -> Stream, throughput = approx. 378 MBytes/s (good!)

64 packets * 16 kB, Stream -> Memory, throughput = approx. 378 MBytes/s (good!)

For the built-in DMA I got the following results:

1 packet of 1 MByte, Memory -> Memory, throughput = approx. 423 MBytes/s (very good!)

1 packet of 1 MByte, Memory -> Register, throughput = approx. 38.4 MBytes/s (poor!)

1024 packets of 1 kByte, Memory -> Memory, throughput = approx. 16.1 MBytes/s (very poor!)

1024 packets of 1 kByte, Memory -> Register, throughput = approx. 11.7 MBytes/s (very poor!)

The last two results are due to the large delay between consecutive DMA transactions!
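
The small-packet figures for the built-in DMA are roughly what a fixed per-transaction setup cost would produce. A minimal sketch of that arithmetic, assuming a simple model in which every transaction pays a fixed overhead and then moves its payload at the raw rate (the function name and the model are illustrative and not part of HWLib; 1 MByte is taken as 10^6 bytes):

```c
#include <assert.h>

/* Effective throughput in MBytes/s for one transaction, assuming a fixed
 * setup overhead (in microseconds) followed by the payload moving at the
 * raw rate. 1 MByte/s equals 1 byte/us, so the units cancel directly. */
double effective_mbytes_per_s(double payload_bytes,
                              double raw_mbytes_per_s,
                              double overhead_us)
{
    assert(payload_bytes > 0.0 && raw_mbytes_per_s > 0.0 && overhead_us >= 0.0);
    double transfer_us = payload_bytes / raw_mbytes_per_s;
    return payload_bytes / (overhead_us + transfer_us);
}
```

With a 1024-byte payload, a raw rate of 423 MBytes/s and 50 us of overhead, the model gives a little under 20 MBytes/s, the same order as the measured 16.1 MBytes/s. With a 1 MByte payload the same overhead is almost invisible.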

 

 

Important notes about the mSGDMA and F2S implementation (poorly documented):

1) You must service the msgdma_valid and msgdma_ready signals on the FPGA side (for example, connect these signals together).

2) You must define ALT_BRIDGE_PROVISION_F2S_SUPPORT=1 in your Makefile (see the HWLib help).

3) You must add the assembly file alt_bridge_f2s_gnu.s to your Makefile so that it is linked in (see the HWLib help).

4) You must initialize the F2S bridge in your program, e.g. alt_bridge_init(ALT_BRIDGE_F2S, NULL, NULL)

 

 

Regards  

 

-jaro
Altera_Forum

Thanks for your benchmarks. 

 

One of the shortcomings of the alt_dma_*_to_*() APIs is that they reassemble the entire DMA program whenever they are called. If you run the transfers on the same addresses, you can call alt_dma_*_to_*() for the first transfer, keep the ALT_DMA_PROGRAM_t program buffer, and then call alt_dma_channel_exec() repeatedly with that buffer. This greatly reduces the time spent reassembling (essentially) the same transfer. Clearly, if your transfer addresses change, some adjustments need to be made.
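
The reuse pattern described above can be sketched on a host machine. The types and functions below are stand-ins written for this sketch, not HWLib code: dma_assemble_and_run() models the expensive alt_dma_*_to_*() path and dma_run_again() models alt_dma_channel_exec(). They only count calls, which is enough to show that the program is assembled once for n transfers.

```c
#include <assert.h>

/* Host-side stand-in, NOT the HWLib ALT_DMA_PROGRAM_t. */
typedef struct {
    int assembled; /* set once the program has been built */
} dma_program_t;

static unsigned assemble_calls; /* models the alt_dma_*_to_*() path */
static unsigned exec_calls;     /* counts every started transfer */

/* Models alt_dma_memory_to_register(): validate, assemble, then run. */
static void dma_assemble_and_run(dma_program_t *pgm)
{
    pgm->assembled = 1;
    assemble_calls++;
    exec_calls++;
}

/* Models alt_dma_channel_exec(): run an already assembled program. */
static void dma_run_again(const dma_program_t *pgm)
{
    assert(pgm->assembled);
    exec_calls++;
}

/* Issue n identical transfers while assembling the program only once. */
void run_repeated_transfers(dma_program_t *pgm, unsigned n)
{
    if (n == 0) {
        return;
    }
    dma_assemble_and_run(pgm); /* the first transfer pays the setup cost */
    for (unsigned i = 1; i < n; i++) {
        /* On hardware, wait here until the channel has stopped
         * before restarting it with the same program buffer. */
        dma_run_again(pgm);
    }
}
```

On the target, the program buffer must stay allocated and unmodified for as long as it is being re-executed.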

 

Another trick for improving performance when you do need to adjust addresses is to update the DAR (destination address register) or SAR (source address register) using the alt_dma_program_update_reg() API; however, there can be many caveats. The new SAR and DAR values need to be "congruent" mod 8 to the original addresses. If you have caching enabled, that adds another level of complexity and it may not work. Just test your use case before fully relying on this method :eek:.
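
The mod 8 condition can be checked before touching the program. A minimal sketch, assuming the requirement is exactly that the low three address bits of the new address match those of the original; the helper names are made up for this example and are not part of HWLib.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when new_addr has the same remainder modulo 8 as orig_addr,
 * i.e. the low three address bits match. */
bool dma_addr_congruent_mod8(uintptr_t orig_addr, uintptr_t new_addr)
{
    return ((orig_addr ^ new_addr) & (uintptr_t)0x7) == 0;
}

/* Debug guard to call before patching SAR or DAR in an assembled program. */
void dma_require_congruent_mod8(uintptr_t orig_addr, uintptr_t new_addr)
{
    assert(dma_addr_congruent_mod8(orig_addr, new_addr));
}
```

For example, a buffer originally at 0x1000 may be replaced by one at 0x2008, but not by one at 0x2004.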

 

fdh
Altera_Forum

Hi Jaro, regarding #1 at the end of your post: you are correct that you need to handle the valid signal for the write path and the ready signal for the read path. These signals are well documented in the Avalon-ST spec. Shorting them together may work in your case, but in general data only moves into the write master or out of the read master when valid and ready are both high. When you short them together you lose the ability to provide flow control, so if you see redundant or missing samples, that is why: you lost your back-pressure mechanism.
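
The rule that data moves only when valid and ready are both high can be modeled in a few lines. This is a host-side illustration with a made-up function name, not HDL and not part of any Altera library; each array index stands for one clock cycle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count the beats that transfer over n cycles: a beat moves only in a
 * cycle where the source asserts valid AND the sink asserts ready. */
size_t count_transferred_beats(const bool *valid, const bool *ready, size_t n)
{
    assert(valid != NULL && ready != NULL);
    size_t beats = 0;
    for (size_t i = 0; i < n; i++) {
        if (valid[i] && ready[i]) {
            beats++;
        }
    }
    return beats;
}
```

With valid = {1, 1, 0, 1} and ready = {1, 0, 1, 1}, only cycles 0 and 3 transfer data. A source that ignores ready in cycle 1 would assume its beat was accepted, which is how samples get lost once the back-pressure path is removed.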

 

By the way, if you want to hide read latency in MM-->ST or MM-->MM transfers using the mSGDMA, set the 'early done enable' bit (24) in the control field of the descriptor. This makes sure that the read master starts working on the next descriptor before all the read data has returned from memory.
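
Setting that bit is a single OR on the descriptor's control word. A minimal sketch: the bit position (24) comes from the post above, while the macro and function names are invented for this example and are not taken from any Altera driver header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 24 of the descriptor control field: 'early done enable'.
 * The macro name is hypothetical. */
#define MSGDMA_DESC_CTRL_EARLY_DONE_EN (UINT32_C(1) << 24)

/* Set the early done enable bit, leaving all other control bits as the
 * caller chose them. */
void msgdma_enable_early_done(uint32_t *control)
{
    assert(control != NULL);
    *control |= MSGDMA_DESC_CTRL_EARLY_DONE_EN;
}
```

Call it on the control word while building the descriptor, before the descriptor is handed to the dispatcher.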
Altera_Forum

Dear All,

We are using the HPS-side DMAC and measured results very similar to those in jaro's post (for the "built-in DMA").

My question: why is the buffer size limited to 2 MB when the alt_dma_zero_to_memory / alt_dma_memory_to_memory / alt_dma_reg_to_memory API functions are used?

Symptom: if the buffer size is increased above 2 MB, alt_dma_event_int_status_get_raw() never returns ALT_E_SUCCESS.

Neither the Cyclone V SoC device handbook nor ARM's CoreLink DMA-330 Controller technical reference manual (http://infocenter.arm.com/help/topic/com.arm.doc.ddi0424d/ddi0424d_dma330_r1p2_trm.pdf) mentions this limitation.

However, the latter document is useful if you want to modify the DMAC microcode program.

Thanks

Zsolt