
PCIe throughput test

Altera_Forum
Honored Contributor II

Hi All, 

 

I am doing a PCIe throughput test and the result numbers are quite strange (write is 210MB/s but read is just 60MB/s for PCIe Gen1 x1). I would like to ask for your suggestions and corrections if there is anything wrong in my test configuration.

 

My test configuration is as follows:

+ One board is configured as the Root Port and the other board is configured as the Endpoint. The PCIe link is Gen1, width x1, MPS 128B. Both boards run Linux.

+ At the Root Port side, we allocate a 4MB memory buffer and map inbound PCIe memory transactions to this buffer (a rough sketch of this setup follows the list below).

+ At the Endpoint side, we do DMA reads/writes to the remote buffer and measure throughput. In this test the Endpoint is always the initiator of the transactions.

+ The test result is 214MB/s for the EP Write test and only 60MB/s for the EP Read test. The Write throughput is reasonable for PCIe Gen1 x1, but the EP Read throughput is far too low.
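For reference, the RP-side buffer setup is roughly the following. rp_set_inbound_window() is just a placeholder name for the platform-specific inbound translation programming on our root complex, so treat this as a sketch rather than the exact code:

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/gfp.h>

#define RP_BUF_SIZE (4 * 1024 * 1024)   /* the 4MB buffer described above */

/* Placeholder: program the root complex's inbound address translation so
 * that PCIe reads/writes coming from the EP land in the buffer at bus_addr. */
extern int rp_set_inbound_window(dma_addr_t bus_addr, size_t size);

static int rp_setup_buffer(struct device *dev)
{
    dma_addr_t bus_addr;
    void *buf;

    /* Coherent allocation, so no explicit cache maintenance is needed on the RP side */
    buf = dma_alloc_coherent(dev, RP_BUF_SIZE, &bus_addr, GFP_KERNEL);
    if (!buf)
        return -ENOMEM;

    return rp_set_inbound_window(bus_addr, RP_BUF_SIZE);
}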

 

For the RP board, I tested it with a PCIe e1000e Ethernet card and got a maximum throughput of ~900Mbps. In the Ethernet TX path, the Ethernet card (playing the Endpoint role) also issues EP Read requests and still gets high throughput (~110MB/s) even with smaller DMA transfers, so there must be something wrong with my DMA EP Read configuration.

 

The DMA Read test can be summarized with the pseudocode below:

 

dest_buffer = kmalloc(1MB)
memset(dest_buffer, 0)
dest_phy_addr = dma_map_single(dest_buffer)
source_phy_addr = outbound region of Endpoint
get_time(t1)
loop 100 times:
    issue DMA read from source_phy_addr to dest_phy_addr
    wait for DMA read completion
get_time(t2)
throughput = (1MB * 100) / (t2 - t1)
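In rough Linux driver terms the loop is something like the code below. dma_read_submit()/dma_read_wait() stand in for whatever the EP's DMA engine driver actually exposes, and ep_outbound_base stands for the bus address of the outbound window that targets the RP buffer; only the mapping, timing and throughput math are meant literally:

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/ktime.h>
#include <linux/math64.h>
#include <linux/slab.h>
#include <linux/string.h>

#define XFER_SIZE   (1024 * 1024)   /* 1MB per DMA read */
#define ITERATIONS  100

/* Placeholders for the EP DMA engine driver */
extern int dma_read_submit(dma_addr_t src, dma_addr_t dst, size_t len);
extern int dma_read_wait(void);
/* Placeholder: outbound window address that maps to the RP buffer */
extern dma_addr_t ep_outbound_base;

static int ep_read_throughput_test(struct device *dev)
{
    void *dest = kmalloc(XFER_SIZE, GFP_KERNEL);
    dma_addr_t dest_bus;
    u64 t1, t2, mbps;
    int i;

    if (!dest)
        return -ENOMEM;
    memset(dest, 0, XFER_SIZE);

    /* The device writes into this buffer, so map it DMA_FROM_DEVICE */
    dest_bus = dma_map_single(dev, dest, XFER_SIZE, DMA_FROM_DEVICE);
    if (dma_mapping_error(dev, dest_bus)) {
        kfree(dest);
        return -EIO;
    }

    t1 = ktime_get_ns();
    for (i = 0; i < ITERATIONS; i++) {
        dma_read_submit(ep_outbound_base, dest_bus, XFER_SIZE);
        dma_read_wait();            /* block until the read DMA completes */
    }
    t2 = ktime_get_ns();

    /* (bytes * 1000) / nanoseconds = MB/s */
    mbps = div64_u64((u64)XFER_SIZE * ITERATIONS * 1000, t2 - t1);
    dev_info(dev, "EP read throughput: %llu MB/s\n", mbps);

    dma_unmap_single(dev, dest_bus, XFER_SIZE, DMA_FROM_DEVICE);
    kfree(dest);
    return 0;
}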

 

Any recommendations and suggestions are appreciated. Thanks in advance!
5 Replies
Altera_Forum
Honored Contributor II

Have you verified the lengths of the PCIe TLPs? 

If the DMA isn't generating long enough TLPs then you'll get low throughput - but I suspect that even 60MB/s requires reasonable-length TLPs.

 

The other difference between read and write might be due to the extra pipelining that can be done for writes. This will be significant if the initiator only has 1 read TLP outstanding at any time. 
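A back-of-envelope example of that effect (the latency figure here is just an assumed number, not something measured in this thread): a non-posted read stream can only sustain about outstanding_reads × read_request_size / round_trip_latency. With a single outstanding 128-byte read request and, say, a 2µs request-to-completion round trip, that is 128B / 2µs ≈ 64MB/s - in the same ballpark as the 60MB/s reported above. More reads in flight, or larger read requests, raise that ceiling proportionally; posted writes don't pay this penalty because the initiator can stream them back to back.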

 

I've measured throughput from a small PPC (root port and initiator) to the FPGA (slave) - I got similar values for read and write, but only 20ns/byte to internal memory (SDRAM is a lot slower). Although that is timed from userspace, so it includes the copyout. 

For our purposes that was enough - after I'd managed to get the ppc's pcie dma working.
Altera_Forum
Honored Contributor II

Hi dsl, 

 

Thank you for your reply! 

 

We do not have a PCIe analyzer available to check the TLP length on the wire; let me double-check whether there are some registers available for these statistics. 

The throughput of memory-to-memory DMA is about 400MB/s, so we can rule the memory out as a suspected bottleneck. 

If we try to initiate the DMA test from the RP, we have the following numbers: RP read 147MB/s, RP write 174MB/s 

 

There are two things that I can think of: 

1/ The e1000e (as the EP device) uses its Read DMA to fetch data from the RP in some better way, so it gets much higher throughput and can sustain the Gigabit Ethernet interface. 

2/ This low throughput is a hardware constraint of the EP board. The PCIe core on the EP board supports only 2 outstanding outbound read requests.
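Back-of-envelope (assuming the DMA issues 128-byte read requests): if only 2 reads are ever in flight, the observed 60MB/s corresponds to a request-to-completion round trip of roughly 2 × 128B / 60MB/s ≈ 4µs, which is a plausible latency for this kind of link. So hypothesis 2/ is at least consistent with the numbers.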
Altera_Forum
Honored Contributor II

You can probably infer the TLP length by using SignalTap to look at the memory transfers at the slave end. 

That might also show up other 'interesting' artifacts.
Altera_Forum
Honored Contributor II

Your setup and terminology are confusing me a bit. Normally reads/writes are talked about with respect to the host (root port), but I can't tell if you're asking about this or about data flow from the RP to the EP (a read from the perspective of the EP, although your pseudocode appears to be sending data the other way). I understand that the reads are initiated (requested) by the EP. Can you clarify your setup some more? 

 

On the FPGA, what kind of design are you using? Is it one of the Altera-provided designs, or is it custom? Your FPGA setup will have a significant impact on TLP size. For example, if you're using Qsys with the Altera PCIe core and your DMA is not sending bursts (or using a bursting interface) to the PCIe core, your TLPs are going to be sub-optimal. Is your role to work on the FPGA design or just the software? 

 

An FPGA sim should show TLP size pretty quickly for data flow in the EP -> RP direction, but not the other way (since that depends on what the host is doing). If your problem is in the RP -> EP direction, then dsl is correct, and putting SignalTap on your memory interface is probably a good way to figure out the TLP size (if you only get a small number of bytes grouped in a burst, you're probably seeing very small TLPs). If your problem is in the other direction, this can help, but I'd recommend a sim instead. I normally use DrivExpress for this sort of sim because it's pretty easy to set up the DMA transactions and host memory, and you can then see everything that's going on in the FPGA and on the PCIe bus. You could adapt their DMA examples to match exactly what your driver is doing. I think it would be free in your case, and better than trying to modify the Altera BFM. 

 

What PCIe core are you using in the EP? Two outstanding requests isn't much. Altera HIP cores usually allow up to 32 tags, so this could be your problem. 

 

The NIC card (GigE) throughput is a pretty different beast. Are you also using the e1000e driver? These are pretty well tuned hardware/driver pairs usually.
Altera_Forum
Honored Contributor II

Someone just added a post to an old thread where an earlier comment had implied that the SGDMA was likely to only generate 64-bit TLPs for writes. 

 

Generating long TLPs requires some direct links between the DMA and PCIe blocks (think about how a long read TLP would be requested). This is easiest if the DMA is part of the PCIe block (which is what I had to use on the PPC); if it is external (like the SGDMA), I'm not sure how long the generated read TLPs get.
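To put rough numbers on why short TLPs hurt (standard PCIe framing overhead, not anything measured here): each TLP carries about 20-24 bytes of overhead (start/end framing, sequence number, 3-4DW header, LCRC). An 8-byte (64-bit) payload therefore gives roughly 8/28 ≈ 30% link efficiency, while a 128-byte payload gives roughly 128/148 ≈ 86% - before counting the DLLPs used for ACKs and flow-control credits.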