In Qsys, can the PCIe IP accept a write request that exceeds 512 bytes?

Altera_Forum · ‎01-11-2013

Our FPGA project is based on a Cyclone IV EP4CGX15 with no external RAM.

The PCIe IP documentation says: "A Qsys-generated PCI Express Avalon-MM bridge accepts Avalon-MM burst write requests with a burst size of up to 512 bytes."

It also says: "The bridge splits incoming burst writes that cross a 4 KByte boundary into at least two separate PCI Express packets"

I think the latter applies to SOPC Builder, since the Qsys bridge only accepts write requests of up to 512 bytes.

But if I transfer far more data to the PCIe IP using the SGDMA or DMA, say 4,147,200 bytes, can the PCIe IP accept it and split the burst write?

Thanks for any reply.

Regards~

Altera_Forum · ‎01-11-2013

The SGDMA/DMA will generate a lot of smaller Avalon bursts, each of which is (probably) processed separately.

The maximum sized PCIe packet is (probably) 128 bytes, so there is probably little reason for the DMA to generate longer bursts.

PCIe supports multiple outstanding requests - and I presume the Altera PCIe master is capable of generating them - so you only get very marginal throughput improvements for very long TLPs (short ones are very slow - especially if synchronous).

Altera_Forum · ‎01-11-2013

Thanks dsl.

Our use case is capturing video (up to 1080p30) and sending it to our TI SoC, one frame at a time.

The reason I want the FPGA to send much more data at once, such as 4,147,200 bytes, is that the SoC (the PCIe RC) then only has to write control data to the FPGA once.

If the FPGA burst-writes 512 bytes at a time, then sending one frame of 1080p30 YUV 4:2:2 data (4,147,200 bytes) to our SoC needs so many descriptors that the descriptor memory will not fit in the FPGA's on-chip memory.
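As a rough check of that descriptor count, here is a minimal Python sketch. The 32-byte descriptor size is an assumption for illustration, not a figure from this thread; substitute the size your DMA core actually uses.

```python
# Frame size and descriptor count for one 1080p YUV 4:2:2 frame
# when every descriptor moves at most 512 bytes.
WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 2        # YUV 4:2:2 is 16 bits per pixel
BURST_BYTES = 512          # burst size limit quoted from the documentation
DESCRIPTOR_BYTES = 32      # ASSUMPTION: illustrative descriptor size

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
# Ceiling division: a partial last burst still needs its own descriptor.
descriptors = -(-frame_bytes // BURST_BYTES)
descriptor_memory = descriptors * DESCRIPTOR_BYTES

print(f"frame size:        {frame_bytes} bytes")
print(f"descriptors:       {descriptors}")
print(f"descriptor memory: {descriptor_memory} bytes")
```

Under that assumption, one frame needs 8100 descriptors, which is roughly a quarter of a megabyte of descriptor storage.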

So we want to send much more data in one transfer.

Sorry for my bad English; I'm a beginner in the FPGA field.

I still have two questions:

1> You mean that the DMA/SGDMA can generate multiple burst writes to the PCIe IP.

If I program the DMA/SGDMA to send much more data, how does it know to split the big transfer into smaller burst writes (such as 512-byte writes)?

2> How does a DMA/SGDMA transfer actually proceed?

For example, the data path is FIFO---->DMA/SGDMA--->PCIe.

The operation is to transfer 512 bytes from the FIFO to PCIe.

First case: the DMA/SGDMA moves data into its internal FIFO until all of the data (512 bytes) is ready, and then transfers it to PCIe.

Second case: the DMA/SGDMA moves data from the FIFO to PCIe directly, pausing whenever the FIFO is empty, until all of the data (512 bytes) has been moved to PCIe.

Which case is right?

Sorry for so many naive questions.

Looking forward to any reply.

regards~

Altera_Forum · ‎01-11-2013

You need to do some throughput analysis. I'm not sure you'll actually manage to transfer data at that rate (seems high!) - and then manage to process it somewhere before the next frame arrives.

Thinks: 4MB at 30fps is 120MB/s - about gigabit ethernet speed.

Or, if you have a 120MHz clock (you won't run an FPGA much faster), that is about 1 clock per byte, or 4 clocks per 32-bit word.

And that is a frame average, you probably need to worry about the slightly higher mid-line pixel clock.
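To put exact numbers on that estimate, here is a small Python sketch. It uses the 1080p YUV 4:2:2 frame size quoted earlier in the thread (4,147,200 bytes) and the 120MHz clock assumed above; it is a frame average only and ignores blanking and the mid-line pixel rate.

```python
# Average data rate for 1080p30 YUV 4:2:2 and the clock budget at 120 MHz.
FRAME_BYTES = 1920 * 1080 * 2   # 4,147,200 bytes per frame
FPS = 30
CLOCK_HZ = 120_000_000          # clock rate assumed in the post above

bytes_per_second = FRAME_BYTES * FPS
clocks_per_byte = CLOCK_HZ / bytes_per_second
clocks_per_word = 4 * clocks_per_byte   # one 32-bit word is 4 bytes

print(f"average rate:    {bytes_per_second / 1e6:.1f} MB/s")
print(f"clocks per byte: {clocks_per_byte:.3f}")
print(f"clocks per word: {clocks_per_word:.3f}")
```

The exact average comes out slightly above 120MB/s, so the budget is a little under 1 clock per byte.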

The first thing to realise is that PCIe isn't really a bus protocol (like PCI) but much more like a communications protocol using HDLC frames.

A large PCIe write is split into multiple requests each of typically (but negotiated) 128 bytes, a small number of which can be outstanding at any one time. When the target has actually written the data it sends an ack packet back to the originator - which then knows that the transfer has completed and can send the next fragment.

(Actually it is a bit more complex than that!)

All the state engine work (etc) slows it all down way below the nominal speed of the PCIe link itself.

I don't know exactly how the Avalon PCIe master side works - we've only used the slave (master is a small ppc).

In order to generate a long PCIe request, the initiating PCIe block needs to know that its user (the DMA block in your case) is going to request another cycle. For writes it could be buffering data until the Avalon burst ends (reads are much more tricky).

Once it has decided it has enough data for a PCIe transfer, the PCIe transfer can be initiated. It can then look for more data for the next transfer.

Somewhere there needs to be a FIFO - to guarantee that the PCIe block can be fed data every clock (for whichever clock is relevant!).

I think this all means that the DMA transfer length can be much larger than 512 bytes, but you may need to add some kind of FIFO between the video source and the PCIe.

Possibly writing each video line to alternate memory blocks (4k each?) and transferring each in turn.

(You might need 4 blocks to handle jitter...)
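A toy Python model of that block count, under simplifying assumptions that are mine rather than anything measured: time advances in whole line periods, the video source fills exactly one block per period, and the DMA drains at most one full block per period unless it is stalled. `blocks_needed` and its argument are hypothetical names.

```python
def blocks_needed(stalled_periods):
    """Return the peak number of line blocks in use.

    stalled_periods: one bool per video line period, True when the
    DMA/PCIe side makes no progress during that period.
    """
    ready = 0   # full blocks waiting for the DMA (or being drained)
    peak = 1    # one block is always being filled by the video source
    for stalled in stalled_periods:
        # Blocks occupied during this period: the one being filled
        # plus every full block that has not been drained yet.
        peak = max(peak, 1 + ready)
        if not stalled and ready > 0:
            ready -= 1      # DMA finished draining one block
        ready += 1          # the block being filled is now full
    return peak


print(blocks_needed([False] * 4))                         # no stalls
print(blocks_needed([False, False, True, True, False]))   # two stalls in a row
```

With no stalls this gives the plain alternating (ping-pong) case of 2 blocks; two back-to-back stalled line periods push it to 4. In this model the backlog never shrinks, because the DMA drains at exactly the fill rate. A DMA that is faster than the video source would catch up again.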

Some of this would all be easier if there was a dma engine inside the PCIe block (which would take a PCIe address as its target), rather than the PCIe block being an Avalon slave and mapping ranges of Avalon address space to PCIe space.

Altera_Forum · ‎01-11-2013

Thanks very much for your detailed reply.

The maximum clock is 74.25 MHz on a 16-bit video bus, so the maximum data throughput is fixed.

For the throughput, I think it's fine: the DMA uses the 125 MHz PCIe clock and can move up to 64 bits per clock. In our case it moves 32 bits (4 bytes) per clock. (The video IP sends 32-bit YUYV color plane data.)
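The numbers behind that, as a short Python sketch. It only compares the peak video bus rate with the raw width of the DMA datapath; it is not an estimate of what the PCIe link itself will sustain, which the earlier reply warns is below the nominal speed.

```python
# Peak video input rate versus raw DMA datapath rate (Avalon side only).
VIDEO_CLOCK_HZ = 74_250_000   # 74.25 MHz pixel clock
VIDEO_BUS_BYTES = 2           # 16-bit video bus
DMA_CLOCK_HZ = 125_000_000    # 125 MHz PCIe application clock
DMA_WORD_BYTES = 4            # 32-bit datapath used in our design

video_peak = VIDEO_CLOCK_HZ * VIDEO_BUS_BYTES   # bytes per second
dma_raw = DMA_CLOCK_HZ * DMA_WORD_BYTES         # bytes per second
headroom = dma_raw / video_peak

print(f"video peak:   {video_peak / 1e6:.1f} MB/s")
print(f"DMA datapath: {dma_raw / 1e6:.1f} MB/s")
print(f"headroom:     {headroom:.2f}x")
```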

"A large PCIe write is split into multiple requests each of typically (but negotiated) 128 bytes, a small number of which can be outstanding at any one time."

128 bytes? According to the ug_pcie_guide document, Qsys supports burst writes of up to 512 bytes, and according to the PCIe spec the maximum is 4 KB. So where does 128 bytes come from?

"a small number of which": do you mean literally a small number, as when you suggest transferring one video line (4 KB) at a time? And if the burst write is large, such as 4 MB, can it not be outstanding?

"I don't know exactly how the Avalon PCIe master side works - we've only used the slave (master is a small ppc)."

By "master", do you mean the PCIe RC, or the transaction initiator?

By "slave", do you mean the PCIe EP, or the transaction completer?

In our case the FPGA acts as the PCIe EP and performs DMA writes to TI's DSP (the PCIe RC) after the SoC writes control data to the DMA engine, just like your "there was a dma engine inside the PCIe block".

Thanks again for your suggestion to transfer one video line at a time, but I am thinking of transferring 65535 bytes at a time instead, because one transfer per video line would cost more space and more PCIe cycles for the SGDMA descriptors.
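To compare the two options, a minimal Python sketch that counts descriptors per frame. `descriptors_per_frame` is a hypothetical helper, and it assumes each descriptor covers one contiguous transfer of at most the given length.

```python
FRAME_BYTES = 1920 * 1080 * 2   # one 1080p YUV 4:2:2 frame
LINE_BYTES = 1920 * 2           # one video line at 16 bits per pixel


def descriptors_per_frame(frame_bytes, bytes_per_descriptor):
    """Ceiling division: a partial last transfer still needs a descriptor."""
    return -(-frame_bytes // bytes_per_descriptor)


per_line = descriptors_per_frame(FRAME_BYTES, LINE_BYTES)
per_64k = descriptors_per_frame(FRAME_BYTES, 65535)

print(f"one descriptor per line:        {per_line}")
print(f"65535 bytes per descriptor:     {per_64k}")
```

So one transfer per line needs 1080 descriptors per frame, while 65535-byte transfers need 64.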

Yes, I use an Avalon-ST DC FIFO as a video buffer to avoid overflow.

Altera_Forum · ‎01-25-2013

I haven't read everything carefully, but I think the answer to your question is that "yes", if you are using Qsys, you can program the DMA to transfer a very large amount of data, and the PCIe core will split it up into smaller transfers that will work with the PCIe interface.
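That splitting can be sketched in a few lines of Python. This is only an illustrative model of the behaviour, not the Altera core's actual logic: `split_write` is a hypothetical helper, the 128-byte maximum payload is the (probable) figure mentioned earlier in the thread, and the 4 KByte boundary rule is the one quoted from the documentation.

```python
def split_write(start_addr, length, max_payload=128, boundary=4096):
    """Split one large write into (address, size) pieces.

    No piece is larger than max_payload bytes and no piece
    crosses a multiple of boundary bytes.
    """
    pieces = []
    addr = start_addr
    remaining = length
    while remaining > 0:
        to_boundary = boundary - (addr % boundary)
        size = min(remaining, max_payload, to_boundary)
        pieces.append((addr, size))
        addr += size
        remaining -= size
    return pieces


# One full 1080p YUV 4:2:2 frame starting at an aligned address.
frame = split_write(0, 4_147_200)
print(len(frame), "pieces for one frame")

# A 128-byte write that straddles a 4 KByte boundary becomes two pieces.
print(split_write(4096 - 64, 128))
```

In this model a whole frame turns into 32400 pieces of 128 bytes each, and the DMA only has to be programmed once for the full length.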