
FPGA-to-HPS SDRAM vs. HPS-to-FPGA

Altera_Forum
Honored Contributor II

I'm a relatively new EE (about 5 years of EE experience), but I have only a few months of hardware design experience. I have a design question that I'm hoping someone can shed some light on and teach me a little :-)

 

I'm taking video in over gigabit Ethernet. I will run some video processing on it (in the FPGA fabric) and then pass the processed video to the HPS for further processing (tracking, etc.). Because the gigabit Ethernet comes in on the HPS side, my questions are as follows:

 

(1) would it be best to store the video in HPS DDR3 and then access it through the FPGA-to-HPS SDRAM interface, or

(2) pass the video over the HPS-to-FPGA bridge and store it in FPGA DDR3, thus freeing up the HPS SDRAM for other uses?

 

Thank you.
Altera_Forum
Honored Contributor II

(3) The best option is to receive the video into the HPS on-chip RAM (the 64 KB region at 0xFFFF0000) and send it to FPGA on-chip memory through DMA, process it quickly, and send it back!

Write operations on AMBA are much faster than reads, so the "master" should use DMA to send one buffer and then signal the "slave" queue; on the way back they swap roles...

The HPS program lives in its own DDR, and everything can work without any FPGA DDR at all!

 

(4) The FPGA side has a "read" pipe for the video: the HPS can write directly to those addresses, the data falls straight into a FIFO, and no DDR read/write is needed. Processing starts at a fixed FIFO fill level, and backpressure can be used.

On the way back, the FPGA must write the data into HPS DDR/on-chip RAM and signal a queue in another region, because the HPS cannot "see" the FPGA's write operations.

The less DDR you use, the higher the overall throughput and the simpler the scheme!
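 

A minimal user-space sketch of the "HPS writes straight into an FPGA FIFO" idea, assuming the FIFO's write port and fill-level register are exposed through the HPS-to-FPGA bridge; the register offsets, FIFO depth, and buffer size below are hypothetical and would have to match your own Qsys address map:

/* Sketch only: push one video buffer from the HPS into an FPGA FIFO
 * through the HPS-to-FPGA bridge (0xC0000000 on Cyclone V SoC).
 * FIFO_DATA_OFFSET, FIFO_LEVEL_OFFSET and FIFO_DEPTH_WORDS are
 * hypothetical and must match your Qsys design. */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define H2F_BRIDGE_BASE   0xC0000000u   /* HPS-to-FPGA bridge window     */
#define FIFO_DATA_OFFSET  0x0000u       /* hypothetical FIFO write port  */
#define FIFO_LEVEL_OFFSET 0x0010u       /* hypothetical fill-level CSR   */
#define FIFO_DEPTH_WORDS  1024u         /* hypothetical FIFO depth       */

static void push_buffer(volatile uint32_t *fpga, const uint32_t *buf, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        /* crude backpressure: stall while the FIFO is nearly full */
        while (fpga[FIFO_LEVEL_OFFSET / 4] >= FIFO_DEPTH_WORDS - 4)
            ;
        fpga[FIFO_DATA_OFFSET / 4] = buf[i];
    }
}

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    volatile uint32_t *fpga = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, H2F_BRIDGE_BASE);
    uint32_t line[720] = {0};           /* one video line, for example   */

    push_buffer(fpga, line, 720);
    munmap((void *)fpga, 0x1000);
    close(fd);
    return 0;
}

The DMA variant described above would simply replace the CPU copy loop with the HPS DMA controller doing the writes.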
Altera_Forum
Honored Contributor II

I apologize in advance: I'm going to repeat some of what you wrote to make sure I understand what you are saying. I can tell English must not be your primary language :-). No worries though, this helps me talk through it:

 

 

--- Quote Start ---  

(3) The best option is to receive the video into the HPS on-chip RAM (the 64 KB region at 0xFFFF0000) and send it to FPGA on-chip memory through DMA, process it quickly, and send it back! Write operations on AMBA are much faster than reads, so the "master" should use DMA to send one buffer and then signal the "slave" queue; on the way back they swap roles... The HPS program lives in its own DDR, and everything can work without any FPGA DDR at all!

--- Quote End ---  

 

 

So if I am understanding you correctly, neither of the options I proposed is the "best" approach. Instead, I should store the video in HPS on-chip RAM and send it directly to a FIFO on the FPGA for processing. The external DDR3 would therefore not be needed; the FPGA would just watch the FIFO and start processing once data is present (?). Also, would the DMA live in the FPGA fabric, or is there a way to use the HPS DMA controller through a driver?

 

 

--- Quote Start ---  

(4) The FPGA side has a "read" pipe for the video: the HPS can write directly to those addresses, the data falls straight into a FIFO, and no DDR read/write is needed. Processing starts at a fixed FIFO fill level, and backpressure can be used. On the way back, the FPGA must write the data into HPS DDR/on-chip RAM and signal a queue in another region, because the HPS cannot "see" the FPGA's write operations. The less DDR you use, the higher the overall throughput and the simpler the scheme!

--- Quote End ---  

 

 

I think this is just an extension of (3): the FPGA would have a FIFO and would monitor its status. When there is video, it processes it and then writes the processed video back to the 64 KB HPS on-chip RAM (0xFFFF0000).

 

 

Do you have a simple example of how to implement this using (1) Quartus (Qsys) and (2) Linux device drivers? It doesn't have to be perfect; I just need some help getting started (especially with the Linux driver side). I'm starting to understand Qsys a little better, but it is still somewhat of black magic for me.

 

Thank you!
Altera_Forum
Honored Contributor II

Yes, and it's not my 2nd language, and not my 3rd either... :) It's Rus-Eng, learned without any native speakers nearby, only from thousands of pages of documentation and a Lingvo dictionary... :)

I'm a programmer on the HPS side; the FPGA part is far from my area, but I can propose an architecturally optimal scheme, a "view from above".

It all depends on how much video has to sit in the FPGA's processing pipeline at any one moment. If it is only so many lines and kilobytes of video, could you skip the intermediate FPGA DDR and use only on-chip RAM?

The HPS should receive the UDP packets into RAM and send them to the FPGA region with its own DMA; the FPGA "sees" the data, writes it to its own memory (preferably on-chip), processes it in the pipeline, uploads it back to the HPS, and signals for each portion. Or is your data flow not that big?

I have not looked at the Linux side and its drivers: Altera's documentation does not contain good manuals or direct links for it, and the boot time, resource usage, and general overhead in Linux are very large.

In Linux you can use the standard socket API and the Ethernet drivers are included, but that means wading through many, many sources... and there is only one "HelloWorld\n" example, with a long, long DS-5 connection :)

You could (1) begin by simply receiving video and sending it back out over Ethernet, then (2) pass it through the FPGA without processing, then (3) add simple processing, then (4) complex processing...

I am exploring the simple bare-metal route and have some examples of using the hardware; however, there is no support for the USB and Ethernet MAC in HWLIB yet (hopefully there will be in the meantime), so I will try to port bare-metal drivers for them from the Linux sources :)

In Qsys I am a layman. In this forum you can see many cries for help about reads and writes between different memories from different sides; I don't know whether they were solved or not, or whether there are downloadable sources and step-by-step instructions...

AMBA can carry any transaction between any devices in the system; you only need to know the destination addresses as seen from the master's side!

...This forum very, very much needs a "For Stupid SoC Beginners!" subforum! :)
Altera_Forum
Honored Contributor II

Thank you for your insight, WitFed! I'd be interested in the bare-metal work you are doing, as future development will most likely involve a bare-metal implementation (less overhead).

 

At any rate, this should get me going for now. I will post back here as questions come up. 

 

 

--- Quote Start ---  

This forum very, very much needs a "For Stupid SoC Beginners!" subforum! :)

--- Quote End ---  

 

 

I definitely agree on this!
Altera_Forum
Honored Contributor II

I didn't read all the posts but I would use the FPGA-to-SDRAM interface. So when the packets come in you can tell the FPGA logic where they live in SDRAM and let your video hardware access it directly. Just make sure cache coherency is maintained because you'll have multiple masters in the system touching this data (EMAC, FPGA, MPU) and accesses through the FPGA-to-SDRAM interface are *not* cache coherent with the MPU L1 and L2 caches.
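If the packet buffers live in a kernel driver, one common way to handle that coherency requirement is the Linux streaming DMA API, which cleans/invalidates the caches for you. A sketch only, not anything from this thread; 'dev', 'buf', and 'len' are placeholders for whatever your driver already has:

#include <linux/dma-mapping.h>

/* Sketch: make a packet buffer safe for the FPGA to read through the
 * FPGA-to-SDRAM port. 'dev', 'buf' and 'len' are placeholders. */
static dma_addr_t share_with_fpga(struct device *dev, void *buf, size_t len)
{
    /* cleans the CPU caches for this range and returns the physical
     * address the FPGA master should use */
    dma_addr_t phys = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

    if (dma_mapping_error(dev, phys))
        return 0;

    /* call dma_unmap_single(dev, phys, len, DMA_TO_DEVICE) once the
     * FPGA is finished with the buffer */
    return phys;
}

The ACP port is the other option if you want the FPGA's accesses to snoop the L1/L2 caches instead, but that is a separate trade-off.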

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

I didn't read all the posts but I would use the FPGA-to-SDRAM interface. So when the packets come in you can tell the FPGA logic where they live in SDRAM and let your video hardware access it directly. Just make sure cache coherency is maintained because you'll have multiple masters in the system touching this data (EMAC, FPGA, MPU) and accesses through the FPGA-to-SDRAM interface are *not* cache coherent with the MPU L1 and L2 caches. 

--- Quote End ---  

 

 

Sounds good, but how do you get the physical address of something in Linux? For example, if I allocate some memory in Linux I get back a pointer such as 0x65223000. I only have 1 GB of RAM attached, so it is clearly not mapped flat from 0x0 to 0x3FFFFFFF. I assume there is some memory management going on, but how do I translate 0x65223000 into the actual address in SDRAM?
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Sounds good, but how do you get the physical address of something in Linux? For example, if I allocate some memory in Linux I get back a pointer such as 0x65223000. I only have 1 GB of RAM attached, so it is clearly not mapped flat from 0x0 to 0x3FFFFFFF. I assume there is some memory management going on, but how do I translate 0x65223000 into the actual address in SDRAM?

--- Quote End ---  

 

 

 

That's correct: if you malloc() in Linux you get a virtual address back, so if you want to pass that address to the FPGA so that logic in there can access the same data, you'll need to convert the virtual address to a physical address. The other thing to keep in mind is that a large buffer may end up spread over multiple physical regions because memory is paged; the size of these pages is determined by the MMU settings. On top of that, pages can potentially be moved to new physical locations, which requires "pinning" to ensure they don't move around.

 

Unfortunately I'm a hardware guy, so I don't know which API you should use. What I recommend is looking at examples that have FPGA logic accessing locations in the HPS to see how those drivers handle this; most of them should be up on rocketboards: http://rocketboards.org/ Another recommendation is to create a new post in the Linux section of the SoC forum, since the folks there will know what you should do.
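 

Purely for reference, and something to double-check with the software folks since it is not from this thread: the usual way drivers sidestep the paging and pinning problem is to ask the kernel for a physically contiguous, non-moving buffer up front with dma_alloc_coherent(), which hands back both a kernel virtual address and a physical address in one call. A rough sketch, where the device pointer and the 1 MiB size are placeholders:

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/sizes.h>

/* Sketch: allocate one contiguous, pinned buffer and get both the CPU
 * virtual address and the physical address the FPGA should use. 'dev'
 * is the driver's platform device; SZ_1M is just an example size. */
static int alloc_shared_buffer(struct device *dev)
{
    dma_addr_t fpga_addr;   /* physical address for the FPGA master */
    void *cpu_addr;         /* kernel virtual address for the HPS   */

    cpu_addr = dma_alloc_coherent(dev, SZ_1M, &fpga_addr, GFP_KERNEL);
    if (!cpu_addr)
        return -ENOMEM;

    /* pass 'fpga_addr' to the FPGA logic (e.g. through a CSR), and
     * free with dma_free_coherent() when done */
    return 0;
}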
Altera_Forum
Honored Contributor II

Ok, thanks for that advice. I've dropped the Linux memory to 768 MB, leaving the top 256 MB of the 1 GB for my FPGA to access. I then map that block from within Linux using mmap on /dev/mem:

fd = open("/dev/mem", O_RDWR | O_SYNC);
ptr = mmap(NULL, 0x10000000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0x30000000);

 

This works, except that access to this memory seems MUCH slower than accessing memory within the Linux-managed address space. So it's not a physical connection problem, but something about the way Linux handles memory accessed via this mmap. mmap should just give me a virtual pointer to the physical address, and everything else is accessed via virtual pointers anyway, so why should there be any difference at all?
Altera_Forum
Honored Contributor II

I googled around and it sounds like mmap() of /dev/mem creates uncacheable mappings when the physical address falls outside the memory Linux manages as system RAM. Since you told the kernel it only has 768 MB, the region from 0x3000_0000 up is outside main memory from its perspective. So I think the slow access is caused by a stream of non-cacheable accesses being issued in your case.

 

Since you are trying to set aside a large region of memory for the FPGA to access, it might be more manageable to allocate a large contiguous buffer, map it to physical memory, and access it using DMAs in the FPGA and cacheable accesses from the processor. I have no clue how to do that, but the software folks in this forum probably do. I do know it's possible, because that's typically how framebuffers are set up in main memory, and it's also what the OpenCL drivers do when you target SoC devices.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

 

Since you are trying to set aside a large region of memory for the FPGA to access, it might be more manageable to allocate a large contiguous buffer, map it to physical memory, and access it using DMAs in the FPGA and cacheable accesses from the processor. I have no clue how to do that, but the software folks in this forum probably do.

--- Quote End ---  

 

 

Have a look at this example for a starting point: http://rocketboards.org/foswiki/view/projects/myfirstmodule 

 

It is a driver (a loadable kernel module) that you load using "insmod" in the terminal. The module allocates an area of memory using kmalloc(). The key property of kmalloc() is that it always allocates physically contiguous memory.

 

You can read the physical address that is allocated at "/sys/bus/platform/drivers/mydriver/buffer_base_phys"  

 

Then the FPGA can access this area using the FPGA2HPS SDRAM port starting at the address found above. 
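 

For orientation, the heart of that kind of module boils down to something like the sketch below. The buffer name and size are placeholders, and the sysfs attribute plumbing that exposes buffer_base_phys is left out; see the rocketboards page for the full version:

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/io.h>

/* Sketch: kmalloc() returns a physically contiguous buffer, and
 * virt_to_phys() gives the address the FPGA2HPS SDRAM master should
 * use. BUF_SIZE and the variable names are placeholders. */
#define BUF_SIZE (256 * 1024)

static void *mydrv_buf;
static phys_addr_t mydrv_buf_phys;

static int __init mydrv_init(void)
{
    mydrv_buf = kmalloc(BUF_SIZE, GFP_KERNEL);
    if (!mydrv_buf)
        return -ENOMEM;

    mydrv_buf_phys = virt_to_phys(mydrv_buf);
    pr_info("mydrv: buffer at physical address %pa\n", &mydrv_buf_phys);
    return 0;
}

static void __exit mydrv_exit(void)
{
    kfree(mydrv_buf);
}

module_init(mydrv_init);
module_exit(mydrv_exit);
MODULE_LICENSE("GPL");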
