Hi,

When I compile an OpenCL design with aocl, the report shows that a single load consumes 13 RAMs and a single store consumes 16 RAMs. For example:

```c
__kernel void top_kernel(__global restrict volatile int *a,
                         __global restrict volatile int *b,
                         __global restrict volatile int *c)
{
    int i;
    for (i = 0; i < 10000; ++i)
        c[i] = a[i] + b[i]; // consumes 13x2 + 16 RAMs (two loads, one store)
}
```
The RAMs are used as FIFO buffers to minimize the negative effect of stalls caused by off-chip memory accesses, which can have variable latency. Some extra RAMs are also used per off-chip memory access as a private cache, but since your kernel arguments are declared volatile, that cache will not be instantiated.
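As a sketch of the two declaration styles (the kernel names and bodies here are hypothetical, just to illustrate the qualifier): with `volatile` on a `__global` pointer the compiler skips the private cache for that access; without it, a small on-chip cache may be generated per global pointer.

```c
// Cache NOT generated: global pointers are declared volatile.
__kernel void no_cache(__global volatile int *restrict a,
                       __global volatile int *restrict c)
{
    for (int i = 0; i < 10000; ++i)
        c[i] = a[i];
}

// Cache MAY be generated: no volatile, so the compiler is free to
// build a cached load-store unit for reads from 'a'.
__kernel void with_cache(__global const int *restrict a,
                         __global int *restrict c)
{
    for (int i = 0; i < 10000; ++i)
        c[i] = a[i];
}
```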
--- Quote Start ---
The RAMs are used as FIFO buffers to minimize the negative effect of stalls caused by off-chip memory accesses which can have variable latency. There are also some extra RAMs used per off-chip memory access as a private cache, but since your kernel arguments have been defined as volatile, that cache will not be used.
--- Quote End ---

Thanks for your reply. Can we disable or reduce this RAM usage?
Other than disabling the cache using volatile, there is nothing else that can be done. Is there a specific reason you want to do this? Removing those FIFOs will have a very large negative impact on performance.
--- Quote Start ---
Other than disabling the cache using volatile, there is nothing else that can be done. Is there a specific reason you want to do this? Removing those FIFOs will have a very large negative impact on performance.
--- Quote End ---

I also saw that using volatile can disable the cache. I want to apply coarse-grained parallelism to the external memory accesses, e.g.:

```c
#pragma unroll
for (i = 0; i < 64; ++i)
    for (j = 0; j < 1000; ++j)
        ... = a[i * 1000 + j];
```

This consumes a lot of RAMs (64 * 16), while an Arria 10 only has about 2700 RAMs in total. After that, we have little room left for other optimizations because of the lack of RAMs. Thanks.
Considering the low memory bandwidth on current FPGA boards and the high overhead of contention for off-chip memory, you should actually avoid parallel memory accesses and instead unroll your memory accesses in a way that lets them be coalesced into bigger ones, minimizing the number of ports to external memory. These ports, as you have noticed, consume a lot of FPGA area.

In your code example, you are unrolling the i loop, but the memory accesses are not contiguous along the i dimension, so you get 64 separate memory ports. Apart from the very high area usage, this lowers your effective memory bandwidth to near zero due to constant contention between all those ports. However, if you partially unroll the j loop 64 times, the accesses are contiguous, so you get a few large coalesced ports with very low area overhead, and you will get very close to the theoretical memory bandwidth.
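A minimal sketch of the suggested pattern (the kernel name, the reduction, and the output argument are hypothetical; the point is moving the unroll to the contiguous j dimension). Note that coalescing works best when the trip count is a multiple of the unroll factor:

```c
__kernel void top_kernel(__global const int *restrict a,
                         __global int *restrict out)
{
    int sum = 0;
    for (int i = 0; i < 64; ++i) {
        // Partially unroll the INNER loop: a[i * 1000 + j] is contiguous
        // in j, so the 64 unrolled loads per iteration can be coalesced
        // into one wide burst to external memory instead of 64 ports.
        #pragma unroll 64
        for (int j = 0; j < 1000; ++j)
            sum += a[i * 1000 + j];
    }
    out[0] = sum;
}
```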
--- Quote Start ---
Considering the very low memory bandwidth on current FPGA boards and the very high overhead of contention for off-chip memory accesses, you should actually avoid having parallel memory accesses and instead, unroll your memory accesses in a way that they will be coalesced into bigger ones, to minimize the number of ports to external memory. However, if you partially unroll the j loop 64 times, since the accesses are contiguous, you will get a few large coalesced ports with very low area overhead, and you will get very close to theoretical memory bandwidth.
--- Quote End ---

I totally agree with your suggestion. I have tried it and it works. Thanks.