
logic utilization over 100%

Altera_Forum
Honored Contributor II

Hi, 

 

I am working on an OpenCL for FPGA design with the Stratix V PCIe385_n5 board. 

When I compiled my kernel with 

aoc -c kernel.cl -report 

the report estimated the logic utilization at 131% and warned that the design may not fit on my board. 

According to area.rpt: 

Total:                  LEs = 157243  FFs = 581682  RAMs = 612  DSPs = 8
Global_resources:       LEs = 5034    FFs = 9568    RAMs = 52   DSPs = 0
Const_resources:        LEs = 2258    FFs = 21264   RAMs = 116  DSPs = 0
LSU_resources:          LEs = 2136    FFs = 5513    RAMs = 49   DSPs = 0
FP_resources:           LEs = 0       FFs = 0       RAMs = 0    DSPs = 0
Local_mem_resources:    LEs = 0       FFs = 0       RAMs = 0    DSPs = 0
Reg_State_resources:    LEs = 34455   FFs = 358603  RAMs = 28   DSPs = 0
RAM_State_resources:    LEs = 1415    FFs = 1187    RAMs = 74   DSPs = 0
MrgBr_State_resources:  LEs = 64629   FFs = 128975  RAMs = 0    DSPs = 0
Other_State_resources:  LEs = 1647    FFs = 1647    RAMs = 0    DSPs = 0
Other_resources:        LEs = 7061    FFs = 3905    RAMs = 9    DSPs = 8
------------
LEs: 45.5513 %  FFs: 84.2529 %  RAMs: 30.3873 %  DSPs: 0.503145 %
Util: 131.032 %

The source code itself is not that complex, so I think the problem is the FF usage, since that number is extremely large. (Is this because I declared all my variables in private memory?) 

I am now creating new files that each contain only a certain part of the source code, just to measure the resource usage of that part. (Is this a good/correct way to locate the problem?) 

Any idea about what may be causing the problem is appreciated! 

Thanks in advance.
Altera_Forum
Honored Contributor II

Without seeing the algorithm I can't say for certain that it's the __private memory causing this, but it's a possibility. Your method of debugging the issue is sound; just keep in mind that if you break your kernel into pieces, the footprints of those pieces will not necessarily add up to the footprint of the kernel as a whole. 

 

Are you using a fixed work-group size, or do you know how large your work-group size will need to be? The compiler assumes a work-group size of 256, so if you don't need one that large, or you know it will be a fixed size, you can specify attributes to let the compiler know. The compiler will often create smaller hardware with hints like these.
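
For example, something along these lines (a sketch only; the kernel name and the size 64 are invented, not from your design). If the work-group size is always exactly 64:

__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void my_kernel(__global float *restrict data)
{
    // kernel body
}

If it varies but never exceeds 64, there is also the Altera-specific max_work_group_size hint:

__attribute__((max_work_group_size(64)))
__kernel void my_kernel(__global float *restrict data)
{
    // kernel body
}

Either way the compiler can size the work-group hardware for 64 work-items instead of the default 256.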
Altera_Forum
Honored Contributor II

Hi, BadOmen. 

Thank you for your reply! 

 

--- Quote Start ---  

Without seeing the algorithm I can't say for certain that it's the __private memory causing this, but it's a possibility. 

--- Quote End ---  

If that's the reason, will it help if I use local memory instead of private memory? I understand that private variables are stored in registers, but what is local memory really? Will it also use FFs to store variables? 

 

 

--- Quote Start ---  

Your method of debugging the issue is sound; just keep in mind that if you break your kernel into pieces, the footprints of those pieces will not necessarily add up to the footprint of the kernel as a whole. 

--- Quote End ---  

I'll keep trying, and I'm wondering whether the sum of the pieces tends to be bigger or smaller than the kernel as a whole? 

 

 

--- Quote Start ---  

Are you using a fixed work-group size, or do you know how large your work-group size will need to be? The compiler assumes a work-group size of 256, so if you don't need one that large, or you know it will be a fixed size, you can specify attributes to let the compiler know. The compiler will often create smaller hardware with hints like these. 

--- Quote End ---  

 

I didn't specify a work-group size or use the local work-item ID in my kernel, so I assume the work-group size was 256. I followed your advice and set the attribute: 

__attribute__((reqd_work_group_size(1,1,1))) 

However, the estimated logic utilization doesn't change. Can you tell me how this attribute affects the hardware usage? It doesn't replicate the compute unit or do anything like that, as num_compute_units and num_simd_work_items do, right? 

Besides, I've also tried the following (roughly as sketched below): 

1. resource-driven optimization (-O3), 

2. __attribute__((num_share_resources(16))) 

Neither of these worked. Maybe the source really does require that many resources. 
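
For reference, this is roughly what the two attempts looked like (simplified; my_kernel and its argument list are placeholders for my actual code):

aoc -c -O3 kernel.cl -report

__attribute__((reqd_work_group_size(1, 1, 1)))
__attribute__((num_share_resources(16)))
__kernel void my_kernel(__global const float *restrict in,
                        __global float *restrict out)
{
    // kernel body
}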

Thanks for your reply again.
Altera_Forum
Honored Contributor II

1) I don't typically choose my memory type based on resources. Private memory is for variables that only the work-item has visibility into; local memory is for variables that work-items within the same work-group can access. If my algorithm benefits from work-items sharing data, I use local memory; otherwise I use private memory, and I size my work-group accordingly to trade off resources against compute efficiency. In general, private memory tends to have more bandwidth than local memory, because each work-item has its own copy, but there are always exceptions to these rules depending on the algorithm. 
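
To make the distinction concrete, here is a minimal sketch (the kernel name, the size 64, and the access pattern are invented for illustration, not taken from your code):

__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void sum_neighbors(__global const float *restrict in,
                            __global float *restrict out)
{
    __local float tile[64];        // one copy, shared by the whole work-group
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    tile[lid] = in[gid];           // each work-item contributes one element
    barrier(CLK_LOCAL_MEM_FENCE);  // wait until every work-item has written

    float result = tile[lid] + tile[(lid + 1) % 64];  // result is private: one per work-item
    out[gid] = result;
}

On the FPGA, __local memory generally maps to on-chip RAM blocks rather than FFs, so moving large private arrays into local memory can shift pressure from FFs to RAMs, but only if your access pattern actually fits the shared model.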

 

2) I can't really answer that, since it's algorithm dependent. I wouldn't worry too much about it; just focus on making the smaller pieces even smaller, since that should have the same impact on the overall kernel. 

 

3) It's hard to explain, but there are times when this benefits both area and processing speed. It's algorithm specific, and since I haven't seen the kernel it's difficult for me to say why the utilization remained the same. By specifying a work-group size of (1,1,1) you are saying that each work-group has only one work-item. Typically you only do this when there is no need to share data or resources within a work-group. 

 

If you are able to share your code through a service request, I think the quality of the answers will improve; it's incredibly difficult to make suggestions like these without seeing the kernel, or at a minimum some code fragments. There is no rule of thumb that fits all kernels, so how you optimize a kernel is very much an algorithm-specific thing (regardless of what hardware you run OpenCL on).
Altera_Forum
Honored Contributor II

Hi, BadOmen. 

Thanks for your reply. I found the problem later: I had nested for loops and didn't unroll them at all. The throughput was pulled down at these bottlenecks, so a huge number of registers was being used to create FIFO queues to hold the waiting data. After unrolling the loops, the kernel works fine now. Thanks again.
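
For anyone who runs into the same thing, the change was essentially the following (simplified; N, M, a, b, and acc stand in for my real code, and the loop bodies are trimmed down):

Before:

for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        acc += a[i * M + j] * b[j];
    }
}

After:

for (int i = 0; i < N; i++) {
    #pragma unroll
    for (int j = 0; j < M; j++) {
        acc += a[i * M + j] * b[j];
    }
}

With the inner loop fully unrolled, the pipeline no longer stalls at that point, so the compiler does not need the deep register FIFOs it was inferring to buffer the waiting data.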
Altera_Forum
Honored Contributor II

Still, I have another question here about the usage of constant memory. 

According to the compiler's estimate, my kernel should have a throughput of 144M work-items/sec. However, the real performance is much worse than that, about 1/20 I think. I wonder whether this is because I used constant memory to store a read-only argument that holds a large amount of data (about 200 MB). According to the optimization guide, the constant cache is 16 KB, and if the constant data is bigger than that, I will suffer large latency. (Am I understanding this right?) If so, my only option is to use the __global const qualifier, right?
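
In other words, the change I have in mind is roughly this (my_kernel and table are placeholders for my actual code):

// current version: the argument goes through the 16 KB constant cache
__kernel void my_kernel(__constant float *restrict table,
                        __global float *restrict out)
{
    // kernel body
}

// alternative: plain reads from global memory, no constant cache involved
__kernel void my_kernel(__global const float *restrict table,
                        __global float *restrict out)
{
    // kernel body
}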