Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)
16606 Discussions

How can I direct the OpenCL SDK compiler to reduce kernel logic utilization?

Altera_Forum
Honored Contributor II
3,507 Views

Hello friends, 

 

I am working on implementing a video processing design on an FPGA using OpenCL. The reports generated for my design show 40% logic utilization and a throughput of 23 Mworkitems/s. With this configuration, the throughput is much higher than the application needs (we only require 3 Mworkitems/s). We assume there is no point in having an overly fast design if the FPGA utilization is too high. 

 

The solution would be to direct the compiler to reduce the logic utilization by lowering the throughput (for example, 30% logic utilization at 5 Mworkitems/s would be fine for our design). But when using the "--util" threshold option (aoc -c kernels.cl --util 30 --report), there is no effect: the logic utilization stays constant and the throughput does not decrease.  

 

So, what is the way to reduce logic utilization, even at the cost of throughput? Any suggestions? 

 

Thanks
0 Kudos
10 Replies
Altera_Forum
Honored Contributor II
436 Views

After looking at the programming/optimization guides, it looks like you should try -O3 --util 30. Another possibility would be to prevent loop unrolling by specifying a smaller unroll factor, possibly even 1 to prevent any unrolling, e.g., #pragma unroll 1.
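
As a sketch (the kernel below is hypothetical, just to show where the pragma goes), capping the unroll factor at 1 looks like this:

```c
// Hypothetical kernel: #pragma unroll 1 tells aoc to build a single
// copy of the loop body rather than replicating it in hardware.
__kernel void accumulate(__global const float *restrict in,
                         __global float *restrict out)
{
    float sum = 0.0f;
    #pragma unroll 1
    for (int i = 0; i < 64; i++)
        sum += in[get_global_id(0) * 64 + i];
    out[get_global_id(0)] = sum;
}
```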

0 Kudos
Altera_Forum
Honored Contributor II
436 Views

Hi sean.settle, 

 

I don't think that will solve my issue. There are no optimization attributes in my kernel (no unrolling, no vectorization, no replication), so the compiler presumably already sets these optimization values to 1.  

 

However, I want the compiler to use the trade-off between performance and logic utilization to reduce my logic utilization.
0 Kudos
Altera_Forum
Honored Contributor II
436 Views

From the wording in the programming/optimization guide, it looks like --util 30 will only work when used with the -O3 flag. Please try it and see if that works.

0 Kudos
Altera_Forum
Honored Contributor II
436 Views

There is another version of the optimization guide coming that will discuss this more but here are some attributes you can try: 

 

max_share_resources (x) 

num_share_resources (y) 

 

max_share_resources tells the compiler to attempt to reuse common portions of the compute unit if it *does not* affect the throughput of the kernel 

 

num_share_resources tells the compiler to attempt to reuse common portions of the compute unit *regardless* if it affects the throughput of the kernel 

 

By the sounds of it you will probably want to use num_share_resources. This attribute gets explored when you run -O3, but you can apply it manually to avoid having to recompile if you already have a specific performance target in mind. There is another attribute called "max_unroll_loops(x)" where you can put a limit on how much unrolling occurs within the kernel (this is also explored with the -O3 flag).
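
These are kernel attributes; a hypothetical kernel showing the syntax (as described in the SDK programming guide, with a placeholder body) might look like:

```c
// Hypothetical kernel: num_share_resources(4) asks the compiler to
// time-share common datapath logic up to 4 ways, even at a throughput
// cost; max_share_resources(N) instead only bounds what -O3 explores.
__attribute__((num_share_resources(4)))
__kernel void video_filter(__global const float *restrict in,
                           __global float *restrict out)
{
    int gid = get_global_id(0);
    out[gid] = in[gid] * 0.5f;  // placeholder body
}
```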
0 Kudos
Altera_Forum
Honored Contributor II
436 Views

 

--- Quote Start ---  

From the wording in the programming/optimization guide, it looks like --util 30 will only work when used with the -O3 flag. Please try it and see if that works. 

--- Quote End ---  

 

 

Hi sean.settle, 

 

I tried that compile option ("-O3 --util"), but it does not give the expected results: when I use a logic utilization threshold that is lower than the utilization given in the aoc report, the --util flag never decreases the logic utilization. Do you have a way to explain that?
0 Kudos
Altera_Forum
Honored Contributor II
436 Views

 

--- Quote Start ---  

There is another version of the optimization guide coming that will discuss this more but here are some attributes you can try: 

 

max_share_resources (x) 

num_share_resources (y) 

 

max_share_resources tells the compiler to attempt to reuse common portions of the compute unit if it *does not* affect the throughput of the kernel 

 

num_share_resources tells the compiler to attempt to reuse common portions of the compute unit *regardless* if it affects the throughput of the kernel 

 

By the sounds of it you will probably want to use num_share_resources. This attribute gets explored when you run -O3, but you can apply it manually to avoid having to recompile if you already have a specific performance target in mind. There is another attribute called "max_unroll_loops(x)" where you can put a limit on how much unrolling occurs within the kernel (this is also explored with the -O3 flag). 

--- Quote End ---  

 

 

Hi BadOmen, 

 

I have already tried using the max_share_resources(x) and num_share_resources(y) attributes with my design. Reusing common portions of the hardware involves factoring the datapath of the compute unit. 

 

But what is very surprising is that when using the max_share_resources(x) and num_share_resources(y) attributes with my design, my throughput decreased but the logic utilization stayed constant. How can that be explained? 

 

P.S.: There are no "num_simd_work_items" or "num_compute_units" attributes in the kernel, and no unrolling pragmas.
0 Kudos
Altera_Forum
Honored Contributor II
436 Views

I suspect what happened is that num_share_resources found a small candidate for logic sharing, but the change in resources was so minor that it doesn't make much of a difference in the overall design. In general, any time you share hardware there is a small logic penalty to implement the sharing logic, so if that sharing logic has the same footprint as the logic being shared, you can run into results like you have seen.  

 

Are you declaring a reqd_work_group_size or max_work_group_size attribute by any chance? If not, I would consider using one of them if possible, since you can typically save resources with them because the kernel hardware will be tailored to what you need. If possible I would use reqd_work_group_size, since that will result in the smallest and fastest hardware possible because the hardware only needs to handle a single work-group size. Some applications only know the amount of work at runtime, but in those cases you can often still use reqd_work_group_size and just pad/discard unneeded results.
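
A hypothetical kernel carrying that attribute, so the generated hardware only ever has to support one work-group shape:

```c
// Hypothetical kernel: the pipeline only needs to handle 64x1x1
// work-groups, which typically shrinks the control logic.
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void scale(__global const float *restrict in,
                    __global float *restrict out)
{
    int gid = get_global_id(0);
    out[gid] = in[gid] * 2.0f;
}
```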
0 Kudos
Altera_Forum
Honored Contributor II
436 Views

 

--- Quote Start ---  

I suspect what happened is that num_share_resources found a small candidate for logic sharing but the change in resources was so minor that in the overall design it doesn't make much of a difference. In general any time you share hardware there is a small logic penalty to implement the sharing logic so if that sharing logic has the same footprint as the logic being shared itself then you could run into results like you have seen.  

 

Are you declaring a reqd_work_group_size or max_work_group_size attribute by any chance? If not I would considering using one of them if possible since you can typically save resources when using them because the kernel hardware will be tailored to what you need. If possible I would use reqd_work_group_size since that will result in the smallest and fastest hardware possible because the hardware will only need to handle a single work group size. Some applications only know the amount of work at runtime but in those cases you can often use reqd_work_group_size and just pad/discard unneed results. 

--- Quote End ---  

 

 

I was not using "reqd_work_group_size", but what I noticed when adding the "reqd_work_group_size" attribute and running aoc --report is that this attribute has no effect on the size of the generated hardware. For reqd_work_group_size(2,2,1) and reqd_work_group_size(1080,720,1) the logic utilization stayed constant. 

 

I also think this could reduce the amount of hardware, but ultimately it is probably the number of operations in the kernel that defines the number of pipeline stages, and therefore the amount of hardware generated. So, if you have a complex kernel, the compiler will generate a large amount of hardware. 

 

So, assuming that "num_share_resources" is the right attribute for making the compiler use the utilization/performance trade-off, the way to reduce logic utilization with it is to have a large number of code blocks that can be shared, so that sharing makes a difference in the overall design. I wonder how this can be done in C code.
0 Kudos
Altera_Forum
Honored Contributor II
436 Views

Often the reqd_work_group_size attribute has an impact on on-chip memory utilization, since the bigger the work-group size, the larger the on-chip RAM footprint typically needs to be to handle all those work-items. 

 

For now num_share_resources is the most appropriate attribute to use, but it relies on identical portions of the compute unit being present. One way to ensure that there is similar functionality is to code auxiliary functions (sub-functions). If you call the auxiliary function from multiple places in the kernel, then adding the num_share_resources attribute will hopefully cause the compiler to share that function's hardware throughout the kernel instead of creating multiple copies (i.e. inlining). 
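
A hypothetical sketch of that idea: one helper called from two places, so num_share_resources has identical logic it can fold into a single copy:

```c
// Hypothetical helper: both call sites below use the same datapath,
// which num_share_resources can time-share instead of duplicating.
float clamp01_scale(float x, float s)
{
    float y = x * s;
    return (y > 1.0f) ? 1.0f : ((y < 0.0f) ? 0.0f : y);
}

__attribute__((num_share_resources(2)))
__kernel void blend(__global const float *restrict r,
                    __global const float *restrict g,
                    __global float *restrict out)
{
    int gid = get_global_id(0);
    out[gid] = clamp01_scale(r[gid], 0.7f) + clamp01_scale(g[gid], 0.3f);
}
```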

 

Do you have any expensive operators in your kernel? Some of the higher level trig functions can become fairly big in hardware so perhaps that is something that can be addressed. Also if you have any calculations that are redundant it would make more sense to calculate those on the host not only for a compute time savings but hardware savings as well. For example if you had something like this: 

 

__kernel void my_kernel(......, float n) 

{ a[get_global_id(0)] = b[get_global_id(0)] * log(n) * c[get_global_id(0)]; } 

 

Then you should calculate log(n) on the host instead of having each work-item perform the same calculation, and just pass the value in as a kernel argument.
0 Kudos
Altera_Forum
Honored Contributor II
436 Views

Thanks BadOmen, I understand the methodology you suggested. I will rewrite my kernels in this way very soon and let you know how performance is impacted. 

 

Cheers
0 Kudos
Reply