Intel® Quartus® Prime Software

for loop pipelined with NDRange

Altera_Forum
Honored Contributor II

When I use an NDRange kernel, 

if the kernel code contains a call to get_global_id(), 

the report will show that it is an NDRange kernel, so it won't be pipelined. 

Is there any way to pipeline an NDRange kernel? Will using compute units help? 

 

And how can I measure stall time? 

I have an (8,256,256) NDRange kernel with a local work-group size of (1,64,64), 

but the computation performance becomes very low when I try to increase the local size further. 

I think this is because, with 64x64 work-items, every work-item has to wait until all work-items in the same group have finished before the next 64x64 work-items can be launched. 

Is that correct? And how can I measure the time they spend waiting?
Altera_Forum
Honored Contributor II

NDRange kernels are pipelined, but the pipelining is at the thread level, not the loop-iteration level. The first step to speed up NDRange kernels is to use the SIMD attribute; you should start there. Furthermore, there will also be a limited amount of work-group pipelining; i.e. while threads from one work-group are in flight in a compute unit, threads from the next work-group can also enter the same compute unit. 
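
As a minimal sketch (the kernel body and the width of 4 are just for illustration): num_simd_work_items requires a fixed work-group size declared with reqd_work_group_size, and that size must be divisible by the SIMD width. 

__attribute__((reqd_work_group_size(64, 1, 1))) 
__attribute__((num_simd_work_items(4)))  // issue 4 work-items per clock cycle 
__kernel void scale(__global float *restrict data, float factor) 
{ 
    int i = get_global_id(0); 
    data[i] *= factor;  // datapath, loads and stores are all vectorized 4-wide 
} 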

 

There could be many reasons why your code is slow. I would say your local work-group size is pretty big as it is. You should pay careful attention to the way your local memory buffers are implemented; if you have too many non-consecutive accesses to a local buffer, the compiler will run out of Block RAMs to implement it and will instead try to share Block RAM ports between different accesses, which can result in stalling and low performance. You should also pay attention to global memory accesses and make sure they are consecutive, to allow coalescing. 
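
To illustrate coalescing (buffer and variable names are made up): when neighboring work-items touch neighboring addresses, the compiler can merge their accesses into one wide transaction; strided accesses cannot be merged. 

int i = get_global_id(0); 
float a = in[i];           // consecutive across work-items: can be coalesced 
float b = in[i * stride];  // strided: becomes many small, separate accesses 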

 

If you want to know stalling percentages, use the profiler.
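If I remember correctly, you enable it by compiling with the -profile flag of aoc (e.g. aoc -profile mykernel.cl); running the host program then writes a profile.mon file, which the profiler GUI reads to show the stall percentage of every load and store.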
Altera_Forum
Honored Contributor II

Thanks for the reply. 

So NDRange kernels are pipelined, but at the thread level, not the loop-iteration level. Does that mean the 64x64 work-items of a group execute at the same time, but each work-item executes like a normal C program, without the pipelining effect that a single work-item kernel gets? Is there any way to get a hybrid effect? 

And where can I check how many compute units I have? When I double the local work-group size anywhere from 1 to 8, the performance doubles accordingly; however, when I keep scaling up from 8 to 16, the performance stays locked at the level of 8 and only increases a little. 

 

And ReLU, for instance, is a very simple SIMD-friendly function. 

I use: 

int i = get_global_id(0); 

if (input[i] < 0) { input[i] = 0; } 

 

This is also much (100x) slower than using a single work-item kernel, like: 

for (int i = 0; i < neuron; i++) { 

    if (input[i] < 0) { input[i] = 0; } 

} 

 

How come this happens? Is pipelining at the thread level slower than pipelining at the loop-iteration level? Does the thread-level version waste a lot of time stalling?
Altera_Forum
Honored Contributor II

Increasing the work-group size will not have much of an effect on performance because it does not increase parallelism; it might just help bring the performance of the pipeline closer to its peak. You should use the SIMD attribute if you want to increase parallelism. The number of compute units is one by default; you can use the num_compute_units attribute to increase it. All these attributes are described in Altera's OpenCL SDK documents. 
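
As a rough sketch (the count of 2 is arbitrary, and I am reusing your ReLU example), the attribute goes directly on the kernel: 

__attribute__((num_compute_units(2)))  // duplicate the whole pipeline twice 
__kernel void relu(__global float *restrict input) 
{ 
    int i = get_global_id(0); 
    if (input[i] < 0) input[i] = 0; 
} 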

 

The two kernel snippets you have used should be implemented in a very similar fashion by the compiler and give very similar performance. If the performance is 100x different, I cannot explain it.
Altera_Forum
Honored Contributor II

So, by default, there is one compute unit. Does that mean that when I use a local work-group size of 64x64, the FPGA loads 64x64 work-items but does not execute them at the same time; instead they execute one after another, similar to a for loop, but in random order? 

Does this pipeline start a work-item only after the previous work-item has finished? Or do work-items execute partially overlapped, depending on the algorithm, just like the iterations of a pipelined for loop? 

 

And I am confused about SIMD versus compute units. I have read the Best Practices Guide. 

I know that when I set the number of compute units to 2, the compiler duplicates the kernel, so the hardware usage doubles. So if I want to increase parallelism, I can duplicate the kernel up to the resource limit of the FPGA. 

What is the difference between SIMD and compute units? Does SIMD also increase hardware usage? And what happens if I increase the SIMD width too much?
Altera_Forum
Honored Contributor II

In the absence of SIMD, two threads in a work-group will never be issued at the same time; rather, all the threads are pipelined one after another (i.e. partially overlapping) in an order that is determined at run time by the scheduler. So yes, it is similar to a for loop with random order. 

With SIMD, however, the pipeline is "widened", allowing multiple threads (as many as the SIMD width) to be issued at the same time, on top of the thread pipelining. With multiple compute units, the whole pipeline, the scheduler, and everything else is duplicated, allowing multiple work-groups to run in parallel. 

Because of this, using multiple compute units has a higher area footprint, and it also creates more ports to memory, which can have an adverse effect on memory performance. In contrast, SIMD has a smaller footprint and can potentially allow consecutive memory accesses to be coalesced into one bigger access, which improves memory performance. The difference between SIMD and num_compute_units is also described in the "Intel FPGA SDK for OpenCL Best Practices Guide", section 6.3.1.
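
To make the contrast concrete, here is a hedged sketch combining both (the widths are arbitrary): 

__attribute__((reqd_work_group_size(64, 1, 1))) 
__attribute__((num_simd_work_items(4)))  // widen ONE pipeline to issue 4 threads/cycle 
__attribute__((num_compute_units(2)))    // then duplicate that whole pipeline twice 
__kernel void scale(__global float *restrict data, float factor) 
{ 
    data[get_global_id(0)] *= factor; 
} 

Roughly speaking, SIMD pays in a wider datapath and wider (coalesce-friendly) memory ports, while num_compute_units pays for a complete second copy of everything, including a second set of ports to external memory.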

Altera_Forum
Honored Contributor II

Thank you for the reply again. 

I want to know: how can two threads in a work-group be issued at the same time? 

If my kernel code has no SIMD, will the execution be like: 

read(); --->WI 2 

cal(); --->WI 1 

write(); --->WI 0 

 

or 

--->WI 1...waiting for WI 0 to finish 

read();  

cal();  

write(); --->WI 0 

 

 

And if I set SIMD to 2, will the execution be like: 

read(); --->WI 3, WI 4 

cal(); --->WI 2, WI 3 

write(); --->WI 0, WI 1 

 

or 

--->WI 2, WI 3...waiting 

read();  

cal();  

write(); --->WI 0, WI 1 

 

Or will SIMD only increase the performance of reading global memory, like: 

read(); --->WI 2, WI 3...waiting 

cal(); --->WI 1 

write(); --->WI 0
Altera_Forum
Honored Contributor II

This: 

 

--- Quote Start ---  

 

read(); --->WI 2 

cal(); --->WI 1 

write(); --->WI 0 

--- Quote End ---  

 

 

And this: 

 

--- Quote Start ---  

 

And if I set SIMD to 2, will the execution be like: 

read(); --->WI 3, WI 4 

cal(); --->WI 2, WI 3 

write(); --->WI 0, WI 1 

--- Quote End ---  

 

 

Though the latter will need a minor correction as follows: 

read(); --->WI 4, WI 5 

cal(); --->WI 2, WI 3 

write(); --->WI 0, WI 1 

 

SIMD vectorizes all operations including memory accesses and compute. 

 

Please note that all work-items will have to wait at barriers in NDRange kernels, and hence the pipeline will not extend from one barrier region to another. I am not exactly sure how barriers are handled in this case, but either a long enough delay is inserted into the pipeline to make sure no work-item passes the barrier before all the other ones have reached it, or each region between two barriers is mapped to a different pipeline, with the pipeline before each barrier being fully flushed before any of the work-items enters the pipeline after it.
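
As an illustration (buffer names invented), a barrier splits a kernel into two such regions: 

__kernel void two_regions(__global float *restrict in, 
                          __global float *restrict out, 
                          __local float *scratch) 
{ 
    int lid = get_local_id(0); 
    scratch[lid] = in[get_global_id(0)];  // region 1: stage data in local memory 

    barrier(CLK_LOCAL_MEM_FENCE);  // no work-item continues past this point 
                                   // until all work-items in the group arrive 

    // region 2: each work-item can now safely read its neighbor's value 
    out[get_global_id(0)] = scratch[lid] + scratch[(lid + 1) % get_local_size(0)]; 
} 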