Solved: What causes OpenCL to insert arbitration for local memory accesses?

Björne2 · ‎02-14-2024

I know FPGA OpenCL is deprecated in favor of OneAPI, but I hope you can help me anyway. I've created a MWE of a kernel for which the compiler inserts arbitration:

__attribute__((uses_global_work_offset(0)))
__attribute__((max_global_work_dim(0)))
__kernel void
kmain(uint n_tics,
     __global const volatile uint * restrict dsts) {

    float frontier[100];
    #pragma disable_loop_pipelining
    for (uint i = 0; i < 100; i++) {
        frontier[i] = 0;
    }
    uint nqueue[100];
    uint nqueue_n = 20;
    for (uint t = 0; t < n_tics; t++) {
        for (uint i = 0; i < 100; i++) {
            float tmp = frontier[i];
            frontier[i] = 0;
        }

        for (uint j = 0; j < nqueue_n; j++) {
            uint src=nqueue[j];
            frontier[dsts[src]] += 50;
        }
    }
}

So first I reset all elements of frontier. Then the simulation loop starts and I read one element from frontier and clear it. Then I add 50 to the values at the indexes given by another variable. I know the kernel reads from uninitialized memory, but it's beside the point (I think). In the report aoc complains about "Potentially inefficient configuration" and I can see that it has inserted arbitration circuits (see screenshot).

So the question is why? And how can I fix this memory access pattern to be arbitration-free?

aikeu · ‎02-18-2024

Hi Björne2,

Maybe can refer to this link for OpenCL optimization related.

https://www.youtube.com/watch?v=1zGpN28mXN4

I will try to consult the team if there is any further info which may help.

Thanks.

Regards,

Aik Eu

View solution in original post

aikeu · ‎02-18-2024

Hi Björne2,

Maybe can refer to this link for OpenCL optimization related.

https://www.youtube.com/watch?v=1zGpN28mXN4

I will try to consult the team if there is any further info which may help.

Thanks.

Regards,

Aik Eu

Björne2 · ‎02-19-2024

Oh, the video gave me a hint. The compiler inserts redundant arbitration because the next iteration of the n_tics loop can start before the previous one has finished. So tmp = frontier[i] could be concurrent with frontier[dsts[src]] += 50... So the fix was to add #pragma disable_loop_pipelining to the n_tics loop to tell aoc that iterations of the outer loop can't be concurrent.

aikeu · ‎02-19-2024

Hi Björne2,

I did get some feedback from the team related to your OpenCL question, hope if will also help:

"

It’s arbitrating since because has 3 stores and 2 loads.

Some things to try:

1. Disable pipelining on loop at line 14 (n_tics) -> this should make the inner loops execute consecutively and in theory the LDs and STs can be shared instead of arbitrated.

2. don’t pre-load frontier. This will remove one of the ST sites and simplify the control flow. Add some logic to the loop at line 15 (i = 0:100) to detect if it’s the first iteration and re-use that store site to pre-load frontier.

"

Thanks.

Regards,

Aik Eu

aikeu · ‎02-25-2024

Hi Björne2,

I am closing the thread for now as we do not receive any response from you on the previous question/reply/answer that we have provided. Please login to ‘https://supporttickets.intel.com’, view details of the desire request, and post a feed/response within the next 15 days to allow me to continue to support you. After 15 days, this thread will be transitioned to community support. The community users will be able to help you on your follow-up questions.

Thanks.

Regards,

Aik Eu

What causes OpenCL to insert arbitration for local memory accesses?

Hardware| Optimization| SDK)

OpenCL (BSP