Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

How does the producer, consumer, and manager strategy in the programming guide work?

Altera_Forum
Honored Contributor II

On page 43 of the Intel FPGA SDK for OpenCL Programming Guide, a producer, consumer, and manager strategy is mentioned. 

There is also reference code for the producer. 

 

__kernel void __attribute__((task)) producer( 
    __global const int* restrict src, 
    __global volatile int* restrict shared_mem, 
    const int iterations) 
{ 
    int base_offset; 
    for (int gid = 0; gid < iterations; gid++) { 
        int lid = 0xff & gid; 
        if (lid == 0) { 
            base_offset = read_channel_intel(req); 
        } 

        shared_mem[base_offset + lid] = src[gid]; 
        mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_CHANNEL_MEM_FENCE); 

        if (lid == 255) { 
            write_channel_intel(c, base_offset); 
        } 
    } 
} 

 

 

I searched around, but I didn't see any detailed example code of this strategy. 

 

As I understand it, shared_mem is used as the "channel". 

It seems to require the producer and consumer to run loops with an identical trip count, "iterations". 

However, a simple regular OpenCL channel should be able to do the same thing, so I don't see any reason for using this design. 

 

For instance, suppose the producer does some filtering: it discards data larger than 10 and sends the rest to the consumer. The loop count in the consumer is then not fixed. 

In this case, the basic on-chip channel doesn't work, but the producer, consumer, and manager scheme can't be applied either: since the producer sends no feedback signal to the manager, the manager can't decide whether the consumer has finished all the processing. 

 

In general, I have two questions. 

1) Is there a way to do on-chip communication between two kernels in OpenCL when the amount of data communicated is not determined at compile time? Take the simple filter as an example: Kernel 0 filters the input data stream and extracts the data that is larger than 10, and Kernel 1 takes the output of Kernel 0 and does some processing. 

 

2) Will the producer, consumer, and manager scheme work for the filter example? If not, when would it be used in place of the basic on-chip-channel-based design? 

 

 

Any suggestions will be appreciated. 

 

Regards, 

Cheng Liu
5 Replies
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

However, a simple regular OpenCL channel should be able to do the same thing, so I don't see any reason for using this design. 

--- Quote End ---  

 

 

If you need a shared buffer that is too big to fit on-chip, you can use this design because here, the shared buffer is stored in global memory. If, however, you can get the job done with a small buffer and the latency between the writes from producer and reads from consumer is small, then a standard on-chip channel can do the job just fine. 

 

 

--- Quote Start ---  

1) Is there a way to do on-chip communication between two kernels in OpenCL when the amount of data communicated is not determined at compile time? Take the simple filter as an example: Kernel 0 filters the input data stream and extracts the data that is larger than 10, and Kernel 1 takes the output of Kernel 0 and does some processing. 

--- Quote End ---  

 

 

Yes, you can use non-blocking channels. These channels return a flag that tells you whether the non-blocking read/write succeeded. You can wrap your channel operation in a while(1) loop and use the flag, or even the data received from the channel, to break out of the loop. You can find several basic examples in Altera's documents. 
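As a rough sketch of the idea (assuming the SDK's non-blocking built-in read_channel_nb_intel — older SDK versions spell it read_channel_nb_altera — and a sentinel value of -1 that is assumed never to occur in the real data):

```c
// Hypothetical consumer kernel: keeps polling the channel until it
// receives an agreed-upon sentinel value marking the end of the stream.
channel int ch __attribute__((depth(16)));

__kernel void __attribute__((task)) consumer(__global int* restrict dst) {
    int idx = 0;
    while (true) {
        bool valid = false;
        int data = read_channel_nb_intel(ch, &valid);
        if (valid) {
            if (data == -1)      // sentinel: producer has finished
                break;
            dst[idx++] = data;   // data count unknown at compile time
        }
    }
}
```

The consumer's trip count never needs to be known at compile time; it simply runs until the sentinel arrives.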

 

 

--- Quote Start ---  

2) Will the producer, consumer, and manager scheme work for the filter example? If not, when would it be used in place of the basic on-chip-channel-based design? 

--- Quote End ---  

 

 

Similar to above, you can also use non-blocking channels with this type of design where the shared buffer is in global memory.
Altera_Forum
Honored Contributor II

Thank you very much for the reply. 

I just moved from Xilinx HLS to Altera OpenCL, and I got stuck on this problem. 

 

 

--- Quote Start ---  

If you need a shared buffer that is too big to fit on-chip, you can use this design because here, the shared buffer is stored in global memory. If, however, you can get the job done with a small buffer and the latency between the writes from producer and reads from consumer is small, then a standard on-chip channel can do the job just fine. 

 

 

Yes, you can use non-blocking channels. These channels return a flag that tells you whether the non-blocking read/write succeeded. You can wrap your channel operation in a while(1) loop and use the flag, or even the data received from the channel, to break out of the loop. You can find several basic examples in Altera's documents. 

--- Quote End ---  

 

 

Yes, I did something similar. I inserted an END_OF_FLAG in the data stream, so I can use the END_OF_FLAG to decide when processing has ended. 

It works when the array size is small, but it stalls when the array size grows larger than a number around 300000~40000, in both simulation and implementation. 

I will double-check today and post the code next time. Maybe I missed something critical. 

 

By the way, are you referring to the basic examples in this link? 

https://www.altera.com/products/design-software/embedded-software-developers/opencl/developer-zone.html 

 

The four basic examples are: hello world, vector addition, multi-thread vector operation (C = A + B, C = A * B), and OpenCL library (wrapping an RTL design with OpenCL). 

It seems none of them has a non-deterministic amount of communication between kernels, as they all work on regular data. If you happen to know the exact example, could you give me a clue? Much appreciated. 

 

 

--- Quote Start ---  

Similar to above, you can also use non-blocking channels with this type of design where the shared buffer is in global memory. 

--- Quote End ---  

 

 

Again thank you for the reply.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Yes, I did something similar. I inserted an END_OF_FLAG in the data stream, so I can use the END_OF_FLAG to decide when processing has ended. 

It works when the array size is small, but it stalls when the array size grows larger than a number around 300000~40000, in both simulation and implementation. 

I will double-check today and post the code next time. Maybe I missed something critical. 

--- Quote End ---  

 

Did you use blocking channel operations or non-blocking ones? If blocking, the behavior is expected: channels have a specific depth, so even if you are reading less than you are writing, your program will execute correctly for small streams but will eventually deadlock for large ones. Since you also observe the deadlock in simulation, you almost certainly have a balancing problem; i.e. you are reading less or more than you are writing to the same channel. This will not be hard to debug in the emulator by counting the amount of data that is read and written. It should not happen with a correctly implemented design using non-blocking channels. 
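For example, a debugging sketch along these lines (assuming the *_nb_altera built-ins used elsewhere in this thread; the printf output is only meaningful in emulation, and the channel and kernel names here are made up for illustration):

```c
// Emulation-only balance check: count every successful channel operation
// on both ends, print the totals, and compare them.
channel int dbg_ch __attribute__((depth(16)));

__kernel void __attribute__((task)) dbg_producer(__global const int* restrict src,
                                                 const int num) {
    int writes = 0;
    for (int i = 0; i < num; i++) {
        bool ok = false;
        do { ok = write_channel_nb_altera(dbg_ch, src[i]); } while (!ok);
        writes++;                          // count successful writes only
    }
    printf("producer: %d writes\n", writes);
}

__kernel void __attribute__((task)) dbg_consumer(__global int* restrict dst,
                                                 const int num) {
    int reads = 0;
    while (reads < num) {                  // expected count known here for the check
        bool ok = false;
        int data = read_channel_nb_altera(dbg_ch, &ok);
        if (ok) dst[reads++] = data;       // count successful reads only
    }
    printf("consumer: %d reads\n", reads);
}
```

If the two printed totals disagree, or the consumer never reaches num, the stream is unbalanced, and the deadlock for large inputs follows from the channel filling up.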

 

 

 

--- Quote Start ---  

By the way, are you referring to the basic examples in this link? 

https://www.altera.com/products/design-software/embedded-software-developers/opencl/developer-zone.html 

--- Quote End ---  

 

No, I was talking about the Intel FPGA SDK for OpenCL Programming Guide. You can find some basic code examples in Sections 5.4.5.2.1 and 5.4.5.3.1 for non-blocking channel operations. I don't think any of Altera's OpenCL example designs use non-blocking channels.
Altera_Forum
Honored Contributor II

Thanks for the suggestion. 

 

I implemented a design that filters a stream of data using non-blocking channels, as you suggested. 

(Basically, the design scans the input data and selects the data that is larger than 10; the selected data is then multiplied by a constant, and finally the result is written back to memory.) 

It works as expected. I have attached the kernel code below in case people on the forum are interested. 

The code also seems a bit complex, so I would appreciate a more efficient implementation if you have one. 

 

A minor question: 

When I use flag_valid as the end condition in k1 and k2, some of the data at the tail is not handled. Basically, this indicates that when flag_valid is captured in k1, some data in ch01 has not yet been processed. Even though I tried adding memory fences to the code, the result remains wrong. I don't really understand this part. 

 

 

channel int ch01 __attribute__((depth(16))); 
channel int ch12 __attribute__((depth(16))); 
channel int k0_end __attribute__((depth(4))); 
channel int k1_end __attribute__((depth(4))); 

// Read data from DDR 
__kernel void __attribute__((task)) k0( 
    __global const int* restrict src, 
    const int num) 
{ 
    int i; 
    for (i = 0; i < num; i++) { 
        bool success = false; 
        do { 
            success = write_channel_nb_altera(ch01, src[i]); 
        } while (!success); 
    } 
    //printf("%d data is sent to ch01\n", i); 
    //mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_CHANNEL_MEM_FENCE); 
    write_channel_altera(k0_end, 1); 
} 

// Filtering 
__kernel void k1( 
    const int threshold, 
    const int cons) 
{ 
    bool flag_valid = false; 
    bool data_valid = false; 
    int flag = 0; 
    int data; 
    int count = 0; 
    int total = 0; 
    while (true) { 
        data = read_channel_nb_altera(ch01, &data_valid); 
        if (data_valid) { 
            total++; 
            if (data > threshold) { 
                data = data * cons; 
                count++; 
                // write to channel12 
                bool success = false; 
                do { 
                    success = write_channel_nb_altera(ch12, data); 
                } while (!success); 
            } 
        } 
        int tmp = read_channel_nb_altera(k0_end, &flag_valid); 
        if (flag_valid) flag = tmp; 
        if (flag == 1 && data_valid == false) { 
            //mem_fence(CLK_CHANNEL_MEM_FENCE); 
            write_channel_altera(k1_end, 1); 
            break; 
        } 
    } 
    //printf("%d data go through the filter\n", total); 
    //printf("%d data is changed.\n", count); 
} 

// Write data to DDR 
__kernel void __attribute__((task)) k2( 
    __global int* restrict dst) 
{ 
    bool flag_valid = false; 
    bool data_valid = false; 
    int flag = 0; 
    int data; 
    int idx = 0; 
    while (true) { 
        data = read_channel_nb_altera(ch12, &data_valid); 
        // write data to DDR 
        if (data_valid) { 
            dst[idx] = data; 
            idx++; 
        } 
        int tmp = read_channel_nb_altera(k1_end, &flag_valid); 
        if (flag_valid) flag = tmp; 
        if (flag == 1 && data_valid == false) { 
            //mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_CHANNEL_MEM_FENCE); 
            break; 
        } 
    } 
    //printf("%d data is written to memory.\n", idx); 
} 

 

 

Cheng Liu
Altera_Forum
Honored Contributor II

Correct me if I'm wrong: you are saying that in the code you posted, even with the fences uncommented, the k0_end flag is received and the k1 kernel jumps out of the while loop before all data from the "ch01" channel has been processed? 

 

If this is the case, I can think of a scenario where this could happen. Remember that channel reads/writes might not happen every clock. Since the pipeline in the "k1" kernel is longer, the distance between reading "ch01" and "k0_end" in kernel "k1" can be longer than the delay between writing "ch01" and "k0_end" in the "k0" kernel. Hence, it is possible that at a specific cycle, when the "ch01" channel is empty but the write loop in "k0" has not finished yet, the read from "ch01" in kernel "k1" fails and "data_valid" becomes false. By the time the pipeline in kernel "k1" reaches the read from "k0_end", kernel "k0" has written the last data to "ch01" and the end flag to "k0_end"; hence, kernel "k1" sees the exit flag without processing the data that was written to "ch01" after the last read attempt. 

 

This is a concurrency issue and cannot be fixed using fences. However, it is very easy to fix by using only one channel instead of two; i.e. instead of sending your exit flag via a separate channel, send it through the same channel that you use for your data. This way, it is impossible for any data to remain in the channel after the exit flag is received. You will, however, need to choose a value for the exit condition that can never occur in your data, so that you can differentiate it from actual data.
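As a minimal sketch of that single-channel scheme (assuming a sentinel value, here 0x80000000, that never occurs in the real data, and the *_altera channel built-ins used elsewhere in this thread):

```c
// Sketch: the exit flag travels through the same channel as the data,
// so no data can remain behind it when it is received.
#define END_OF_STREAM 0x80000000  // assumed to never appear in real data

channel int ch01 __attribute__((depth(16)));
channel int ch12 __attribute__((depth(16)));

__kernel void __attribute__((task)) k0(__global const int* restrict src,
                                       const int num) {
    for (int i = 0; i < num; i++)
        write_channel_altera(ch01, src[i]);
    write_channel_altera(ch01, END_OF_STREAM);  // sentinel sent last
}

__kernel void __attribute__((task)) k1(const int threshold, const int cons) {
    while (true) {
        int data = read_channel_altera(ch01);   // blocking reads now suffice
        if (data == END_OF_STREAM) break;       // everything before it was read
        if (data > threshold)
            write_channel_altera(ch12, data * cons);
    }
    write_channel_altera(ch12, END_OF_STREAM);  // forward the sentinel to k2
}

__kernel void __attribute__((task)) k2(__global int* restrict dst) {
    int idx = 0;
    while (true) {
        int data = read_channel_altera(ch12);
        if (data == END_OF_STREAM) break;
        dst[idx++] = data;
    }
}
```

With the sentinel in-band, blocking channel operations are enough, and the separate k0_end/k1_end channels disappear entirely.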