Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

Channel problem

Altera_Forum
Honored Contributor II
1,402 Views

Hi  

 

There are two kernel functions that communicate with each other using channels.

 

channel int ch;

__kernel void producer(...) // N threads
{
    int sum = get_global_id(0);
    write_channel_altera(ch, sum);
}

__kernel void consumer(__global int *out)
{
    int s;
    s = read_channel_altera(ch);
    *out += s;
}

 

But if the producer doesn't know how many times it will transfer data to the consumer, how do I set the termination condition between the producer and the consumer?

 

Thanks
9 Replies
Altera_Forum
Honored Contributor II
481 Views

You have to keep track of how many transfers occur in the producer and consumer yourself. So if you launch N threads for your producer, you should also launch N threads for your consumer, since every producer thread sends one item and exits.
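The N-writes-matched-by-N-reads rule above can be sketched conceptually in plain Python, with a queue standing in for the hardware FIFO channel. This is an illustrative simulation only, not the Intel FPGA OpenCL API; the names `ch`, `producer`, and `consumer` mirror the kernels in the first post.

```python
# Conceptual simulation: N producer threads each write one item,
# so the consumer must perform exactly N reads to drain the FIFO.
import threading
import queue

N = 8
ch = queue.Queue()          # stands in for the blocking channel
result = {"sum": 0}

def producer(gid):
    ch.put(gid)             # like write_channel_altera(ch, gid)

def consumer():
    # Exactly N reads: one per producer thread, as the reply says.
    for _ in range(N):
        result["sum"] += ch.get()   # like read_channel_altera(ch)

threads = [threading.Thread(target=producer, args=(g,)) for g in range(N)]
t = threading.Thread(target=consumer)
for th in threads:
    th.start()
t.start()
for th in threads:
    th.join()
t.join()
print(result["sum"])   # 0+1+...+7 = 28
```

If the consumer loop ran fewer than N times, data would be left sitting in the queue; more than N times, and the extra `get()` would block forever, which is the deadlock behavior described later in the thread.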

Altera_Forum
Honored Contributor II
481 Views

Thanks for your reply.

Even if I use a pipe for communication between the producer and consumer, do I still need to know how many times the producer will transfer data to the consumer?
Altera_Forum
Honored Contributor II
481 Views

Yes. The concepts of pipes and channels are very similar. Essentially, each write should be accompanied by a read. If you produce something from the producer kernel, you would want it to be consumed; otherwise, the data will just sit in the FIFO. If you are using blocking channels or pipes, the write will essentially stall until that data is consumed.

To put it another way: when you launch the kernel from the host side using NDRange, you need to specify the global work size. That is the number of times your kernel is going to run, i.e. the number of threads you send to the kernel. Each thread runs the kernel once, each run sends one item, and so if your global work size is N, you will send N times. So you should always know how many times the producer sends data on the channel.
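The blocking vs. non-blocking distinction mentioned above can also be sketched with a Python queue (a hedged simulation, not the OpenCL API: `read_nb` here is a made-up helper that mimics how a non-blocking channel read reports success or failure instead of stalling).

```python
# Blocking vs. non-blocking reads on a FIFO, modeled with queue.Queue.
import queue

ch = queue.Queue()

def read_nb(q):
    # Non-blocking read: return (success, value) immediately,
    # instead of stalling like a blocking read would on an empty FIFO.
    try:
        return True, q.get_nowait()
    except queue.Empty:
        return False, None

ok, _ = read_nb(ch)
print(ok)            # False: nothing has been written yet

ch.put(42)           # a producer write
ok, value = read_nb(ch)
print(ok, value)     # True 42
```

A blocking `ch.get()` on the empty queue would simply wait, which is the "stall until that data is consumed/produced" behavior described above.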
Altera_Forum
Honored Contributor II
481 Views

Thanks!!

So a blocking channel/pipe means the producer writes data into the FIFO and the consumer takes the data out in order, while a non-blocking read lets the consumer continue even when no data is available.

So can't I write the code as below: use a memory fence to ensure the data has been transferred to the FIFO, then send a final flag to the consumer to finish execution?

--------------------------------------- 

channel int ch1;
channel int ch2;

__kernel void producer(...) // NDRange, N threads
{
    int gid = get_global_id(0);
    int sum = gid;
    write_channel_altera(ch1, sum);
    mem_fence(CLK_CHANNEL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

    if (gid == 0) {
        int end = 1;
        write_channel_altera(ch2, end);
    }
}

__kernel void consumer(__global int *out) // Task
{
    int s, end = 0;
    for (;;) {
        s = read_channel_altera(ch1);
        *out += s;
        end = read_channel_altera(ch2);
        if (end == 1)
            return;
    }
}

----------------------------------------------- 

Regards,
Altera_Forum
Honored Contributor II
481 Views

The code will work, but it's not efficient. Essentially, you don't need to send the end flag at all. The code in the first post will do what you want efficiently if you want a producer that produces data and a consumer that accumulates it. The kernel should only be concerned with the actual acceleration and processing of the data; how much data is to be processed should be determined on the host side and known at run time. You already know how many global work items there are for the producer, right? In other words, you know the producer has N threads. If you know the number of producer threads, then N should also be the number of consumer threads. Otherwise, with more or fewer consumer threads, there will be too much or too little data, and the kernel won't terminate.

Altera_Forum
Honored Contributor II
481 Views

channel float read;

__attribute__((reqd_work_group_size(NUM_THREAD, 1, 1)))
__kernel void vectorAdd(__global const float * restrict x,
                        __global const float * restrict y,
                        int numElements) // numElements = globalSize = 1024000
{
    // get index of the work item
    int index = get_global_id(0);
    int lid = get_local_id(0);
    if (index >= numElements) {
        return;
    }
    float sum;
    sum = x[index] + y[index];
    write_channel_altera(read, sum);
}

__kernel void sum(__global float * restrict c) // Task
{
    float sum_add, temp = 0.0f;
    c[0] = 0.0f;
    int cnt = 0;
    for (cnt = 0; cnt < globalSize; cnt++) {
        sum_add = read_channel_altera(read);
        temp += sum_add;
        *c = temp;
    }
}

 

The code above does not seem efficient. In my case (DE5-NET), it takes ~40 ms to finish the kernel, and at the step (*c = temp) there is an 8-cycle stall (compiled with -g to get the report):

---------------------------------------

Successive iterations launched every 8 cycles due to:

Data dependency on variable temp

Largest Critical Path Contributor:

96%: Fadd Operation

---------------------------------------

But the reduction example Altera provides does the same thing, and the document says that method is more efficient. I don't understand what the problem with this code is.

I also want to ask: how much private memory can each work-item use?

 

Thanks.
Altera_Forum
Honored Contributor II
481 Views

In one sense, that seems about right. The kernel takes one thread at a time, meaning it takes 1024000 work items and passes them into the kernel one by one in a pipelined manner. 40 ms seems to be around the expected time with those kernels. There are some things that could be changed to improve efficiency, but in my opinion the main limiting factor right now is access to global memory rather than the computation in the kernels.

Which Altera reduction example are you referring to?

In terms of private memory, my take is that there is really no per-work-item limit; the limit on private memory is the limit of the device itself. Meaning you can use as much private memory as you want, given that the amount does not exceed the on-chip memory of your FPGA.
Altera_Forum
Honored Contributor II
481 Views

Yes, I think the bottleneck is the last step (*c = temp).

If I remove "*c = temp", the kernel finishes in ~10 ms.

So is the code I posted previously the most efficient way to do the reduction over each index?

 

reference : https://www.altera.com/content/dam/altera-www/global/en_us/pdfs/products/software/opencl/how-to-do-reductions.pdf
Altera_Forum
Honored Contributor II
481 Views

I'm not too familiar with what you're trying to do, but in terms of efficiency, if you're just trying to find the total sum of two vectors, using two kernels structured the way you have might not be the best idea. It is more efficient to use one kernel than to split the problem into two small kernels like you have. You can structure one kernel the same way as the example, except instead of local += a[i], you do local += a[i] + b[i]. Then you can apply other tweaks such as loop unrolling, SIMD, etc.

Furthermore, if efficiency is your concern for this application, you have to take into account writing to and reading from global memory. Since that is most of what you're doing, it might be more efficient on the CPU than on the FPGA. If you use the FPGA, you have to count the computation time as well as the time it takes to write the data into the FPGA's global memory from the CPU, read the data into the compute units on the FPGA, and write the result back to the CPU.
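The single-accumulator dependency behind the 8-cycle initiation interval, and why the reduction document's multiple-partial-sums (shift-register) technique helps, can be sketched in plain Python. This is a hedged illustration only: `COPIES` and `reduce_with_partial_sums` are made-up names, and on the FPGA the benefit comes from breaking the loop-carried fadd dependency, which Python cannot demonstrate directly.

```python
# Why several partial accumulators help: on the FPGA, 'temp += x'
# makes each loop iteration wait for the previous fadd result
# (the reported 8-cycle II). Rotating among COPIES independent
# partial sums removes that iteration-to-iteration dependency.
COPIES = 8  # roughly the fadd latency in cycles

def reduce_with_partial_sums(a, b):
    partial = [0.0] * COPIES
    for i in range(len(a)):
        # consecutive iterations update different copies, so they
        # no longer depend on each other's addition result;
        # a[i] + b[i] fuses the vectorAdd step into the same kernel
        partial[i % COPIES] += a[i] + b[i]
    return sum(partial)  # final combine of the partial sums

a = [float(i) for i in range(16)]
b = [1.0] * 16
print(reduce_with_partial_sums(a, b))  # sum(range(16)) + 16 = 136.0
```

In the OpenCL kernel the `partial` array would become a shift register of accumulators, as in the reduction document, so the compiler can pipeline the loop with an initiation interval of 1.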