Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)
Atomic operation on global memory

Altera_Forum
Honored Contributor II

Atomic operations on global memory seem to misbehave in my code.

 

The value of global_offset[0] is 508 after the FPGA computation; it should be 512.

__attribute__((reqd_work_group_size(128, 1, 1)))
__kernel void global_atomic_loop(__global int *global_offset, int num_elem) // num_elem is 512 in my host code
{
    const unsigned int local_id = get_local_id(0);

    if (local_id == 0)
    {
        global_offset[0] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    int i;
    for (i = local_id; i < num_elem; i += 128)
    {
        int addr = atomic_add(&global_offset[0], 1);
        global_offset[addr + 1] = i;
    }
}
7 Replies
Altera_Forum
Honored Contributor II

Neither the Nios CPU nor the Avalon bus has support for atomic read-modify-write cycles.

Not only that, there is also no data-cache snooping.

So you'll have to try a lot harder to synchronise between different masters.
Altera_Forum
Honored Contributor II

Thanks. Does the OpenCL SDK do the extra work needed to support atomic operations on global memory?

Altera_Forum
Honored Contributor II

What are you actually running and where? 

If you think you need barriers or atomic operations, what exactly are the two (or more) 'items' accessing the memory?
Altera_Forum
Honored Contributor II

What is the NDRange size? This line could (and most likely will) be executed by multiple work-items concurrently: global_offset[addr + 1] = i; Also, because of that loop, you don't know in which order the work-items will reach the atomic add (not that you should assume any ordering when using OpenCL in general), so reading the result of the atomic add and using it to index into memory that way doesn't look safe to me. I suspect a data hazard is causing the issue you are seeing, so I recommend refactoring your code to avoid race conditions like these.

 

OpenCL handles atomics internally, so it doesn't need to rely on the interconnect to provide this. Anywhere you perform an atomic operation in your kernel, special hardware is placed to ensure no data hazards occur for that particular operation. If you use multiple atomic operations, they all operate independently of each other and, as a result, may need additional synchronization between them depending on whether data is shared.
Altera_Forum
Honored Contributor II

I am running the above OpenCL code on a Terasic DE5-Net to test atomic operations on the FPGA. The local/global NDRange size is 128, num_elem is 512, and only one work-group exists in our design.

global_offset[0] turns out to be 508 (it should be 512). However, the same code works well on a GPU.

 

About the lines "int addr = atomic_add(&global_offset[0], 1); global_offset[addr + 1] = i;": I want to keep the atomic_add sequence across the 128 work-items. I do not think it is unsafe, since the variable "addr" is private. It is normal for the code to have contention on the atomic, and the hardware should guarantee correctness. Further, are there any rules to follow for atomic operations on global memory to ensure correct results?

 

BTW: (1) if I eliminate "global_offset[addr + 1] = i", global_offset[0] is correct: 512; (2) with atomic_add on local memory, the result is also correct.

 

About "OpenCL handles atomics internally so it doesn't need to rely on the interconnect to provide this": if two work-groups perform an atomic operation on the same global address at the same time, what guarantees correctness?
Altera_Forum
Honored Contributor II

GPUs schedule work-items in batches called warps or wavefronts, depending on the vendor, and sometimes this scheduling masks synchronization issues. That said, after looking at your code again and at the behavior you see when you remove 'global_offset[addr + 1] = i;', this doesn't look like a synchronization issue (my mistake; I was a little lost about what your code was attempting to do the first time I looked at it). I recommend opening a service request and attaching this kernel and the host application so that Altera can take a look at this.

 

If two work-items from different work-groups access the same location using an atomic operator, the OpenCL hardware is supposed to take care of that. It would be interesting to see whether atomic_inc works in the problematic scenario.
Altera_Forum
Honored Contributor II

Thanks. I have opened a service request.
