Question Regarding Buffer Management for AOCL Kernels

Altera_Forum · ‎07-07-2015

Hello everyone,

Under the section "Use Models of AOCL Channels Implementation" of the AOCL programming guide, it says to transfer large data messages between kernels, buffer management could be used. The implementation example given used "__global volatile"as memory type. I am wondering what kind of hardware does this kind of memory maps to? Is it implemented using DDR, same as the normal global memory? Or is it implemented using FIFO, same as the channel?

Also, what is the benefit for using buffer management, comparing to global memory or multiple channels? Does it offers higher throughput or less latency?

Thanks!

Altera_Forum · ‎07-07-2015

BTW: Why is the 3rd management kernel necessary at all? Is it possible to implement cyclic channel access between two kernels without use of this buffer management? Or is there is any negative implication for doing so? For example, why can't I just implementation something like this?

# pragma OPENCL EXTENSION cl_altera_channels : enable

channel float16 feed_forward;

channel float16 feed_back;

__kernel void producer ( ... ){

... Transfer data from global to local memory ...

for (int i=0; i<iteration; i++){

... Computations ...

write_channel_altera(feed_forward, data_in_local_memory);

data_in_local_memory = read_channel_altera(feed_back);

}

__kernel void consumer ( ... ){

for (int i=0; i<iteration; i++){

data_in_local_memory = read_channel_altera(feed_forward);

... Computations ...

write_channel_altera(feed_back, data_in_local_memory);

}

... Transfer data from local back to global memory ...

}

Altera_Forum · ‎07-07-2015

--- Quote Start ---

Hello everyone,

Under the section "Use Models of AOCL Channels Implementation" of the AOCL programming guide, it says to transfer large data messages between kernels, buffer management could be used. The implementation example given used "__global volatile"as memory type. I am wondering what kind of hardware does this kind of memory maps to? Is it implemented using DDR, same as the normal global memory? Or is it implemented using FIFO, same as the channel?

Also, what is the benefit for using buffer management, comparing to global memory or multiple channels? Does it offers higher throughput or less latency?

Thanks!

--- Quote End ---

In this case, I believe data is written to global memory. Essentially the producer stores the data in global memory then sends a token to the consumer that indicates where the data was written in global memory and to go get the data. In terms of benefits, it has the benefit of concurrency. If you're just using global memory, then the host has to handle the movement of data from one kernel to the other which can be slow since the host will have to enqueue read the result from global memory then enqueue it for the consumer kernel. In terms of channels, if there is a large amount of data, instead of creating a really large buffer, writing it to essentially a shared memory region is beneficial because now the producer doesn't have to wait on the consumer if the fifo is full.

--- Quote Start ---

BTW: Why is the 3rd management kernel necessary at all? Is it possible to implement cyclic channel access between two kernels without use of this buffer management? Or is there is any negative implication for doing so? For example, why can't I just implementation something like this?

#pragma OPENCL EXTENSION cl_altera_channels : enable

channel float16 feed_forward;

channel float16 feed_back;

__kernel void producer ( ... ){

... Transfer data from global to local memory ...

for (int i=0; i<iteration; i++){

... Computations ...

write_channel_altera(feed_forward, data_in_local_memory);

data_in_local_memory = read_channel_altera(feed_back);

}

__kernel void consumer ( ... ){

for (int i=0; i<iteration; i++){

data_in_local_memory = read_channel_altera(feed_forward);

... Computations ...

write_channel_altera(feed_back, data_in_local_memory);

}

... Transfer data from local back to global memory ...

}

--- Quote End ---

So with the above approach, if the write and read to the channels are not in sync then contention could occur. Meaning if the consumer has a expensive computation, then the producer stalls because it's waiting on the consumer to read from the channel with the data. With the buffered management, data is written to shared memory which essentially is in the range of GB. Even with buffered channels, you can't hold that much data. This becomes fairly important if you're doing real-time processing. If the producer stalls, then you'll be dropping information waiting for the consumer to open up.

Hope this helps.

Altera_Forum · ‎07-07-2015

Thank you for the help okebz! So basically in buffer management implementation, the channel is used as a tool for synchronization between kernels, whereas the transfer of data is actually done via global memory. The bandwidth and latency of the volatile global memory should be the same as the normal global memory right?

Also, about the declaration of the volatile global memory in the host, should it be declared the same way as the normal global memory? How should the manager kernel be written so that it could know "which regions in memory are free for producer to use"? How could consumer kernel "release" this shared memory? I am wondering about this because the AOCL programming guide only showed the example code for the producer kernel, not the management kernel. Is there a more complete example published somewhere that I could refer to?

Thank again!

Altera_Forum · ‎07-08-2015

Yes that is correct to my understanding.

I'm not too familiar with the programming model, but I think in terms of the host, I think it is declared the same way as normal global memory. I think when you first instantiate the manager kernel, you would feed it region in global memory for the producer to buffer the data for the consumer. The manager then knows the size and area of the "shared memory" and sends a token to the producer indicating the shared region. The consumer releases the shared memory when it sends a message back to the manager indicating that it's done with the data and does not need it anymore. At that time, the space becomes "free" and the producer is able to overwrite the data in that space with new data.

Yes, I was wondering that too myself. In terms of a published example, I am unaware of it, but i think it closely resembles this (https://en.wikipedia.org/wiki/multiple_buffering). Essentially in the "shared memory" The producer and buffer new data while the consumer is accessing another buffer such that there is little stall between the producer and consumer.

Altera_Forum · ‎07-08-2015

Thanks for the help! I will try and see it this works.