Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)
16598 Discussions

OpenCL Emulator with Parallel Kernels and Channels

Altera_Forum
Honored Contributor II
2,070 Views

When using more than 3 parallel kernels that are interconnected with channels I get strange behavior in the during emulation. The kernels don't seem to actually run in parallel which causes deadlocks. On the device (Socrates II) I can run as many kernels in parallel as I want. Is this a known limitation of the emulator? Are there any tricks to work around it?

0 Kudos
5 Replies
Altera_Forum
Honored Contributor II
621 Views

Hmmm I've had the opposite problem where everything works perfectly in emulation, but then fails on the actual device. 

 

I think the emulator runs as a sequential program (emulating your kernels by running them one after another). I was able to get a 4 kernel cyclical design with channels running on the emulator. Perhaps there's a bug in your program causing the deadlocks?  

 

Otherwise the channels do seem to behave differently on the emulator and on the device.
0 Kudos
Altera_Forum
Honored Contributor II
621 Views

If the emulator runs them sequentially that might explain why it doesn't work. I have two kernels that stream in sequential data from main memory into a channel each, an output streamer that writes sequential data to memory and a worker kernel which connects them. There are no cycles. I've had this situation multiple times now and it always works on the device, but always causes trouble in the emulator. The thing is, even on the emulator it works fine if I use small data sets, that is, stream only a few KB. When I use large data sets, the second input streamer kernel is never run (tested by inserting a printf command at the beginning of each kernel). 

 

The emulator, runnign kernels sequentially, seems to try to run the worker kernel before the second input streamer, causing a deadlock. 

 

 

I'm wondering though, how did you get a cyclical design to work on the emulator. I don't see how that's even possible if the emulator runs kernel sequentially.
0 Kudos
Altera_Forum
Honored Contributor II
621 Views

I've researched the issue some more and distilled it down to simpler code: 

 

__kernel void in_streamer_a(__global const uint2* in) { printf("in streamer a start\n"); for(uint i = 0; i < SIZE; ++i) { write_channel_altera(in_channel, in); } printf("in streamer end start\n"); } __kernel void in_streamer_b(__global const uint2* in) { printf("in streamer b start\n"); for(uint i = 0; i < SIZE; ++i) { write_channel_altera(in_channel, in); } printf("in streamer b end\n"); } __kernel void out_streamer(__global uint2* out) { printf("out streamer start\n"); for(uint i = 0; i < SIZE; ++i) { ushort4 value = read_channel_altera(out_channel); out = value; } printf("out streamer end\n"); } __kernel void worker() { printf("worker start\n"); for(uint i = 0; i < SIZE; ++i) { write_channel_altera(out_channel, read_channel_altera(in_channel) + read_channel_altera(in_channel)); } printf("worker stop\n"); }  

 

When I run all kernels in the emulator with a small SIZE parameter, i get the correct result and the following console output 

in streamer a start in streamer a stop in streamer b start in streamer b stop worker start worker stop out streamer start out streamer stop  

The kernels are run sequentially and the intermediate results are buffered in a channel buffer as it seems. This fits well with the official altera documents. However, when I use a large SIZE argument (10240 or more to be precise), I get the following output and a deadlock: 

in streamer a start  

Only the first kernel is ever run. It just hangs there. It could be that some internal buffer for the used channel is full, but I don't know of any way tho change its size. Maybe it's something else. Using the following channel attribute does not help: 

__attribute__((depth(SUFFICIENTLY_HUGE_NUMBER)));  

 

Note: everything works perfectly fine on the device, where the kernels are actually run in parallel.
0 Kudos
Altera_Forum
Honored Contributor II
621 Views

The cyclical design worked with the emulator because I believe it was simply storing all the channel writes in memory before proceeding with the next kernel. The emulator must have been smart enough to switch kernels upon blocking and return to each afterwards. 

 

When you're launching your kernels I'm assuming you're launching them all to run concurrently from the host correct? (As in all of them are using different command queues). 

 

I'm not sure if there was a way we can direct message each other from here, but perhaps we could collaborate via email or google hangouts?
0 Kudos
Altera_Forum
Honored Contributor II
621 Views

 

--- Quote Start ---  

The cyclical design worked with the emulator because I believe it was simply storing all the channel writes in memory before proceeding with the next kernel. The emulator must have been smart enough to switch kernels upon blocking and return to each afterwards. 

--- Quote End ---  

 

OK, that makes sense. 

 

 

--- Quote Start ---  

When you're launching your kernels I'm assuming you're launching them all to run concurrently from the host correct? (As in all of them are using different command queues). 

--- Quote End ---  

 

Yes, but they don't necessarily have to run in parallel. Running them sequentially in the order I call them should work just fine, as long as the channel data can be buffered. 

 

 

--- Quote Start ---  

I'm not sure if there was a way we can direct message each other from here, but perhaps we could collaborate via email or google hangouts? 

--- Quote End ---  

 

Sure, you can find my contact information here (https://mscharrer.net/contact/). Shoot me an email and we'll go on from there.
0 Kudos
Reply