OpenCL Emulator with Parallel Kernels and Channels

Altera_Forum · ‎12-22-2015

When using more than 3 parallel kernels that are interconnected with channels I get strange behavior in the during emulation. The kernels don't seem to actually run in parallel which causes deadlocks. On the device (Socrates II) I can run as many kernels in parallel as I want. Is this a known limitation of the emulator? Are there any tricks to work around it?

Altera_Forum · ‎01-16-2016

Hmmm I've had the opposite problem where everything works perfectly in emulation, but then fails on the actual device.

I think the emulator runs as a sequential program (emulating your kernels by running them one after another). I was able to get a 4 kernel cyclical design with channels running on the emulator. Perhaps there's a bug in your program causing the deadlocks?

Otherwise the channels do seem to behave differently on the emulator and on the device.

Altera_Forum · ‎01-19-2016

If the emulator runs them sequentially that might explain why it doesn't work. I have two kernels that stream in sequential data from main memory into a channel each, an output streamer that writes sequential data to memory and a worker kernel which connects them. There are no cycles. I've had this situation multiple times now and it always works on the device, but always causes trouble in the emulator. The thing is, even on the emulator it works fine if I use small data sets, that is, stream only a few KB. When I use large data sets, the second input streamer kernel is never run (tested by inserting a printf command at the beginning of each kernel).

The emulator, runnign kernels sequentially, seems to try to run the worker kernel before the second input streamer, causing a deadlock.

I'm wondering though, how did you get a cyclical design to work on the emulator. I don't see how that's even possible if the emulator runs kernel sequentially.

Altera_Forum · ‎01-20-2016

I've researched the issue some more and distilled it down to simpler code:


__kernel void in_streamer_a(__global const uint2* in) {
    printf("in streamer a start\n");
    for(uint i = 0; i < SIZE; ++i) {
        write_channel_altera(in_channel, in);
    }
    printf("in streamer end start\n");
}
__kernel void in_streamer_b(__global const uint2* in) {
    printf("in streamer b start\n");
    for(uint i = 0; i < SIZE; ++i) {
        write_channel_altera(in_channel, in);
    }
    printf("in streamer b end\n");
}
__kernel void out_streamer(__global uint2* out) {
    printf("out streamer start\n");
    for(uint i = 0; i < SIZE; ++i) {
        ushort4 value = read_channel_altera(out_channel);
        out = value;
    }
    printf("out streamer end\n");
}
__kernel void worker()
{
    printf("worker start\n");
    for(uint i = 0; i < SIZE; ++i) {
        write_channel_altera(out_channel, read_channel_altera(in_channel) + read_channel_altera(in_channel));
    }
    printf("worker stop\n");
}

When I run all kernels in the emulator with a small SIZE parameter, i get the correct result and the following console output


in streamer a start
in streamer a stop
in streamer b start
in streamer b stop
worker start
worker stop
out streamer start
out streamer stop

The kernels are run sequentially and the intermediate results are buffered in a channel buffer as it seems. This fits well with the official altera documents. However, when I use a large SIZE argument (10240 or more to be precise), I get the following output and a deadlock:


in streamer a start

Only the first kernel is ever run. It just hangs there. It could be that some internal buffer for the used channel is full, but I don't know of any way tho change its size. Maybe it's something else. Using the following channel attribute does not help:


__attribute__((depth(SUFFICIENTLY_HUGE_NUMBER)));

Note: everything works perfectly fine on the device, where the kernels are actually run in parallel.

Altera_Forum · ‎01-20-2016

The cyclical design worked with the emulator because I believe it was simply storing all the channel writes in memory before proceeding with the next kernel. The emulator must have been smart enough to switch kernels upon blocking and return to each afterwards.

When you're launching your kernels I'm assuming you're launching them all to run concurrently from the host correct? (As in all of them are using different command queues).

I'm not sure if there was a way we can direct message each other from here, but perhaps we could collaborate via email or google hangouts?

Altera_Forum · ‎01-20-2016

--- Quote Start ---

The cyclical design worked with the emulator because I believe it was simply storing all the channel writes in memory before proceeding with the next kernel. The emulator must have been smart enough to switch kernels upon blocking and return to each afterwards.

--- Quote End ---

OK, that makes sense.

--- Quote Start ---

When you're launching your kernels I'm assuming you're launching them all to run concurrently from the host correct? (As in all of them are using different command queues).

--- Quote End ---

Yes, but they don't necessarily have to run in parallel. Running them sequentially in the order I call them should work just fine, as long as the channel data can be buffered.

--- Quote Start ---

I'm not sure if there was a way we can direct message each other from here, but perhaps we could collaborate via email or google hangouts?

--- Quote End ---

Sure, you can find my contact information here (https://mscharrer.net/contact/). Shoot me an email and we'll go on from there.