Intel® Quartus® Prime Software

Can single work-item kernels run in parallel on the same device?

Altera_Forum
Honored Contributor II

Can single work-item kernels run in parallel on the same device (i.e. on the same board)? 

 

I've been trying to get a very simple example of task parallelism working, but have not been able to get more than one kernel to run at the same time on the same board. 

 

The kernel computes part of a summing equation - let's say it sums the numbers from "start" to "end". 

In a .cl file there are multiple identical kernels that do this - let's say there are 12 of them. 

 

Single work-item kernels have been used to ensure that the loop can be pipelined. 

 

The host code creates multiple kernels and multiple contexts in an effort to run more than one in parallel. 

After trying many, many things, I've yet to get them to run in parallel. Initially I used just the time profile 

to see how much time they take to run. Each kernel takes about the same time (e.g. 25 ms). If 12 kernels 

are started, the time is 300 ms.  

 

There are four identical boards in the system. If 12 kernels are used, three on each of the four boards, then each kernel still takes 25 ms, but the boards run in parallel with each other, so the total time is only 75 ms (3 x 25 ms per board). 

 

What else is needed to get the kernels to run in parallel on the same board? I've been able to turn on profiling and can see that each one is started - one after the other. 

 

Everything seems to work (i.e. the correct answer is produced) but the kernels don't run at the same time on 

a single board. 

 

Do I need to use NDRange kernels?  

 

Any suggestion would be greatly appreciated! (this shouldn't be so hard?!?)
14 Replies
Altera_Forum
Honored Contributor II

Any type of kernel can run in parallel with another, as long as each is invoked in a separate queue and no event is used to forcibly serialize them; the key point here is that they must run in different queues, and you should not force the host to wait for each kernel execution separately, e.g. with clFinish() after every launch, or by waiting on each event as soon as it is enqueued. You can, and probably should, wait on the event associated with each kernel invocation, or call clFinish() on every single queue you have, after invoking all the kernels from the host, to make sure all kernels have finished execution and the data can then be read back from the device. 
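In host-code terms, the pattern described above looks roughly like this (a sketch, not compilable on its own; the names `queue`, `kernel`, `kerns`, and `MAX_KERNS` are assumptions, not SDK names):

```c
/* Sketch: one command queue per kernel, and no blocking inside the launch loop. */
cl_event evt[MAX_KERNS];
for (int k = 0; k < kerns; k++) {
    status = clEnqueueTask(queue[k], kernel[k], 0, NULL, &evt[k]);
}
/* Block only once, after all kernels have been enqueued... */
status = clWaitForEvents(kerns, evt);
/* ...and only now read the results back from the device. */
```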

 

Another way this can be accomplished more efficiently is to use replicated autorun kernels; more details about this are available in "Intel FPGA SDK for OpenCL Programming Guide, Section 11.4". 

 

Finally, I need to emphasize that since external memory bandwidth is shared between the kernels running in parallel, you should not expect to get linear speed-up from using multiple parallel kernels. In fact, if one of your kernels is memory-bound on its own, you will not see any speed-up at all from replicating it. 

 

P.S. I have done this multiple times, and it certainly works.
Altera_Forum
Honored Contributor II

I've tried everything I can think of but I still cannot get my kernel tasks to run in parallel. 

 

The kernels are only doing a simple sum of the numbers from "start" to "finish". Other than their result, no other memory is being used, so sharing memory shouldn't be what prevents them from running together. 

 

Each kernel is created in its own queue. 

 

No waiting via clFinish() is being done. I get an event from each kernel and use it to determine when that kernel is complete and print out the time. I can see from that output that the kernels are not running in parallel. 

 

Because they are not running concurrently, even though they should be able to, there must be something shared that is preventing them. I don't know what, though. 

 

Can anyone take a look at the kernel code and C-code and tell me what is keeping them from running concurrently? 

 

Here's the kernel code: 

__kernel void sumN1(const double start, const double stop, const double step, __global double *z) {
    //// get index of the work item
    //int index = get_global_id(0);

    // init result
    double sum = 0.0;
    for (double i = start; i <= stop; i += step) {
        sum += i;
    }
    *z = sum;    // note: the '*' was dropped by the forum formatting
} // sumN1

// ... eight identical kernels ('cept the kernel name) - in the same .cl file ...

__kernel void sumN8(const double start, const double stop, const double step, __global double *z) {
    //// get index of the work item
    //int index = get_global_id(0);

    // init result
    double sum = 0.0;
    for (double i = start; i <= stop; i += step) {
        sum += i;
    }
    *z = sum;
} // sumN8  

 

 

 

 

Here's "most" of the host C code: 

 

////////////////////////////////////////////////////////////////////////////////
// main - sumN num
////////////////////////////////////////////////////////////////////////////////

// NOTE: array brackets (e.g. queue[kz]) appear to have been stripped by the
// forum software; they have been restored below. MAX_KERNS is an assumed bound.
#define MAX_KERNS 12

int main(int argc, char *argv[]) {
    cl_device_id accDeviceID;              //device ID of first accelerator device
    cl_platform_id accPlatformID;          //platform ID of platform w/first accelerator device
    cl_mem answer[MAX_KERNS];              //buffers to collect results
    std::string binary_file;               //name of OpenCL program
    cl_context context;                    //OpenCL context for this application
    cl_device_type deviceType;             //type of a device (CPU, GPU, ACCELERATOR)
    cl_event event;                        //wait event synchronization handle used by OpenCL API
    bool foundintel = false;               //indicates that Intel FPGA card was found
    char info_text[256];                   //value of some returned text information (size assumed)
    bool isACC = false;                    //flag to remember that an accelerator has been found
    cl_kernel kernel[MAX_KERNS];           //OpenCL kernels for this application
    int kerns = 0;                         //number of kernels to use
    cl_uint numDevices;                    //number of OpenCL computing devices for a platform
    cl_uint numPlatforms;                  //number of OpenCL platforms (typically 1)
    double number = 0.0;                   //number to compute sum to
    cl_program program;                    //OpenCL program for this application
    cl_command_queue queue[MAX_KERNS];     //OpenCL command queues, one per kernel
    double result[MAX_KERNS] = { 0.0 };    //results of the summation computations
    size_t size;                           //size of returned information from OpenCL API
    double start = 1.0;                    //start of summing
    cl_int status;                         //return code used by OpenCL API
    double step = 1.0;                     //step of summing
    double stop = 1.0;                     //end of summing
    cl_int task_done;                      //info from event query
    cl_event task_event[MAX_KERNS];        //events from tasks

    // ... some code omitted here that handled input args and platform/device setup ...

    ////////////////////////////////////////
    // OpenCL context
    context = clCreateContext(NULL, 1, &accDeviceID, NULL, NULL, &status);
    exitOnFail(status, "create context");

    ////////////////////////////////////////
    // OpenCL command queues, one per kernel
    for (int kz = 0; kz < kerns; kz++) {
        queue[kz] = clCreateCommandQueue(context, accDeviceID, 0, &status);
        exitOnFail(status, "create command queue");
    }

    ////////////////////////////////////////
    // Create the program for all devices. Use the first device as the
    // representative device (assuming all devices are of the same type).
    binary_file = getBoardBinaryFile("sumN", accDeviceID);
    program = createProgramFromBinary(context, binary_file.c_str(), &accDeviceID, 1);

    ////////////////////////////////////////
    // Build the program that was just created.
    status = clBuildProgram(program, 0, NULL, "", NULL, NULL);
    exitOnFail(status, "Failed to build program");

    const double start_time = getCurrentTimestamp();

    ////////////////////////////////////////
    // Create the kernels - the name passed in here must match a kernel name in the
    // original CL file, which was compiled into an AOCX file using the AOC tool
    char kernel_name[16] = "sumNx";              //kernel name, as defined in the CL file
    for (int kz = 0; kz < kerns; kz++) {
        sprintf(kernel_name, "sumN%d", kz + 1);  //generate the kernel name
        kernel[kz] = clCreateKernel(program, kernel_name, &status);
        exitOnFail(status, "Failed to create kernel");

        // Set the scalar kernel arguments (arguments 0-2)
        status = clSetKernelArg(kernel[kz], 0, sizeof(cl_double), &start);
        exitOnFail(status, "Failed to set kernel arg 0");
        status = clSetKernelArg(kernel[kz], 1, sizeof(cl_double), &stop);
        exitOnFail(status, "Failed to set kernel arg 1");
        status = clSetKernelArg(kernel[kz], 2, sizeof(cl_double), &step);
        exitOnFail(status, "Failed to set kernel arg 2");

        // Last OpenCL argument: memory buffer object for the result
        answer[kz] = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                    sizeof(cl_double), &result[kz], &status);
        exitOnFail(status, "create buffer for answer");
        status = clSetKernelArg(kernel[kz], 3, sizeof(cl_mem), &answer[kz]);
        exitOnFail(status, "set kernel argument answer");
    }

    // Launch the kernels
    for (int kz = 0; kz < kerns; kz++) {
        status = clEnqueueTask(queue[kz], kernel[kz], 0, NULL, &task_event[kz]);
        exitOnFail(status, "Failed to launch kernel");
    }

    // Poll each task's event until all kernels are done
    int total_done = 0;
    int its_done[MAX_KERNS] = { 0 };
    while (total_done < kerns) {
        for (int kz = 0; kz < kerns; kz++) {
            if (its_done[kz] == 0) {
                status = clGetEventInfo(task_event[kz], CL_EVENT_COMMAND_EXECUTION_STATUS,
                                        sizeof(cl_int), &task_done, NULL);
                if (task_done == CL_COMPLETE) {
                    printf("Task:%d complete (%0.3f ms)\n", kz,
                           (getCurrentTimestamp() - start_time) * 1.0e3);
                    total_done++;
                    its_done[kz] = 1;
                } else {
                    //printf("Task:%d incomplete\n", kz);
                }
            } // if kz task not done
        } // for each task event
    } // wait for kernels to complete

    //// Wait for command queues to complete pending events
    //for (int kz = 0; kz < kerns; kz++) {
    //    status = clFinish(queue[kz]);
    //    exitOnFail(status, "Failed to finish");
    //}

    const double end_time = getCurrentTimestamp();

    // Wall-clock time taken.
    printf("\nTime: %0.3f ms (%0.3f ms / kernel)\n",
           (end_time - start_time) * 1e3,
           (end_time - start_time) * 1e3 / (double)kerns);

    for (int kz = 0; kz < kerns; kz++) {
        printf("Sum 0-%f (step %f) = %f\n", number, step, result[kz]);
    }

    // Free the resources allocated
    cleanup();
    for (int kz = 0; kz < kerns; kz++) {
        if (kernel[kz]) { clReleaseKernel(kernel[kz]); }
    }
    if (program) { clReleaseProgram(program); }
    for (int kz = 0; kz < kerns; kz++) {
        if (queue[kz]) { clReleaseCommandQueue(queue[kz]); }
    }
    if (context) { clReleaseContext(context); }

    exit(0);
} // main  

 

 

 

This is the output from a run using four kernels. 

There is no parallelism. I've tried using the profiler and it clearly shows that each kernel 

runs, one after the other. 

 

$ bin/host 100000 4 

Reprogramming device [0] with handle 1 

Task:0 complete (3.600 ms) 

Task:1 complete (7.096 ms) 

Task:2 complete (10.583 ms) 

Task:3 complete (14.066 ms) 

 

Time: 14.069 ms (3.517 ms / kernel) 

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000 

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000 

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000 

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Altera_Forum
Honored Contributor II

Try putting "start_time = getCurrentTimestamp();" before the kernel invocation loop. I have a feeling your kernel run time is so short that the total time is being dominated by the clCreateBuffer() calls. Note that on most hardware, using a host pointer will result in the OpenCL runtime actually allocating and transferring the whole buffer to device memory. Also consider increasing your input size so that your total run time is at least a few seconds.

Altera_Forum
Honored Contributor II

I did try the longer runs. The result is the same (i.e. no apparent parallelism).  

Here's a snapshot w/8 kernels. 

 

 

$ bin/host 100000000 8 

 

Task:0 complete (3437.999 ms) 

Task:1 complete (6875.815 ms) 

Task:2 complete (10313.553 ms) 

Task:3 complete (13751.281 ms) 

Task:4 complete (17189.009 ms) 

Task:5 complete (20626.756 ms) 

Task:6 complete (24064.509 ms) 

Task:7 complete (27502.250 ms) 

 

 

Time: 27502.254 ms (3437.782 ms / kernel) 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Altera_Forum
Honored Contributor II

And what about moving the "start_time = getCurrentTimestamp()" call?

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

And what about moving the "start_time = getCurrentTimestamp()" call? 

--- Quote End ---  

 

 

Same result.... 

 

Updated code snippet (array brackets restored; they were stripped by the forum software): 

const double start_time = getCurrentTimestamp();

// Launch the kernels
for (int kz = 0; kz < kerns; kz++) {
    status = clEnqueueTask(queue[kz], kernel[kz], 0, NULL, &task_event[kz]);
    exitOnFail(status, "Failed to launch kernel");
}

 

results with 8 kernels.... 

 

$ bin/host 100000000 8  

get_plat_info: Intel(R) FPGA SDK for OpenCL(TM) 

Reprogramming device [0] with handle 1  

Task:0 complete (3437.884 ms) 

Task:1 complete (6875.553 ms) 

Task:2 complete (10313.318 ms) 

Task:3 complete (13751.042 ms) 

Task:4 complete (17188.784 ms) 

Task:5 complete (20626.527 ms) 

Task:6 complete (24064.256 ms) 

Task:7 complete (27501.988 ms) 

 

Time: 27501.995 ms (3437.749 ms / kernel) 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

 

and 

 

results with 4 kernels.... 

 

$ bin/host 100000000 4 

Reprogramming device [0] with handle 1  

Task:0 complete (3437.864 ms) 

Task:1 complete (6875.626 ms) 

Task:2 complete (10313.367 ms) 

Task:3 complete (13751.130 ms) 

 

Time: 13751.135 ms (3437.784 ms / kernel) 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000 

 

 

 

I have four boards. In other code, I can launch different numbers of kernels on each of the four boards. When I do this, I do see the speed-up I'm looking for. 

e.g.  

If I run 1 kernel on each of four boards, it takes time X ms. 

But, 

When I run 4 kernels on one board, it takes approx. 4 * X ms (as shown above).
Altera_Forum
Honored Contributor II

I cannot really think of anything else, and I don't see any particular issues in your host code.

Altera_Forum
Honored Contributor II

What about: 

I - Add a clFlush after each clEnqueueTask() 

II - Profile the FPGA design (or print all start and end timestamps of the kernels' events) to see if kernels overlap in time.
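For the second point, per-kernel start/end timestamps can be read back from the events, provided each queue was created with CL_QUEUE_PROFILING_ENABLE (sketch, not compilable on its own; `task_event` and `kz` follow the naming of the host code above):

```c
/* Queues must be created with profiling enabled for this to work, e.g.:
 * queue[kz] = clCreateCommandQueue(context, accDeviceID, CL_QUEUE_PROFILING_ENABLE, &status); */
cl_ulong t_start, t_end;   /* device timestamps, in nanoseconds */
clGetEventProfilingInfo(task_event[kz], CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &t_start, NULL);
clGetEventProfilingInfo(task_event[kz], CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &t_end, NULL);
printf("Task:%d ran %llu -> %llu ns\n", kz,
       (unsigned long long)t_start, (unsigned long long)t_end);
/* If the [start, end] intervals of different tasks overlap, the kernels ran in parallel. */
```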
Altera_Forum
Honored Contributor II

A suggestion - not sure if this is the issue, but you could try using CL_MEM_COPY_HOST_PTR instead of CL_MEM_USE_HOST_PTR, which creates a separate copy of the input data for each cl_mem object rather than having them all point at the same chunk of allocated memory. Also try adding the 'restrict' qualifier to the global pointers in your kernel, to tell the compiler that no other pointer modifies the same data. You will need to do an enqueue-read-buffer to get the data back out, since the buffers are no longer mapped.  

In my experience, mapped buffers do the same thing as writeBuffer/readBuffer calls, apart from the timing of when the kernel reads/writes over PCIe, but they do seem to work well for utilizing pinned memory on GPUs. 

 

I am curious whether there is an inherent (but unintentional) memory dependency on the global memory.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

What about: 

I - Add a clFlush after each clEnqueueTask() 

II - Profile the FPGA design (or print all start and end timestamps of the kernels' events) to see if kernels overlap in time. 

--- Quote End ---  

 

 

I - I thought this was a "no-no" for parallel operations. I'll try it. 

 

II - I did "compile" with the profiler on for the original code (the above is a watered-down version of the real objective) and it pretty clearly showed no overlapping.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

What about: 

I - Add a clFlush after each clEnqueueTask() 

--- Quote End ---  

 

This changed the way the kernels ran (each one appeared to run longer), but the overall time was the same. That is, it seems each kernel was started but couldn't complete until the previous one completed. 

e.g. Without the clFlush() 

$ bin/host 100000 4 

Reprogramming device [0] with handle 1 

Task:0 complete (4.189 ms) 

Task:1 complete (8.172 ms) 

Task:2 complete (12.137 ms) 

Task:3 complete (16.093 ms) 

 

Time: 16.099 ms (4.025 ms / kernel) 

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000 

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000 

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000 

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000 

 

e.g. w/clFlush() 

$ bin/host 100000 4 

Reprogramming device [0] with handle 1 

Task:0 complete (12.253 ms) 

Task:1 complete (12.283 ms) 

Task:2 complete (12.286 ms) 

Task:3 complete (16.191 ms) 

 

Time: 16.197 ms (4.049 ms / kernel) 

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000 

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000 

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000 

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000 

 

 

 

 

--- Quote Start ---  

 

II - Profile the FPGA design (or print all start and end timestamps of the kernels' events) to see if kernels overlap in time. 

--- Quote End ---  

 

 

https://alteraforum.com/forum/attachment.php?attachmentid=14752&stc=1
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

but you could try using CL_MEM_COPY_HOST_PTR instead of CL_MEM_USE_HOST_PTR 

--- Quote End ---  

 

This did not help. 

 

 

--- Quote Start ---  

adding the 'restrict' flag to your global variables in the kernel 

--- Quote End ---  

 

This did not help either. 

 

However, this got me tinkering. I did try changing the CL_MEM_READ_WRITE to CL_MEM_WRITE_ONLY. 

 

This did work!!! 

 

$ bin/host 100000 4 

Reprogramming device [0] with handle 1 

Task:2 complete (4.529 ms) 

Task:3 complete (4.556 ms) 

Task:0 complete (4.559 ms) 

Task:1 complete (4.561 ms) 

 

Time: 4.563 ms (1.141 ms / kernel) 

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000 

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000 

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000 

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000 

 

https://alteraforum.com/forum/attachment.php?attachmentid=14753&stc=1  
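For reference, the change that made the difference amounts to a single flag in the buffer allocation (sketch only; the surrounding arguments follow the host code posted above, with array brackets restored):

```c
/* Before: execution was serialized */
answer[kz] = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                            sizeof(cl_double), &result[kz], &status);

/* After: the kernels run in parallel */
answer[kz] = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
                            sizeof(cl_double), &result[kz], &status);
```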

 

 

Thanks SO MUCH to nicolacdnll and fand for giving some new suggestions that FINALLY led to a solution!
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

However, this got me tinkering though. I did try changing the CL_MEM_READ_WRITE to CL_MEM_WRITE_ONLY. 

--- Quote End ---  

 

 

That sounds like the host compiler/runtime was assuming a false dependency between the answer[] buffers, either because the buffers are defined as an array, or because you are using host pointers. I always use CL_MEM_READ_WRITE for buffers accessed by parallel kernels and have never had such a problem; however, I do not use host pointers.
Altera_Forum
Honored Contributor II

Thanks HRZ for answering my original post and getting me on a path to a solution!
