Intel® Quartus® Prime Software

OpenCL SDK unable to calculate II


I am looking at the generated report for my matrix multiply OpenCL kernel. The report says it is unable to estimate the II, and the bottleneck is shown as 'n/a'. Any ideas what might be the reason for this? I am using version 18.0 of the tools, and I am attaching a screenshot of the loop analysis report (which shows my code).


Update: Kernel Code below (wasn't able to attach):

__kernel void mysgemmNT_c_fpga_c_c_c(__global float* restrict A, ulong bytes_A,
                                     __global float* restrict B, ulong bytes_B,
                                     __global float* restrict C, ulong bytes_C,
                                     uint mt, uint nt, uint kt)
{
    // unsigned tid = get_global_id(0);
    // printf("Thread: %d\n", tid);
    const int size = 1024;
    __local float rowA[size];
    __local float rowC[size];
    __local float localB[size*size];
    const int n = size;
    const int m = size;
    const int k = size;

    // for (int x = 0; x < m; ++x)
    //     for (int y = 0; y < k; ++y)
    //         localA[y+x*k] = A[y+x*k];

    for (int z = 0; z < n; ++z)
        for (int y = 0; y < k; ++y)
            localB[y+z*k] = B[y+z*k];

    for (int x = 0; x < m; ++x) {
        for(int y = 0; y < k; ++y)
            rowA[y] = A[y+x*k];

        for (int z = 0; z < n; ++z) {
            float c = 0.0;
            for (int y = 0; y < k; ++y) {
                int indexA = y+x*k;
                int indexB = z*k+y;
                // printf("%d: A[%d][%d]=%f,\tB[%d][%d]=%f\n", y,x,y,A[indexA],z,y,B[indexB]);
                c += rowA[y]*localB[indexB];
            }
            rowC[z] = c;
            // printf("C[%d][%d] = %f, %f\n", x, z, C[indexC], c);
        }

        for (int z = 0; z < n; ++z) {
            int indexC = z+x*n;
            C[indexC] = rowC[z];
        }
    }
}
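
For readers following along, here is a minimal plain-C sketch of what the kernel above appears to compute: C = A * B^T with row-major storage, which is presumably what the "NT" in the kernel name stands for. The function name `sgemm_nt_ref` and the runtime `m`, `n`, `k` parameters are illustrative assumptions (the kernel hard-codes all three to 1024); this is a host-side reference for checking results, not part of the original post.

```c
#include <stddef.h>

/* Hypothetical host-side reference (not from the original post).
   Computes C[x][z] = sum over y of A[x][y] * B[z][y], i.e. C = A * B^T,
   using the same row-major indexing as the kernel:
   A is m x k, B is n x k, C is m x n. */
void sgemm_nt_ref(const float *A, const float *B, float *C,
                  size_t m, size_t n, size_t k)
{
    for (size_t x = 0; x < m; ++x) {
        for (size_t z = 0; z < n; ++z) {
            float c = 0.0f;
            for (size_t y = 0; y < k; ++y)
                c += A[y + x * k] * B[y + z * k];  /* same indices as the kernel */
            C[z + x * n] = c;
        }
    }
}
```

Comparing the FPGA output against a reference like this on a small input is one way to confirm that the loop structure, rather than the arithmetic, is what the report is struggling with.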


3 Replies
Valued Contributor III

Can you attach the kernel file? I'd like to generate the report using an earlier version of the compiler and check the report.


@HRZ I added my code to my original post. (For some reason it didn't let me attach a .cl file, so I just pasted it in the post itself.)

Valued Contributor III

I checked your code with 16.1.2. All loops are reported to have an II of one, while the loop on "x" is limited to 16 "threads" (i.e. every access to the local buffers in the loop will be stalled at least once every 16 iterations) to limit on-chip memory replication. This information, which is VERY important, seems to have been removed in the newer versions of the report. I think the new report is trying to say something along the lines of "II is an approximation due to stallable accesses/inner loops".
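
To make the role of II concrete, here is a rough sketch of the usual textbook estimate for a pipelined loop: the first iteration takes the pipeline latency, and each later iteration starts II cycles after the previous one. The function name `pipelined_loop_cycles` and any latency value passed to it are hypothetical, and the model deliberately ignores stalls, so for a loop with stallable local-memory accesses like the one discussed here it is a best-case figure only.

```c
/* Hypothetical helper (not from the thread or the Intel tools).
   Best-case cycle count for a pipelined loop, assuming no stalls:
   latency for the first iteration, plus ii cycles for each later one. */
unsigned long long pipelined_loop_cycles(unsigned long long iterations,
                                         unsigned long long ii,
                                         unsigned long long latency)
{
    if (iterations == 0)
        return 0;  /* an empty loop costs nothing in this model */
    return latency + ii * (iterations - 1);
}
```

With an assumed latency of 100 cycles, a 1024-iteration loop comes out at 1123 cycles for an II of 1 and 2146 cycles for an II of 2. Any stalls on the local buffers would add to these numbers, which is why a reported II of 1 can still be optimistic.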
