Optimize Memory Accesses FPGA

Altera_Forum · ‎11-15-2017

Hi everyone. I am developing code that evaluates floating-point values from some input data. In my current approach, the evaluation for each cell of an int vector is done by one work-item.

This int vector contains the indexes used to access some other data. That data has to be read from global memory and mainly consists of structures. The structures are used to evaluate the floating-point measure I need, which is then saved to a vector of floats, one entry per work-item id. I'll provide a snippet of code.

The problem is that this code scales very well on my GPU but not on the FPGA: as I double the number of work-items, the run time increases badly.

I am very confused about how I should optimize the memory accesses. I can't find much online, and I am struggling with this problem because it seems to show up in every kernel I write. Thanks to anyone who can clarify or explain the proper approach to optimizing memory accesses.

    int tid = get_global_id(0);    // one work-item for each solution
    int groupIndex = tid * b_sizes;
    int indexWU = tid * b_sizes;
    for (int i = 0; i < b_sizes; i++)
    {
        float totalC = 0;
        int rIndex = b_solutions;
        if (rIndex != -1)
        {
            struct SO o= b_o;
            struct SR r= b_r;
            float freC = 0;
            if (r.sL != CRF)
            {
                float tW = b_wU;
                freC = getWBC(tW, r, b_wB) / tW*o.w;
            }
            float whC = r.whC*o.d;
            totalC += freC + whC;
        }
        b_solPerf += totalC;
    }

Altera_Forum · ‎11-15-2017

You are performing random indirect accesses, so the behavior you are observing is expected. The FPGA memory controller is extremely inefficient for random accesses, and given the low external memory bandwidth (25-35 GB/s), you cannot get much scaling by increasing the number of threads either. Furthermore, the FPGA doesn't have a smart cache that can properly handle redundant random accesses. In contrast, a GPU gives you over 10 times the external memory bandwidth, a much more efficient memory controller, and two levels of smart caches.

I'm afraid there isn't much you can do to improve the performance of random indirect memory accesses. If you could at least make your accesses direct, it would probably help, but at the end of the day, if you want good memory performance on an FPGA, you need large, coalesced, and aligned memory accesses.
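One way to read "make your accesses direct" is to resolve the index lookup on the host before launching the kernel, so that each work-item reads a contiguous run of records instead of following indexes into global memory. The following is only a minimal sketch in plain C under stated assumptions: the `Record` type, `gather_records`, and `sum_cost` are hypothetical names, since the thread does not show the definitions of `struct SO` and `struct SR`, and only the `whC * d` term of the original computation is reproduced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-record data the kernel needs;
   the real struct SO / struct SR layouts are not shown in the thread. */
typedef struct {
    float whC;
    float d;
} Record;

/* Host-side step: copy records into work-item order.
   indices[k] < 0 marks an empty slot (the kernel's rIndex != -1 check);
   empty slots are zero-filled so they contribute nothing to the sum. */
static void gather_records(const int *indices, size_t n_slots,
                           const Record *table, size_t n_records,
                           Record *packed)
{
    for (size_t k = 0; k < n_slots; k++) {
        int idx = indices[k];
        if (idx < 0) {
            packed[k].whC = 0.0f;
            packed[k].d = 0.0f;
        } else {
            assert((size_t)idx < n_records);
            packed[k] = table[idx];
        }
    }
}

/* What the per-work-item loop reduces to after packing: a sequential
   read of packed[base + i] instead of table[indices[base + i]]. */
static float sum_cost(const Record *packed, size_t base, size_t count)
{
    float total = 0.0f;
    for (size_t i = 0; i < count; i++) {
        total += packed[base + i].whC * packed[base + i].d;
    }
    return total;
}
```

This moves the random accesses to the host rather than removing them, and it duplicates any record that is referenced more than once, so the packed buffer can be larger than the original table. Whether the extra host work and transfer size are worth it depends on the workload and would need to be measured.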