
TCM speed advantage over on-chip RAM?

Altera_Forum
Honored Contributor II

(Please excuse me if my ignorance of cache and bus technologies shows through in my question.) 

 

My Nios II FPGA application uses only on-chip RAM for code and data storage (no external memory). Is there any speed advantage to connecting all my Nios II on-chip memory as tightly coupled memory, as opposed to plain old Avalon-MM bussed memory as the "my first" tutorials always instruct? 

 

Thanks for your insight, John Speth
Altera_Forum
Honored Contributor II

It will reduce latency, so you should see a gain if you make a lot of memory accesses, especially random ones. The actual gain you get in practice depends a lot on your application.

Altera_Forum
Honored Contributor II

Using tightly coupled memory also saves you from allocating resources to the caches. 

You'll still need a minimal instruction cache if you are using the JTAG loader.
Altera_Forum
Honored Contributor II

The access time of a tightly coupled memory is equivalent to the cache access time when a cache hit occurs. Many call tightly coupled memories "scratch pads" since they have low-latency access times like a cache, and they are recommended when you want to work on data 'locally'. Since not all cache accesses hit, a tightly coupled memory will achieve higher performance, but how much higher depends on the algorithm and its memory access patterns. 
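
To make the scratch-pad idea concrete, here is a minimal C sketch of staging a working set in a tightly coupled data memory. The ".tcm_data" section name is hypothetical; it would have to be defined in your BSP linker script and mapped to the TCM region:

```c
/* A minimal sketch of the "scratch pad" idea. The ".tcm_data" section
 * name is hypothetical; it would be defined in the BSP linker script
 * and mapped to the tightly coupled data memory. */
#include <string.h>

#define WORKSET_WORDS 256

/* Place the working buffer in the tightly coupled data memory. */
static int workset[WORKSET_WORDS] __attribute__((section(".tcm_data")));

/* Stage a block into the scratch pad, process it with fixed low-latency
 * accesses (no cache misses possible), then copy the results back out. */
void process_block(const int *src, int *dst)
{
    memcpy(workset, src, sizeof(workset));
    for (int i = 0; i < WORKSET_WORDS; i++)
        workset[i] *= 2;
    memcpy(dst, workset, sizeof(workset));
}
```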

 

Caches help when your accesses have 'temporal' and 'spatial' locality. Temporal locality means you access the same memory location frequently, so having the data cached saves the CPU cycles spent fetching from and storing to main memory multiple times. Spatial locality only comes into play when you set the cache line size to be greater than 4 bytes/line (the native word size of Nios II). The Nios II instruction cache is fixed at 32 bytes/line, but the data cache can be configured for 4/16/32 bytes per line. With a 16 or 32 bytes-per-line data cache, when a cache miss occurs not only does that particular word get loaded into the cache line, but the other words that map to the same line get loaded as well. So if you were accessing a 32-bit array sequentially and a particular access resulted in a cache miss, then not only will that array element get loaded, but the elements before/after it will get loaded too. Which elements get loaded depends on how they line up in memory in terms of the address. So spatial locality means that if you frequently access data in the same general location in memory, caches will help minimize main memory accesses, assuming the cache line size is greater than the native word size of the processor. 
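
A short sketch of why the line size matters, assuming a 32-byte (8-word) data cache line; the array size here is arbitrary:

```c
/* Sketch of spatial locality with a 32-byte (8-word) cache line. */
#define N 1024
static int a[N];

int sum_sequential(void)
{
    int sum = 0;
    /* Spatial locality: one miss loads 8 consecutive words, so the
     * next 7 iterations hit in the cache. */
    for (int i = 0; i < N; i++)
        sum += a[i];
    return sum;
}

int sum_strided(void)
{
    int sum = 0;
    /* A stride of 8 words lands on a new line every iteration, so
     * every access can miss and the wider line buys nothing. */
    for (int i = 0; i < N; i += 8)
        sum += a[i];
    return sum;
}
```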

 

Here are more details about direct-mapped caches; when I refer to "lines" I'm talking about the "index" portion of the address: 

 

http://www.laynetworks.com/direct%20mapped%20cache.htm
Altera_Forum
Honored Contributor II

Except that there is little point in using the cache when accessing M9K memory blocks. Use the 'dual ports' on the memory so that other Avalon masters can access it. 

 

You probably also want to ensure the linker places all read-only data into the data memory (not the code memory), since you don't want to be doing Avalon transfers (with or without the data cache) for strings and switch-statement jump tables. This probably requires a custom linker script - for a small system, start from an empty file and add sections as you need them!
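
The bulk routing of .rodata belongs in the linker script itself, but for illustration, individual objects can also be pinned with a GCC section attribute. The ".onchip_data" output section name below is illustrative and would have to exist in your custom linker script:

```c
/* Sketch: pin a read-only table into the on-chip data memory instead of
 * the code memory. ".onchip_data" is an illustrative output section name
 * that the custom linker script would map to the data RAM; without it,
 * this table would land in .rodata alongside the code. */
static const char hexdigits[] __attribute__((section(".onchip_data"))) =
    "0123456789abcdef";
```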
Altera_Forum
Honored Contributor II

That's a good point. If you plan on creating a design that runs the code completely on-chip, then I would recommend these two configurations to ensure the maximum performance possible: 

 

During Development: 

 

  1. Turn off the data cache 

  2. Reduce the instruction cache to 512B 

  3. Add a dual-port on-chip RAM that will be pre-initialized with your code 

  4. Hook up tightly coupled instruction and data masters to the dual-port RAM

Development Complete (only applicable if you don't plan on keeping the JTAG debug module): 

 

  1. Reduce the instruction cache to 0B (this will remove the instruction master) 

  2. Remove the JTAG debug module 

  3. Regenerate the system 

  4. Recompile the software 

  5. Recompile the hardware

The CPU at this point will still have a data master but no instruction master. The only reason the instruction cache was present during the development cycle is that you need the instruction master to be connected to the JTAG debug module. I won't go into the details about why the instruction master is removed when you have no instruction cache, so just take my word for it. Any time the data cache is removed or set to 4 B/line, the data master does not include the read-data-valid signal, so if you have high-latency reads to perform you might want to keep the data cache turned on (16/32 B/line) and let it perform the accesses, since it will pipeline the reads.
Altera_Forum
Honored Contributor II

Thanks to everyone for the great discussion and extra info. I have a few good ideas for pursuing an optimal memory system for my application. 

 

John Speth