(Please excuse me if my ignorance of cache and bus technologies shows through in my question.)
My Nios II FPGA application uses only on-chip RAM for code and data storage (no external memory). Is there any speed advantage to connecting all my Nios II on-chip memory as tightly coupled memory, as opposed to plain old Avalon-MM bused memory as the "my first" tutorials always instruct? Thanks for your insight, John Speth
It will reduce latency, so you should see a gain if you have a lot of memory accesses, especially random ones. The actual gain you get in practice depends a lot on your application.
Using tightly coupled memory also saves you from allocating resources to the caches.
You'll still need a minimal instruction cache if you are using the JTAG loader.
The access time of a tightly coupled memory is equivalent to the cache access time on a cache hit. Many call tightly coupled memories "scratchpads", since they have low-latency access like a cache, and they are recommended when you want to work on data 'locally'. Since not every cache access hits, a tightly coupled memory will achieve higher performance, but how much higher depends on the algorithm and its memory access patterns.

Caches help when your accesses have 'temporal' and 'spatial' locality. Temporal locality means you access the same memory location frequently, so having the data cached saves the CPU the cycles spent fetching from and storing to main memory multiple times. Spatial locality only comes into play when you set the cache line size greater than 4 bytes/line (the native word size of Nios II). The Nios II instruction cache is fixed at 32 bytes/line, but the data cache can be configured for 4, 16, or 32 bytes per line.

With a 16- or 32-byte-per-line data cache, a cache miss loads into the cache line not only the word that missed but also the other words that map to the same line. So if you were accessing a 32-bit array sequentially and a particular access resulted in a cache miss, not only would that array element get loaded, but the elements before/after it would get loaded as well. Which elements get loaded depends on how they line up in memory in terms of the address. In short, spatial locality means that if you frequently access data in the same general region of memory, caches will help minimize main-memory accesses, assuming the cache line size is greater than the processor's native word size.

Here are more details about direct-mapped caches; where I refer to "lines", I'm talking about the "index" portion of the address: http://www.laynetworks.com/direct%20mapped%20cache.htm
Except that there is little point in using the cache when accessing M9K memory blocks. Use the 'dual ports' on the memory so that other Avalon masters can access it.
You probably also want to ensure the linker places all read-only data into the data memory (not the code memory), since you don't want to be doing Avalon transfers (with or without the data cache) for strings and switch-statement jump tables. This probably requires a custom linker script; for a small system, start from an empty file and add sections as you need them!
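A minimal GNU ld linker-script sketch of that idea is below. The region names, origins, and lengths are placeholders; in a real system they must match the on-chip memories defined in your SOPC Builder/Qsys design, and a generated BSP linker script is normally the better starting point:

```ld
/* Sketch only: code_ram/data_ram, origins and lengths are hypothetical. */
MEMORY
{
  code_ram (rx)  : ORIGIN = 0x00000000, LENGTH = 32K
  data_ram (rwx) : ORIGIN = 0x00010000, LENGTH = 32K
}

SECTIONS
{
  .text : { *(.text*) } > code_ram

  /* Keep read-only data (strings, jump tables) in the data memory,
     so the data master never has to fetch it over the Avalon fabric. */
  .rodata : { *(.rodata*) } > data_ram

  .data : { *(.data*) } > data_ram
  .bss  : { *(.bss*) *(COMMON) } > data_ram
}
```

The key line is placing `.rodata` in `data_ram` rather than letting it default into the code memory alongside `.text`.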
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
That's a good point. If you plan on creating a design where the code runs completely on-chip, then I would recommend these two configurations to ensure the maximum performance possible:

During development:
- Turn off the data cache
- Reduce the instruction cache to 512B
- Add a dual-port on-chip RAM that will be pre-initialized with your code
- Hook up tightly coupled instruction and data masters to the dual-port RAM

For the final build:
- Reduce the instruction cache to 0B (this will remove the instruction master)
- Remove the JTAG debug module
- Regenerate the system
- Recompile the software
- Recompile the hardware
Thanks to everyone for the great discussion and extra info. I have a few good ideas to pursue toward an optimal memory system for my application.
John Speth