Nios® V/II Embedded Design Suite (EDS)
Support for Embedded Development Tools, Processors (SoCs and Nios® V/II processor), Embedded Development Suites (EDSs), Boot and Configuration, Operating Systems, C and C++

DDR vs. SDR RAM...

Altera_Forum
Honored Contributor II

Does anyone know if DDR RAM will fare better with the Nios II data master than SDRAM? 

 

After Dirk figured out it takes a whopping 12 clocks @ 50 MHz (~240 ns) per SDRAM read when not using the DMA, we realized we had to respin our board. Jesse explained that the problem is that the Nios' data master is not latency aware and so must use worst-case timing.  

 

I'm wondering if DDR would fare better. I'm not familiar with it at all. 

 

Anybody know? 

 

I'm looking for a bulk memory that is also high performance with the NiosII. 

 

Thanks, 

Ken
27 Replies
Altera_Forum
Honored Contributor II

Hi, 

 

Would it be a solution to use another type of SDRAM controller that runs at a higher clock speed than the Nios? I think SDRAM can handle more than 100 MHz as input clock. 

Then you could probably turn 12 clocks at 50 MHz into 12 clocks at 100 MHz. 

 

Can you try to run your Nios at a higher clock frequency, possibly using another speed grade for the FPGA? I think 80 to 100 MHz should be possible, depending on the number of peripherals you connect to the Avalon bus. 

 

 

Stefaan
Altera_Forum
Honored Contributor II

You're never going to overcome 12 clocks just by raising the clock. You'd need close to 500-1200 MHz to get the performance you should be getting at 100 MHz. That's not going to happen. 

 

The only thing we can do right now is use memory that has fixed timing, like SRAM or on-chip SRAM. 

 

My question was whether DDR is any better/closer in this respect than SDR. 

 

Ken
Altera_Forum
Honored Contributor II

Sorry, I didn't know the gap in performance was so big.

Altera_Forum
Honored Contributor II

The SDRAM controller included in the Nios II kit only keeps one bank open at a time. 

The DDR controller from Altera keeps multiple banks open. 

This should help improve performance but is a function of your access pattern. 

 

The main problem with the Nios II/f data cache and DRAM performance (SDRAM or DDR) is that it only has a 4-byte line so it doesn't perform burst transfers to/from the DRAM.
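
As a rough illustration of why the short line hurts (a back-of-the-envelope sketch only; the ~12-clock figure is the uncached read cost quoted earlier in this thread, and the 1 word/clock streaming rate is the pipelined/burst rate discussed later in the thread):

/* Back-of-the-envelope only: FIRST_ACCESS_CLKS is the ~12-clock uncached
 * SDRAM read measured in this thread; BURST_WORD_CLKS assumes ~1 word per
 * clock once a pipelined/burst transfer is streaming.                      */
#define FIRST_ACCESS_CLKS 12
#define BURST_WORD_CLKS    1

/* Fetching 32 bytes with the current 4-byte line: 8 independent misses.    */
enum { CLKS_WITH_4_BYTE_LINE  = 8 * FIRST_ACCESS_CLKS };                   /* 96 clocks */

/* Fetching 32 bytes with a hypothetical 32-byte line: one miss, then the
 * remaining 7 words arrive as part of the burst.                           */
enum { CLKS_WITH_32_BYTE_LINE = FIRST_ACCESS_CLKS + 7 * BURST_WORD_CLKS }; /* 19 clocks */
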
Altera_Forum
Honored Contributor II

Hi James, 

 

Can you elaborate on NiosII/f SDRAM access? Jesse indicated the largest part of the problem was that the NiosII/f data master was not "Latency Aware". 

 

So anything not DMA'd or read out of the cache incurs a large timing hit, as demonstrated in the other thread on this topic, even if the reads are back to back in the same bank. 

 

A workaround or a glimpse of the roadmap would sure be welcome. 

 

Thanks, 

Ken
Altera_Forum
Honored Contributor II

Hi Ken, 

 

(I still owe you guys a write-up, it will come soon I swear!) 

 

My earlier comments about latency awareness were a bit misguided. A subsequent poster in that thread hit the nail on the head -- in a CPU you can't just queue reads (that is what utilizing latency awareness implies: you 'post' reads and get them back in succession). The reason makes my earlier statement look a bit dumb: it's a processor. There isn't a way to know whether the data you're reading in one instruction has relevance in the next instruction and so forth. For this reason it's crucial, for performance, to have things cached, or to use some other hardware (DMA) that can take advantage of latency to shovel things around. 

 

James could probably elaborate more on the above, discussing things such as scoreboarded loads, but I will leave that to him if he wishes, as I'm not the processor expert (aside from sitting next to James :-) ). 

 

That said, I think I know what James was getting at in his last post: the data cache line size. When a data cache "line" needs to be updated, it is done a line at a time. So faster SDRAM access from the CPU could be achieved if your cache lines were big enough to permit latency-awareness (pipelining) of the reads that fill the line; increasing our cache line size would do that. Now, that said, there are probably reasons and ramifications behind it being only a 32-bit line size... I'll leave that to James. 

 

One thing I'd like to include in my long-overdue write-up is a discussion of ways to simplify DMA transfers (make them take fewer instructions to set up) to help alleviate this. 

 

As for your original SDRAM question: I'm afraid I'm not the DDR expert. I'll be learning more about it in the coming months though as our next dev boards will include DDR SDRAM.
Altera_Forum
Honored Contributor II

Hi Jesse, 

 

I'm still confused as to why the timing diagrams in the SDRAM datasheet show that we can get data in a fraction of the time it actually takes. 

 

Once the chip select and address are on the address bus, it's only a clock or three before the SDRAM chip has the data ready on the data bus. So where are the other 9-11 clocks consumed? Is there a 6+ clock delay for addresses emanating from the Nios onto the Avalon bus? Or are there long delays delivering the data back into the Nios?  

 

It wasn't that long ago that I was writing code on PCs running at around 100 MHz with the same PC100 memory, and random memory reads were 50-60 ns if I remember correctly. What gives? 

 

Thanks, 

Ken
Altera_Forum
Honored Contributor II

Hi Ken, 

 

I don't know if you read my previous posts, or if they were a bit confusing, but looking at the HDL code generated by SOPC Builder for the Avalon SDRAM controller and at the SDRAM datasheet, I see that it takes 7 clocks from the moment the controller gets the read request until it puts the read data back onto the bus (for CAS latency = 2), assuming that the row is already active and we don't need the RAS phase.
Altera_Forum
Honored Contributor II

Hi Clancy, 

 

So is the controller pipelined then? Is that the reason Altera can claim one-cycle access? Does it pump out successive words every clock after the first 7?  

 

Have you learned enough to account for all 12 cycles on the back to back reads from the same row? 

 

Do you think it's inevitable, or can something be done?  

 

I need fast 15-bit table lookups, so what I did was respin my Cyclone/SDRAM board with Stratix and SRAM. I should have protos next week. I would like to be able to go back to Cyclone/SDRAM if the core + controller can be made workable. 

 

Thanks, 

Ken
Altera_Forum
Honored Contributor II

Hi Ken, 

 

Yes, the controller is pipelined, and so is the SDRAM chip. You can achieve 1 word per clock on reads only if you queue up the read requests in advance, in order to fill up the pipeline (DMA-like). The CPU data master cannot request data in advance because it usually cannot predict where the next read will be, so it has to wait until the current read goes all the way through the pipeline before the CPU finishes the current instruction, advances to the next, and issues the next read. 7 clocks are spent in the controller/chip, but I am still not sure where the remaining 5 clocks come from, and the source code for Nios II is not available. 

The Nios II data cache doesn't help in this case, but if YOU know where your reads will be, you can try to write a custom cache controller to optimize the read pipeline (things could get complicated though, and if you need the product fast, maybe SRAM is a better option). 

 

Good luck, 

 

clancy
Altera_Forum
Honored Contributor II

Ken, is the issue the hit rate of the data cache or the time to process a miss? 

 

Since the data cache is a write-back cache with 4-byte lines, every time you have a cache miss it can result in a 4-byte write to Avalon (if the victim line is dirty) and then a 4-byte read from Avalon to fetch the new line. Because the CPU doesn't have a non-blocking cache, the CPU pipeline stalls while these Avalon transfers are performed. 

Would a larger cache line size help your problem? If so, the Avalon reads and writes would be bursts, which would tend to lower the average number of cycles on a miss, but only if you need the other data in the line. The CPU would still be stalled while these bursts are happening. More advanced CPUs have features like non-blocking caches, scoreboarded loads, and even out-of-order execution to try to keep the CPU busy while stalled for memory accesses. Alas, Nios II has none of these features, since they are probably too aggressive to implement in an FPGA while still achieving acceptable Fmax. 

 

I've designed chips in the past with color space conversion blocks for image processing. The table accesses were always reads of 4 bytes but were not related to each other (low temporal and spatial locality). We ended up storing this table in an off-chip SSRAM instead of the SDRAM because it was very wasteful of the SDRAM bandwidth. To get good performance with SDRAM, you need to make large bursts (e.g. 16 or 32 bytes) and you should also have high temporal and spatial locality of reference.
Altera_Forum
Honored Contributor II

Hello, 

 

Please read the "nios2 sdram performance" thread (http://www.niosforum.com/forum/index.php?act=st&f=2&t=629) for a deeper explanation of the issue. There are even oscilloscope images available. 

 

Dirk
Altera_Forum
Honored Contributor II

Hi James, 

 

It's the time to process a miss, or the time to process a read in the absence of a data cache. 

 

I feel like I completely understand your explanation, but I still don't see what it has to do with the worst-case time to access SDRAM. 

 

Forget the cache: how long does it take for an address to show up on the Avalon bus, and then how long does it take for the SDRAM to place the requested data back on the bus? 

 

Am I misreading the SDRAM datasheet? It just shows the address and chip select activating and data being placed on the bus in just a few clocks. The number of clocks is equal to the CAS setting for accesses to the same row (i.e. Figure 8, "Random Reads", in the MT48LC4M32B2 datasheet). 

 

So, stalled/cached/queued or not, once the Avalon bus puts out the address and selects the SDRAM, it responds in CAS clocks with the result. 

 

So that's 3 clocks with the CAS3 setting. Now where are the other 9 clocks coming from? Dirk's scope shots show 12 clocks between back-to-back reads on the same row. The CPU stalling until the result is returned is fine, because the code needs that result to continue. I'm just not seeing the justification for a 12-clock stall to read a non-cached word from SDRAM. 

 

Thanks, 

Ken
Altera_Forum
Honored Contributor II

Hi Ken, 

 

When I look back at the diagram I already posted in the first thread on this topic, I see that the SDRAM controller needs 3 clocks to assert CAS after it gets chip-selected (internally), plus 2 clocks of CAS latency, plus 1 clock in the SDRAM controller (I suppose the input registers on the SDRAM data bus). The remaining 5 cycles appear to get lost somewhere inside the Nios (the CAS-to-CAS time was 11 cycles in the case I observed). 

 

Using DDR SDRAM would not help at all, because it mainly increases the burst rate (2 bits per pin per clock instead of 1) but has basically the same latency behavior. 

 

You wrote that you need a 15-bit look-up table. If the values to look up are really random, there will be no suitable solution with Nios + SDRAM, I think. I would recommend that you perform the look-up task in dedicated "hardware" in the FPGA. There you can do some pipelining and achieve 1 value per clock with the SSRAM you mentioned some time ago. With a dedicated SDRAM controller you could also get better speed, but it is very difficult to get high worst-case performance out of SDRAM for truly random accesses. (The original problem from dziegel was really predictable sequential accesses, where you can achieve almost 1 access per clock with a non-Nios solution.) 

 

Maybe we could help you more if we get more details about your application. 

 

Regards 

 

Thomas 

 

www.entner-electronics.com (http://www.entner-electronics.com)
Altera_Forum
Honored Contributor II

Hi Thomas, 

 

In your analysis you have the Nios + SDRAM controller consuming 11 - 3 = 8 clocks. That's 8 clocks of overhead. Is this just to be expected as normal?  

 

How much overhead would be added on, say, a ColdFire or an ARM or some other soft core? Do all/most embedded processors add over 300% overhead to memory reads? I don't know for sure, but it doesn't sound right. 

 

I'd like to establish this as either an oversight, a work in progress, or just the way it is, and then have it documented. The current literature promises either "single cycle" or ">1 clocks" to access SDRAM. (11 != 1) 

 

Thanks, 

Ken
Altera_Forum
Honored Contributor II

Ken, 

 

I just sat down with James (who posts here sometimes) and we went over the numbers. The >= 1 clock in the documentation refers to all loads. A load that is a cache hit takes 1 clock. Everything else pays a penalty. Here is a rough break-down of the overall latency: 

 

- ld instruction occurs 

- cache miss - tick 

- prepare Avalon read - tick 

- Avalon read signals asserted - tick 

- wait for Avalon. The fastest memory would have the data back on this clock. A random SDRAM access takes 5, as evidenced by the previous discussion - 5 ticks 

- register incoming data - tick 

- align (this is because it's possible that the user wanted an 8- or 16-bit load) - tick 

- instructions immediately following that need the load data? another 2 ticks (this is seldom the case) 
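
Summing those ticks (just a tally of the numbers listed above, nothing more):

/* Tally of the per-load penalty listed above for an uncached SDRAM read.    */
enum {
    CACHE_MISS_CLK    = 1,   /* cache miss detected                          */
    AVALON_PREP_CLK   = 1,   /* prepare Avalon read                          */
    AVALON_ASSERT_CLK = 1,   /* Avalon read signals asserted                 */
    SDRAM_WAIT_CLKS   = 5,   /* random SDRAM access (per earlier discussion) */
    REGISTER_CLK      = 1,   /* register incoming data                       */
    ALIGN_CLK         = 1,   /* byte/half-word alignment                     */
    DEPENDENT_CLKS    = 2    /* only if the next instruction needs the data  */
};
/* 1+1+1+5+1+1 = 10 clocks minimum; 12 if the result is needed immediately,
 * which lines up with the ~12 clocks seen on Dirk's scope shots.            */
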

 

As you can see, it pays to have something cached! A couple of the clocks above are a result of Nios II being optimized for fmax -- it makes sense to run it as fast as possible. One note: if your main performance bottleneck is loading this data (which cannot be cached), and you're changing your board to run from faster memory, it may make sense to try the /s core. The reason is that you'll save 1 or 2 cycles per load, as the "cache miss" and prepare-Avalon-read penalties aren't there.  

 

Also, I realize you're working with small data buffers, but if they start to get larger (10, 20+ bytes perhaps) it would start making sense to do a quick DMA. By quick I mean set up an initial DMA and then do a few register writes to the peripheral directly to kick off a transfer. The basic things needed: start address, stop address, mode, transfer count. I think a couple of these retain their values, so it may be possible to start a DMA with 2-3 I/O writes (this is part of that promised write-up -- all I'd do, if you want to pre-empt me, is look at the DMA datasheet on that one). The DMA controller will get one word of data per clock out of SDRAM after the initial penalty. I'd like to get into this more now, but I have to catch a flight this afternoon. Happy holidays.
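
To make that concrete, here is a minimal C sketch of such a register-level DMA kickoff. The register offsets, control bits, and the dma_copy() name are assumptions sketched from the DMA controller datasheet, not something posted in this thread, so verify them against your version of the datasheet before relying on them:

#include "io.h"   /* IORD()/IOWR() Avalon access macros from the Nios II HAL */

/* Register word offsets and control bits; assumed from the DMA controller
 * datasheet -- check them against your version.                             */
#define DMA_STATUS_REG     0
#define DMA_READADDR_REG   1
#define DMA_WRITEADDR_REG  2
#define DMA_LENGTH_REG     3
#define DMA_CONTROL_REG    6

#define DMA_CTRL_WORD      (1u << 2)   /* transfer 32-bit words              */
#define DMA_CTRL_GO        (1u << 3)   /* enable the transfer                */
#define DMA_CTRL_LEEN      (1u << 7)   /* stop when the length reaches zero  */
#define DMA_STAT_DONE      (1u << 0)

/* Copy 'bytes' bytes (a multiple of 4) from SDRAM to a fast buffer using a
 * handful of I/O writes, then spin until the controller reports DONE.
 * 'dma_base' is the controller's base address from system.h (a hypothetical
 * DMA_0_BASE). With a data cache you would also need to flush/bypass it;
 * that is omitted here.                                                      */
void dma_copy(unsigned int dma_base, void *dst, const void *src,
              unsigned int bytes)
{
    IOWR(dma_base, DMA_STATUS_REG,    0);                  /* clear DONE     */
    IOWR(dma_base, DMA_READADDR_REG,  (unsigned int) src);
    IOWR(dma_base, DMA_WRITEADDR_REG, (unsigned int) dst);
    IOWR(dma_base, DMA_LENGTH_REG,    bytes);
    IOWR(dma_base, DMA_CONTROL_REG,   DMA_CTRL_WORD | DMA_CTRL_LEEN | DMA_CTRL_GO);

    while (!(IORD(dma_base, DMA_STATUS_REG) & DMA_STAT_DONE))
        ;   /* DMA streams ~1 word/clock out of SDRAM after the initial penalty */
}

If the address registers really do retain their values between transfers, as Jesse suggests, a follow-up transfer could get away with just the length and control writes.
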
Altera_Forum
Honored Contributor II

Hi Ken, 

 

 

--- Quote Start ---  

originally posted by kenland@Dec 22 2004, 01:47 PM 

[Ken's post above, quoted in full] 

--- Quote End ---  

 

 

I do not think that a "real" processor would be that slow, but there would for sure be some clocks of delay too when you have truly RANDOM accesses. The Nios needs a few more cycles, as it is heavily pipelined to achieve a high fmax (as Jesse / James pointed out). 

 

This problem (that memory accesses that miss the cache add a large delay) is basically the reason why Intel added Hyper-Threading to the Pentium 4: while the Pentium 4 is waiting for the data, it "simply" switches to another task, so it can do something useful while waiting. (Of course the Pentium 4 is a much more sophisticated architecture with out-of-order execution and such, and a cache miss there is an even larger penalty, because the core operates at a much higher frequency (e.g. 3.6 GHz) than the memory (e.g. 400 MHz), so you easily get delays in the range of about 50 to 100 CPU clock cycles.) 

 

Merry Christmas 

 

Thomas 

 

www.entner-electronics.com (http://www.entner-electronics.com)
Altera_Forum
Honored Contributor II

Hi, 

 

Originally I chose Nios (the first one) because the X brand had a processor where it took some 7 to 8 cycles to read data from external memory (even fast SRAM).  

 

Nice to hear that the second version of NIOS is going in the same direction ;-) 

 

 

A good start in 2005 for everyone! 

 

Stefaan
Altera_Forum
Honored Contributor II

Jesse, 

 

Thank you for this incredibly valuable info. Please, please add it to the documentation. I've already respun my board with a Stratix over this, plus the bit-shifting problem (1 clock per bit without the hardware multiplier!). I'd hate to see this happen to someone else! 

(At least say 1 clock if cached, 7+ if not.) 

 

Based on this new info I'm not sure the Stratix will help enough. I don't know exactly what the read overhead on our ColdFire is, but I suspect it is much less than 7 clocks minimum for non-cached reads. Data caching is really of little use for many embedded applications that are always streaming or otherwise processing only new information (music, video, scanning, almost anything...). In fact, what is typically interesting about revisiting the same old data? 

 

I wonder if there is a way to DMA into the data cache to get work packets down to near 2 clocks (overhead + one clock for the DMA + one clock for the actual read)? Actually, I'm surprised the existing cache controller doesn't assume read-ahead and do this already. 

 

I hope Altera will see how crucial fast memory access is. The good news is that anything that can be done will improve performance by 10%+ for each clock eliminated! 

 

Thanks, 

Ken
Altera_Forum
Honored Contributor II

Hi Ken, 

 

In most C programs, most data transfers are to/from the stack, where a data stack helps a lot. 

 

If you have a stream of data to process, maybe you can control the streaming yourself and always read the next word from the same address (not cached, and also not SDRAM, of course). You could also implement custom instructions to access the stream; that would be even faster. Of course it is a pity that the data cache / SDRAM controller does not read a line of data (then you would pay the cache-miss penalty only on the first word of a cache line). You could also do a DMA transfer to an internal SRAM block and use that as a "cache" for further processing. 
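
As a sketch of that last idea (DMA into an internal SRAM block, then process from there); the chunk size, the handle_sample() hook, and the dma_copy() helper from the earlier sketch are illustrative assumptions, not code from this thread:

#define CHUNK_WORDS 256                  /* size of the on-chip staging buffer */

extern void handle_sample(unsigned int sample);    /* your per-word processing */
extern void dma_copy(unsigned int dma_base, void *dst, const void *src,
                     unsigned int bytes);          /* see the earlier sketch   */

/* Place this buffer in an on-chip RAM block (linker/section settings).        */
static unsigned int chunk[CHUNK_WORDS];

/* Walk a long SDRAM-resident stream one on-chip chunk at a time.              */
void process_stream(unsigned int dma_base, const unsigned int *sdram_src,
                    unsigned int n_words)
{
    unsigned int done = 0;

    while (done < n_words) {
        unsigned int count = n_words - done;
        unsigned int i;

        if (count > CHUNK_WORDS)
            count = CHUNK_WORDS;

        /* Burst the next chunk out of SDRAM at ~1 word/clock via the DMA.     */
        dma_copy(dma_base, chunk, sdram_src + done, count * sizeof(unsigned int));

        /* Process out of fast on-chip RAM instead of paying the full SDRAM
         * read latency on every single word.                                  */
        for (i = 0; i < count; i++)
            handle_sample(chunk[i]);

        done += count;
    }
}
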

 

You also mentioned a 16-bit LUT, I think: if it is a continuous transfer function, it may be possible to reduce its size (to, let's say, 256 points) and interpolate in between (in FPGA "hardware"). Then you can put the LUT into an internal SRAM block. If you implement this with e.g. custom instructions, you could achieve about 2 clocks per look-up. If you implement it in a clever way, the resource usage (LCs and RAM blocks) would not be too high, I think. 
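
A software sketch of the reduced-table idea (Thomas is talking about doing the interpolation in FPGA logic or a custom instruction; this plain C version just shows the arithmetic, and the names are made up):

/* 257 entries so that lut[idx + 1] is always valid; small enough to live in
 * an on-chip RAM block instead of SDRAM.                                      */
static unsigned short lut[257];

/* Approximate a 16-bit -> 16-bit transfer function by linear interpolation
 * between 256 stored points. x is the raw 16-bit input.                       */
unsigned short lookup_interp(unsigned short x)
{
    unsigned int idx  = x >> 8;        /* which of the 256 segments            */
    int          frac = x & 0xff;      /* position within the segment (0..255) */
    int          a    = lut[idx];
    int          b    = lut[idx + 1];

    /* a + (b - a) * frac / 256, i.e. a straight line between the two points.  */
    return (unsigned short)(a + ((b - a) * frac) / 256);
}
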

 

Regards, 

 

Thomas