Megawizard FIFO problem: Write and read request filling FIFO!

Altera_Forum · 12-10-2007

Hi,

Well, this has to be the most bizarre problem I've ever encountered. I've found that asserting the read request signal to the FIFO occasionally causes the FIFO used-words counter to increment and corrupts the data output. The FIFO works normally for what I'd estimate to be several thousand read/write operations before the problem occurs.

I've implemented a MegaWizard FIFO with the following specs:

Size: 64 words

Width: 32-bits

Read Clock: 133 MHz

Write Clock: 33.25 MHz (derived from read clock)

Clock sync: 2-stage clock synchronization (for asynchronous read and write clocks)

FIFO mode: Legacy

Memory: M4K

Optimised for: Speed

No overflow protection circuitry has been disabled.

I've attached a JPG of the SignalTap capture and also a zip file containing the vector waveform file (VWF).

I'm using the TimeQuest analyser and Quartus 7.2. Using the multicorner analysis, TimeQuest does not report any timing violations.

I've defined the two clocks as the following in my SDC:

create_clock -name sclk -period 7.519 -waveform { 0.000 3.759 }

create_generated_clock -name sclk_1_4 -source -divide_by 4

I chose the 2-stage synchronizer for the FIFO read/write operations as a precaution for any phase difference between the clocks that might cause a problem.

I can't think of anything else obvious. Any comments or suggestions would be appreciated.

Evan.

Altera_Forum · 12-10-2007

Bizarre indeed. Note that your read/write counters always increment, and the used words value is just the difference between the two. Is it possible that, without protection circuitry, you've read past empty or written past full? This basically gets your counters to flip around each other, so a read no longer moves the read counter toward your write counter (making the difference, rdusedwds, get smaller). Instead it moves the read counter further away, making the difference get larger.

Can you make the two clocks outputs of a PLL? If so, with a 4:1 ratio their rising edges will be aligned every fourth read clock, and you no longer have to treat them as having asynchronous phases.
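The pointer wrap-around described above can be sketched in a few lines of Python. This is only an illustrative model, not the MegaWizard FIFO's actual logic: the class name `PointerFifoModel`, the 6-bit pointers, and the choice to treat 63 used words as "full" are all assumptions made for the example. It shows that when the used-words count is computed as a modulo difference of two free-running pointers, a single unprotected read from an empty FIFO makes the count jump from 0 to nearly full.

```python
DEPTH = 64  # FIFO depth from the original post; pointers wrap modulo DEPTH


class PointerFifoModel:
    """Hypothetical pointer-only FIFO model (no data storage)."""

    def __init__(self, protect=True):
        self.wr = 0             # write pointer, free-running modulo DEPTH
        self.rd = 0             # read pointer, free-running modulo DEPTH
        self.protect = protect  # models the overflow/underflow protection option

    def used(self):
        # Used words is just the modulo difference of the two pointers.
        return (self.wr - self.rd) % DEPTH

    def write(self):
        # Assumption: one slot is kept free so full and empty differ.
        if self.protect and self.used() == DEPTH - 1:
            return False  # write ignored when full
        self.wr = (self.wr + 1) % DEPTH
        return True

    def read(self):
        if self.protect and self.used() == 0:
            return False  # read ignored when empty
        self.rd = (self.rd + 1) % DEPTH
        return True


if __name__ == "__main__":
    unprotected = PointerFifoModel(protect=False)
    unprotected.read()          # read past empty
    print(unprotected.used())   # count wraps around instead of staying at 0

    protected = PointerFifoModel(protect=True)
    protected.read()            # rejected, pointers untouched
    print(protected.used())
```

With protection enabled the illegal read is simply ignored, which is why the state of that option matters when interpreting the capture.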

Altera_Forum · 12-10-2007

The vectors look like what you get when a FIFO overflows or underflows. If that is the case, you could see the full or empty pulse shortly before the capture window. Equality comparators are cheaper than LT/GT comparators, so once you overflow it tends to stay that way.

It is downright difficult to provoke a metastability failure at 133 MHz, so it's unlikely to be synchronization related.

Altera_Forum · 12-11-2007

Rysc/Gsynth,

The protection circuitry is still enabled, which should prevent any underrun/overflow situations, and my PLLs are already both in use.

I'm starting to think it's related to something else entirely, perhaps voltage or a manufacturing fault. I've checked the voltages and they seem normal.

If there is a timing problem and my constraints haven't detected it, could it really result in read requests incrementing the count for more than 64 clock cycles? It just doesn't seem right.

Has anyone ever had a faulty FPGA? What were the indications?

Thanks for the suggestions.

Altera_Forum · 12-11-2007

Do you have the chance to let this run at other frequencies, or with other frequency relations between the clocks? Could you figure out a time pattern in the corrupted data? Which device is it?

Altera_Forum · 12-11-2007

Just out of curiosity, your first post says the overflow circuitry is not enabled, but your second post says it is. Can you tap out the actual counters in the FIFO (rather than used words, which is calculated from the counters)? That way we can tell whether the counters are in the wrong spot or the calculation of usedwds is off. In fact, I would grab as many registers as you can out of the FIFO and compare good rdreqs to bad rdreqs. I would be extremely surprised if this is a device issue:

- If you do a different place and route, does the issue still occur? This will implement completely different logic, routing, etc., just the IO will be the same.

- The fact that it works for a while and then goes off seems like a timing/functional issue.

- Timing issues would not result in 64 clock cycles. I'm not sure why you're asking that. I'm assuming your logic has correctly raised the rdreq flag (for however many cycles), and it's just that the rdusedwds is incrementing that looks incorrect. Is there something else wrong?

Altera_Forum · 12-11-2007

Ditto Rysc - it's not going to be the device hardware. You might want to spec the clocks to something very fast (e.g. 1GHz) to make sure there isn't some typo / rounding error where 133 divided by 4 turns into 32, etc.

If you send along your FIFO source file(s) I can double check the overflow logic is working properly.
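As a quick sanity check on the rounding concern, the clock arithmetic can be worked through from the 7.519 ns period given in the SDC above. This is just a sketch; the variable names are made up for the example.

```python
# Period of the read clock as written in the SDC constraint, in nanoseconds.
SCLK_PERIOD_NS = 7.519
DIVIDE_BY = 4  # from the create_generated_clock constraint

# Frequency in MHz is 1000 / period in ns.
sclk_mhz = 1000.0 / SCLK_PERIOD_NS

# A divide-by-4 generated clock has 4x the period and 1/4 the frequency.
sclk_1_4_period_ns = SCLK_PERIOD_NS * DIVIDE_BY
sclk_1_4_mhz = sclk_mhz / DIVIDE_BY

if __name__ == "__main__":
    print(f"sclk:     {sclk_mhz:.3f} MHz")
    print(f"sclk_1_4: {sclk_1_4_mhz:.3f} MHz, period {sclk_1_4_period_ns:.3f} ns")
```

The constrained period corresponds to very slightly under 133 MHz, so the divided clock comes out just under 33.25 MHz rather than rounding to something like 32 MHz.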

Altera_Forum · 12-12-2007

Well, this is a nice coincidence, but I just talked to someone else who was seeing something similar. They were doing constant writes (no reads), and their write counter, which had been incrementing, suddenly started decrementing. What device are you targeting, and what version of Quartus?

I looked at the design a little closer today and there are some other issues around it, i.e. lots of logic is getting synthesized out around the FIFO, which causes logic in the FIFO to get removed(it doesn't even use a memory block). So I think something else is causing it to act "funny", but we haven't narrowed it down yet. Does logic get removed from your FIFO or does it all seem to be intact? Does it use the right amount of memory blocks?

Altera_Forum · 12-12-2007

Thanks for your help. I have solved the problem.

Rysc,

The double negative was confusing: "No overflow protection circuitry has been disabled." I meant the protection circuitry was enabled.

I was suggesting that, because it remains broken for more than 64 cycles (the size of the FIFO), it couldn't be a timing issue, as surely that would only result in a glitch of one or two cycles.

In the end it turned out the clock being fed into the FPGA (133 MHz) was terminated with an incorrect resistor value (22R). This resulted in the other FIFO clock (33 MHz), which was derived from the first, being prone to glitches. I was able to trigger on this glitch on the scope, and it occurred at the same time the FIFO count started incrementing. What I can't understand is that the glitch only lasted for one clock cycle, yet the FIFO errors seem to last for multiple clock cycles (>10).

Thanks for all your suggestions.

Altera_Forum · 12-12-2007

Strange. A glitch can cause consecutive clock edges faster than the system runs(as well as other issues), causing things to jump to bad states. For example, there are gray-coded counters incrementing, and a glitch could easily cause them to jump to the incorrect state. That being said, it seems like the used words should jump to the wrong value but from there increment/decrement in the correct direction. I don't think there's any sort of state-machine, i.e. anything dependent on the previous state(besides the gray counters themselves). But glad to hear it's working and nice job finding it. That's the type of thing someone like me would spend forever in the part trying to capture/debug.

Altera_Forum · 12-13-2007

So I knew about as much about gray counters as everyone else: they follow a count sequence that only flips one bit at a time. So I made an async FIFO and created a simple simulation where I could see the gray code counters increment. Looking in the RTL view, there is a parity register, a count register and decode/encode logic to calculate the next state. The parity register basically toggles between 0 and 1, which makes sense. So I added an assignment to have the parity register power up to 1 and reran the simulation.

Lo and behold, the counter began gray code counting backwards. Never knew they worked that way, but pretty interesting.

So... the parity bit is basically just inverting itself. If you had a glitch that caused it to have one extra toggle, then the parity bit will have an extra inversion, and the counter will count backwards.
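The effect can be reproduced with a small software model. This is a sketch of one common way to build a gray counter with a separate toggling parity register, not the actual netlist the FIFO generates; the function names and the 3-bit width are assumptions chosen for the example. On even parity the counter flips bit 0; on odd parity it flips the bit to the left of the lowest set bit (or the MSB if no lower bit is set).

```python
def gray_step(gray, parity, nbits):
    """Advance a gray counter one clock, using a stored parity bit."""
    msb = 1 << (nbits - 1)
    if parity == 0:
        gray ^= 1                      # even step: flip the LSB
    else:
        low = gray & (msb - 1)         # bits below the MSB
        if low == 0:
            gray ^= msb                # odd step, nothing set below MSB: flip MSB
        else:
            lowest_one = low & -low    # isolate the lowest set bit
            gray ^= lowest_one << 1    # flip the bit to its left
    return gray, parity ^ 1            # parity register simply toggles


def gray_to_bin(gray):
    """Convert a gray-coded value to plain binary."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def run(nbits, steps, parity=0, gray=0):
    """Return the binary count seen over `steps` clocks."""
    counts = [gray_to_bin(gray)]
    for _ in range(steps):
        gray, parity = gray_step(gray, parity, nbits)
        counts.append(gray_to_bin(gray))
    return counts


if __name__ == "__main__":
    print(run(3, 8, parity=0))  # parity aligned with the count value
    print(run(3, 8, parity=1))  # parity register off by one toggle
```

With the parity aligned, the decoded count goes up by one each clock. With the parity register started in the opposite state, which is what one extra toggle from a glitch would leave behind, the same logic steps through the sequence in reverse.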

(I think my case is the same thing, whereby the clock is glitching. If we move the clock to a PLL generated clock, it starts working.)

Anyway, I think you've already fixed it and knew the glitch was causing the problem, but it's nice to know what's going on at a more fundamental level. I'm just amazed this came up twice within 24 hours, and I'd never seen it for ten years prior.

Altera_Forum · 12-14-2007

Hey Rysc,

Now I can sleep at night! I was curious as to how something as seemingly robust as the FIFOs could malfunction in such a way. I'm grateful, as at the moment I don't have the time to investigate (project overdue). It's reassuring to know that there was a logical explanation. I think I've learnt a valuable lesson, and that's not to discount metastability for any problem, and the value of a real scope!

Kind regards,

Evan.

Altera_Forum · 12-14-2007

Yeah, it was really bothering me, as I assumed the counters always counted up. Note that technically this isn't a metastability issue (where a register goes metastable), but a glitch (an extra clock edge faster than the design will run at), causing the parity bit to get out of alignment with the count value. I was thinking it could be more robust without the parity register (and just do an XOR decode of the count value), which would make it always count up. This would make the design larger and slower, and the bottom line is that I've never seen a design that can handle clock glitches. So though the FIFO pointer might not count backwards, the counter would still jump to an incorrect value, state machines would jump to bad states, etc. I often under-appreciate how dependent everything is on a good clock.
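A rough sketch of that alternative, again as a software model rather than real FIFO logic (the function names and 3-bit width are assumptions for illustration): the parity is recomputed each step as the XOR of all the gray count bits, so there is no separate register that can fall out of alignment.

```python
def gray_step_decoded(gray, nbits):
    """Advance a gray counter one clock, deriving parity from the count."""
    parity = bin(gray).count("1") & 1  # XOR of all count bits
    msb = 1 << (nbits - 1)
    if parity == 0:
        return gray ^ 1                # even: flip the LSB
    low = gray & (msb - 1)             # bits below the MSB
    if low == 0:
        return gray ^ msb              # odd, nothing set below MSB: flip MSB
    lowest_one = low & -low            # isolate the lowest set bit
    return gray ^ (lowest_one << 1)    # flip the bit to its left


def gray_to_bin(gray):
    """Convert a gray-coded value to plain binary."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


if __name__ == "__main__":
    nbits = 3
    for start in range(1 << nbits):
        nxt = gray_step_decoded(start, nbits)
        print(gray_to_bin(start), "->", gray_to_bin(nxt))
```

From every possible state the next value is the current count plus one (modulo the counter size), so a corrupted count would land on a wrong value but keep counting upward from there, at the cost of the wider XOR in the next-state logic.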