Very strange Quartus II 7.1 synthesis tool results

Altera_Forum · ‎08-07-2007

Hello everyone,

I have created a project in Quartus II 7.1 with the attached files. The top-level module is TOP.

The files are APBM.v, APBS1.v, APBS2.v, APBS3.v, ARBITER.v, MuxS2M.v, TOP.v

The following line is from MuxS2M.v. Changing only the order of the select signals in the case expression changes the ALUT count:

case ({PSELS3,PSELS2,PSELS1}) gives the worst result (1440 ALUTs)

case ({PSELS1,PSELS3,PSELS2}) gives a better result (1424 ALUTs)

case ({PSELS2,PSELS1,PSELS3}) gives the best result (977 ALUTs)

In principle APBS1, APBS2 and APBS3 should use the same amount of resources, but with the case statements above they use different amounts.

I have optimized the design for Area.

So my questions are:

1) Am I missing something?

2) Is it a bug in Quartus?

3) I tried the same design in Xilinx ISE; in all of the cases I get the same LUT usage.

Eagerly waiting for some replies.

Regards,

Anil

Altera_Forum · ‎08-07-2007

How many LUTs did ISE use? Note that these are not the same statements in Verilog, as the case statement is treated as a priority encoder, essentially a chain of if... else if... else if... statements. So if PSELS1 and PSELS3 are both active, then the first case and the second case will produce different results. Synthesis tools will try to determine whether the cases are mutually exclusive, but that might not be possible if it's a one-hot signal that is decoded in another hierarchy. There are directives to get around this (parallel_case), which might help. Also look at the results and try to determine what was synthesized out and whether it makes sense.
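To make the priority point concrete, here is a minimal sketch. It is not taken from the attached MuxS2M.v; the module name, port names and widths are assumptions for illustration. The first block uses casez so the top-to-bottom priority is explicit; the second uses the parallel_case synthesis directive, which asks the tool to assume the items never overlap.

```verilog
// Hypothetical read mux; names and widths are illustrative, not from MuxS2M.v.
module mux_sketch (
    input  wire        sel1, sel2, sel3,
    input  wire [31:0] d1, d2, d3,
    output reg  [31:0] q_priority,
    output reg  [31:0] q_parallel
);
    // Items are tested top to bottom, so sel3 wins over sel2 and sel1.
    always @(*) begin
        casez ({sel3, sel2, sel1})
            3'b1??:  q_priority = d3;
            3'b01?:  q_priority = d2;
            3'b001:  q_priority = d1;
            default: q_priority = 32'h0;
        endcase
    end

    // Overlapping items, but the directive lets synthesis treat them as
    // mutually exclusive. Simulation still evaluates them in order, so this
    // is only safe if the selects really are one-hot (or all zero).
    always @(*) begin
        casez ({sel3, sel2, sel1}) // synthesis parallel_case
            3'b1??:  q_parallel = d3;
            3'b?1?:  q_parallel = d2;
            3'b??1:  q_parallel = d1;
            default: q_parallel = 32'h0;
        endcase
    end
endmodule
```

If the selects are guaranteed one-hot, both outputs behave the same; the difference is only in how much priority logic the tool is obliged to preserve.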

Altera_Forum · ‎08-07-2007

Hi,

Thanks for your reply. I understand what you are saying. But my point is that, by the logic of my design, only one of PSELS1, PSELS2, PSELS3 or PSELD can be active at a time. I am not really concerned about the logic. My only point is this: since all of the select signals above are mutually exclusive, how can the tool synthesize the design differently just because the select signals are reordered?

If you don't mind, could you try to synthesize the design and look at the results? It shouldn't take long; synthesis takes just under a minute.

Somehow, I feel the results are strange. To look into the details, I checked the synthesis results of the individual entities, and this is where the real problem showed up. For each ordering used in the case statement, the number of LUTs used by modules APBS1, APBS2 and APBS3 was different, whereas all of them have the same size and the same logic, apart from some initial register values.

Therefore, each of them should contain the same number of LUTs.

Regards,

Anil.

Altera_Forum · ‎08-07-2007

I haven't looked at your code closely enough to confirm this, but your posts sound like the situation where multiple instances of the exact same module are different sizes. It is normal for synthesis to implement instances of the same module differently. For example, some logic might be shared between instances.

If your APBS1, APBS2, and APBS3 are the same code and you want all instances to be identical, one solution is to synthesize the module separately and use bottom-up incremental compilation to instantiate that synthesis result for all instances.

Altera_Forum · ‎08-07-2007

Hi Brad,

Thanks for the quick reply. In my case, APBS1, APBS2 and APBS3 are different modules but with the same logic inside. I understand that logic can be shared.

But how do you explain this situation:

case (A)

logic.....

1) A = {a3,a2,a1}

2) A = {a1,a3,a2}

3) A = {a2,a1,a3}

where any one of a1, a2 or a3 can be high, but never two or all three at the same time.

I expect all three cases to produce the same results.

Regards,

Anil

Altera_Forum · ‎08-07-2007

Just based on the discussion in the thread without going through the code, I think what Rysc said might be the explanation. I made my previous post just in case your situation was similar to the multiple-instances-of-same-module situation.

Altera_Forum · ‎08-07-2007

Hmmm. This one's looking slightly devious. I took all the hierarchies and put them in a partition (preventing synthesis across boundaries and making every port exist). The three hierarchies were approximately 330 LEs and exactly 672 registers. Now, in your compiles they could get reduced to as little as ~80 LEs, but always 672 registers. Two things can occur: reductions during synthesis, and merging of logic. For example, they all decode the paddr, so it's possible that some of that decode logic gets merged together, in which case it has to be put into one of the hierarchies. That makes that one look artificially larger than the other two, when what is really occurring is that the other two are artificially smaller, as their logic is represented in another hierarchy.

But in your largest compiles, APBS2 went up to 530 ALUTs, so it grew considerably larger. Looking around a little, I went into the report -> Fitter -> Resource Section -> Control Signals. (By the way, the Resource Section is a gold mine for info, so it's worthwhile to look around and get accustomed to what it shows you.) Anyway, each of the 3 repeated hierarchies uses a lot of clock enables. What I'm sure is occurring is that the decode for a 32-bit bus is being done in a single LUT and then sent to the clock enable of a 32-bit register. So the 32 registers look like "lone registers", in that they don't have an accompanying LUT, but their clock enable is used. So the LUT resource goes down. In the one with 532 LUTs being used, APBS2 has hardly any clock enables. So in these cases, the LUT in front of each register is being used instead.
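As a rough illustration of the two mappings described above, consider a single enabled register word. This is a sketch with made-up names, not code from the attached design; which mapping Quartus picks is decided by the tool, not by the source text.

```verilog
// Hypothetical 32-bit register word; names are illustrative.
module reg_word_sketch (
    input  wire        clk,
    input  wire        sel,     // decoded select for this word
    input  wire [31:0] wdata,
    output reg  [31:0] q
);
    // Functionally, q keeps its value unless sel is high. Synthesis may map
    // sel to the registers' clock-enable input (no LUT per bit), or build a
    // 2:1 mux (sel ? wdata : q) in a LUT in front of each register.
    always @(posedge clk) begin
        if (sel)
            q <= wdata;
    end
endmodule
```

With the clock-enable mapping the 32 flip-flops show up as "lone registers"; with the mux mapping each bit costs a LUT, which would be consistent with the LUT counts swinging as they do in this thread.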

As a note, there are only a few (usually two) clock enables per LAB. That means that if lots of clock enables are synthesized, packing them into LABs becomes difficult. With third-party synthesis and in the early days of Quartus synthesis, I've seen way too many clock enables used, so that the device ran out of LABs when it didn't seem full (this shouldn't be occurring any more). So there is some balancing between when to use the LUT and when to use the clock enable.

The algorithm that does this uses heuristics, which is slang for "I don't know what's going to happen." Or more exactly, over a suite of designs it is optimized to give the best results. But when running test cases, or on particular designs, it may not always do it right.

There's an assignment to force usage of the clock enable, but I don't think it will help in your case, as the clock enable can't easily be isolated to a single node (this assignment works best when you have a register that is used as a clock enable for a large cloud of logic).

You might want to file an SR to have it looked at (and maybe include this thread). Hopefully that's a point in the right direction.

Altera_Forum · ‎08-07-2007

Hi Rysc,

Thanks very much for the detailed explanation. I will try out your ideas and dig deeper to understand what is happening.

Regards,

Anil.

Altera_Forum · ‎08-07-2007

One last thing about what's going on with the "heuristics". Maybe in the small test case, it realizes that such a large percentage of the logic is using clock enables that it might have trouble packing them into LABs, so it decides to use logic for this hierarchy, which seems logical. But once you put more logic into the design, it may switch back to using the clock enable. So it might be worthwhile to put this on the back burner for a while.

And in reality, I would try to get these into M512 RAMs. This will soak up all the flip-flops and all the addressing logic, and my guess is your design will reduce to ~6 M512s and a few hundred LUTs. The reason it won't go in now is that every flop/cell of the memory has a reset signal that can reset it to a specific value. This can't be implemented in embedded memory, although you can have an initialization file, so that it powers up to a specific state (you just can't reset into it). If you really, really need to be able to reset them, you might be able to have "shadow M512s" that hold those values, and on reset their values get dumped into your working RAMs. Just some ideas.
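For reference, a register file written without any per-cell reset might look like the sketch below. The module name, port names and sizes are assumptions, and whether Quartus actually places it in M512 blocks depends on the device, the memory dimensions and the synthesis settings. Power-up contents would have to come from an initialization file, which is not shown here.

```verilog
// Hypothetical register file without a per-cell reset; sizes are assumptions.
module regfile_ram_sketch #(
    parameter AW = 3,   // address width (8 words), illustrative
    parameter DW = 32   // data width
) (
    input  wire          clk,
    input  wire          we,
    input  wire [AW-1:0] waddr,
    input  wire [AW-1:0] raddr,
    input  wire [DW-1:0] wdata,
    output reg  [DW-1:0] rdata
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    // No reset on mem: a reset to specific values would force registers.
    always @(posedge clk) begin
        if (we)
            mem[waddr] <= wdata;
        rdata <= mem[raddr];   // registered read, as embedded RAM expects
    end
endmodule
```

Note that the read is registered, so read data arrives one clock after the address, which the surrounding bus logic would need to tolerate.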

Altera_Forum · ‎08-14-2007

Take a look at the PSELx signals: they feed the register file read MUXes in APBS1..3, out to the “MuxS2M” data inputs. They also feed the MuxS2M selects, generated from the Arbiter. The reconvergence makes them partially redundant. Notice that, for example, PRDATA3 won’t be visible to the outputs unless PSELS3 is high. PSELS3 (on port PSEL) is also used as an enable within APBS3.v, so that can be minimized out as a don’t care.

As Brad and Rysc pointed out, the data are given an unintended priority relationship by the Verilog case statement (even though they are actually one-hot). It appears that the priority order change in “better”, “best”, “worst” is enough to jog the synthesis into recognizing more or fewer redundancies. It is probably hitting a depth limit in the minimization heuristics.

If you edit the arbiter like so …

    wire common_sel = PSEL & (PADDR[15:10] == 6'b000000) /* synthesis keep */;

    always @(*) begin
        PSELS1 = common_sel & (PADDR[9:8] == 2'h1);
        PSELS2 = common_sel & (PADDR[9:8] == 2'h2);
        PSELS3 = common_sel & (PADDR[9:8] == 2'h3);
        // PSELD is not used by the MUX in the current arrangement
        PSELD = 1'b0;
    end

The best and worst behaviors converge to the ‘best’ result in my Quartus. Fewer signals for minimization and clock enables to squabble over. For the long haul you probably want to edit out the extra read selects.

Altera_Forum · ‎08-14-2007

Hello gsynth,

Thanks for your reply. I have modified the files according to what you said, and I have attached them for your reference. Strangely, the ordering that previously gave the worst results now gives the best results, and the ordering that previously gave the best results now gives results worse than the previous worst. If you have some time, please look at the results.

I am unable to go into details because I don't know much about this heuristic approach and so on. Nevertheless, one thing I want to know is whether there is any fault in my thinking. I thought about it in a very simple way and expected the results to match. To me these results look very strange because, ultimately, the end result is the same.

Regards

Anil.

Altera_Forum · ‎08-14-2007

Your files give the same results on Quartus 7.2; I don't have 7.1 anymore. It's going to be a little volatile, because the edit isn’t fixing the underlying redundancy. It just puts the signals on a platter for synthesis to find.

To think about heuristic minimization – Heuristic means that the synthesis tool is going to take a quick look at the design for things to minimize. If it sees the redundancy you get a bonus. If it doesn’t you’re no worse off than you came in. The best to worst spread here is the difference between finding all the redundancy and none of it.

What’s the target application? You can probably get smaller and faster if some of these requirements can be tweaked. Tweak it enough and you can get into RAM, which sounds like the way to go personally.

Altera_Forum · ‎08-14-2007

Hi gsynth,

Actually it is not a real project. I was trying to learn some SystemVerilog basics, so I built this small APB system and then synthesized the design. Later I translated it to Verilog, to see whether SystemVerilog was producing the same results.

In the end, I saw that both produce the same results.

Nevertheless, I will try Quartus II 7.2 and see how I can figure out the issues. My only goal is to get more practice and learn some tricks to get better results.

Regards,

Anil

Altera_Forum · ‎08-14-2007

For your experiments / entertainment - attached reg_file.v

As it turns out it isn't easy to get synthesis to do just one clock enable for the whole register file. I think this mapping is the best use of the cells. You could turn the synchronous reset into a synchronous constant load for 'free'.

The unique clock enable per register word is OK on this small scale, but seems iffy to me as the chip fills.

With System Verilog you could turn those generate loops into arrays.
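For readers without the attachment, the ideas in this post (a generate loop per word, a unique clock enable per word, and a synchronous constant load in place of the reset) could be sketched roughly as follows. This is not the attached reg_file.v; every name, the word count and the load constants are hypothetical.

```verilog
// Hypothetical sketch; this is not the attached reg_file.v.
module reg_file_sketch #(
    parameter WORDS = 4,
    parameter DW    = 32
) (
    input  wire                clk,
    input  wire                sload,   // synchronous constant load (replaces reset)
    input  wire [WORDS-1:0]    word_en, // one write-enable bit per word
    input  wire [DW-1:0]       wdata,
    output wire [WORDS*DW-1:0] q_flat   // all words, word i at [i*DW +: DW]
);
    genvar i;
    generate
        for (i = 0; i < WORDS; i = i + 1) begin : g_word
            reg [DW-1:0] q;
            always @(posedge clk) begin
                if (sload)
                    q <= i;          // illustrative per-word constant
                else if (word_en[i])
                    q <= wdata;
            end
            assign q_flat[i*DW +: DW] = q;
        end
    endgenerate
endmodule
```

Each word updates only when sload or its own word_en bit is high, which is the kind of per-word enable that can end up on the clock-enable input rather than in LUTs.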