Avalon MM Slave waitrequest signal modus-operandi

Altera_Forum · 11-20-2014

Greetings, everyone.

I'm trying to integrate a custom peripheral into the HPS system using an Avalon MM Slave and the Lightweight H2F bridge, but I'm having a little trouble synchronizing the read operation.

To learn how the MM Slave works, I designed a simple module that computes the sum of the squares of two floating-point numbers. The inputs are written into two registers, then a start command is sent to a control register. The result is written into a result register. The operation takes 13 clock cycles to complete, using MegaWizard FP units. Then the result is read back. In software, this translates into 3 write operations and 1 read operation. I used the following snippet to do so:

alt_write_word(PERIPHERAL_ADDRESS_A, a); //The address of the first input register
alt_write_word(PERIPHERAL_ADDRESS_B, b); //The address of the second input register
alt_write_word(PERIPHERAL_ADDRESS_C, 1); //The address of the control register
alt_read_word(PERIPHERAL_ADDRESS_D); //Reading the result register
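
A side note on the software side, as a minimal sketch only: assuming a and b are C floats (their declarations are not shown here), the registers need to receive the IEEE-754 bit pattern rather than a value converted to an integer. The register offsets, the helper names and the plain pointer access below are all hypothetical stand-ins for the alt_write_word / alt_read_word calls above, so that the sequence can be tried on a host against an ordinary array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical word offsets; the real map comes from the Qsys address assignment. */
enum { REG_A = 0, REG_B = 1, REG_CTRL = 2, REG_RESULT = 3, REG_COUNT = 4 };

/* Reinterpret a float as its 32-bit pattern (assumes a 32-bit IEEE-754 float). */
static uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Reinterpret a 32-bit register value as a float. */
static float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Write both operands, then the start command (three writes, as in the post). */
static void sos_start(volatile uint32_t *regs, float a, float b)
{
    assert(regs != NULL);
    regs[REG_A] = float_to_bits(a);
    regs[REG_B] = float_to_bits(b);
    regs[REG_CTRL] = 1u;
}

/* Read the result register back as a float (one read, as in the post). */
static float sos_result(const volatile uint32_t *regs)
{
    assert(regs != NULL);
    return bits_to_float(regs[REG_RESULT]);
}
```

On the target, regs would point at the mapped peripheral base instead of an array; this sketch says nothing about when the result becomes valid, which is the actual question here.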

My problem is with the waitrequest signal. If I understood correctly, if the master issues a read request while waitrequest is asserted, it holds the operation until the signal is deasserted, then completes the request in the next clock cycle. So I designed the signal to stay asserted until the result is saved in the result register. Thus, the software code above should give me the result, since the read operation would be held until the 13 clocks have passed and the result register has been updated with the correct result, even with the clock speed difference between the HPS and the FPGA (800 MHz and 200 MHz, respectively).

But it does not happen this way. When I run the code, I get the previous result, or 0 if it is the first run. It seems the read operation is carried out before the peripheral is done with the calculation, even though the waitrequest signal is asserted. I verified in ModelSim that the waitrequest signal works properly. I tested it with the following:

c = a * a + b * b;
alt_write_word(PERIPHERAL_ADDRESS_A, a); //The address of the first input register and the first value
alt_write_word(PERIPHERAL_ADDRESS_B, b); //The address of the second input register and the second value
alt_write_word(PERIPHERAL_ADDRESS_C, 1); //The address of the control register and the start command
counter = 0;
while(c != d)
{
    d = alt_read_word(PERIPHERAL_ADDRESS_D); //Reading the result register
    ++counter;
}

It always took 2 iterations for the correct value to be returned. The question is, why does it happen this way? Is the problem in the waitrequest signal or in the alt_read_word macro? Is there some caching mechanism that I am unaware of? How can I ensure I'm getting the correct result?

At first, I thought of using an interrupt to return the processed data to the software, but it should not be necessary if the read operation is a blocking one. Should I use the interrupt then?

Just so you know, this is just a test to learn how to design and integrate custom peripherals. The real one that I need to develop will be more complicated, taking even thousands of clock cycles to complete.

Thank you and good work to everyone.

Altera_Forum · 12-03-2014

This sounds like a hardware issue, and if I had to guess it's probably the logic that captures the result. If you attach the RTL we can probably point out where the issue is.

Also, if you plan on eventually having a fully pipelined block where you keep stuffing data into it and reading out the results, then I recommend switching to what I call a "latency aware" slave port. Slave ports that are latency aware either have a fixed read latency or use the readdatavalid signal, which allows a master to issue multiple reads while the slave port responds to them as they come in. Using only waitrequest has the drawback that it prevents the master from issuing more reads while it's waiting for the read data from another transaction to return.

Altera_Forum · 12-04-2014

Thanks, BadOmen, for replying.

As requested, here follows the code for the Avalon Interface:


module sum_of_squares_inst
(
	input clk,
	input rst,
	input write,
	input read,
	input [1:0] addr,
	input [31:0] writedata,
	output reg waitrequest,
	output reg [31:0] readdata
);
reg [31:0] a, b;
wire [31:0] r;
reg start;
wire done, clk_en;
sum_of_squares_control sosctrl
(
	.clk(clk),
	.clk_en(clk_en),
	.rst(rst),
	.start(start),
	.done(done)
);
sum_of_squares sos
(
	.clk(clk),
	.clk_en(clk_en),
	.a(a),
	.b(b),
	.r(r)
);
initial begin
	a <= 32'h00000000;
	b <= 32'h00000000;
	readdata <= 32'h00000000;
	start <= 0;
	waitrequest <= 0;
end
always @(posedge clk) begin
	if(rst) begin
		a <= 32'h00000000;
		b <= 32'h00000000;
		readdata <= 32'h00000000;
		start <= 0;
	end else begin
		if(done & !start) //If there is no process working, the module is ready to be read or written
			waitrequest <= 0;
		if(write) begin
			case(addr)
				2'b00: begin // Write to the first register
					a <= writedata;
					start <= 0;
				end
				2'b01: begin // Write to the second register
					b <= writedata;
					start <= 0;
				end
				2'b10: begin // Write to the start register
					if(writedata) begin
						start <= 1; // One cycle wide control signal
						waitrequest <= 1; // Wait for the process to finish
					end else
						start <= 0;
				end
				default: begin
					a <= 32'h00000000;
					b <= 32'h00000000;
					start <= 0;
				end		
			endcase
		end else if(read) begin
			case(addr)
				2'b00: // Read the result
					readdata <= r;
				default:
					readdata <= readdata;
			endcase
		end else
			start <= 0;
	end
end
endmodule

The other two modules are very simple, so I won't post their code here. This is what they do:

sum_of_squares is the arithmetic unit. It squares the two register values in parallel using FP multipliers, then sums the results. Multiplication takes 5 clocks, the sum takes 7, plus 1 to save the result in the result register, for a fixed total latency of 13 clocks.

sum_of_squares_control is an FSM that receives the start control signal, then generates the clock-enable signal (clk_en) and counts the 13 clocks needed to produce the expected result and save it in the result register. It also controls the done signal, which indicates whether the unit is working or idle.

I can't generate the simulation right now, because I am running Linux and ModelSim has been crashing since I upgraded Quartus to 14.0.2. Later, I'll run the Windows version and attach the simulation, but as I said before, the waitrequest signal rises and falls when it should.

In the future, the final module will be much more complex, so a streaming interface seems a better fit. If you can point me to some material or a reference design to help me with this, I'll be grateful.

Thank you again for the help!

Regards,

Altera_Forum · 12-05-2014

The problem is most likely that you are asserting waitrequest too late. Since waitrequest is a register and you assign it 1'b1 in a clocked process, waitrequest is still low during the command phase of the read, so the read is accepted, and only on the next clock edge does waitrequest go high. By the time it's high the read is over, and that is what causes you to get out of sync.

The fix should be as simple as something like this:

1) Rename "waitrequest" to busy (keep it a register though)

2) Create a new waitrequest but just make it an output wire instead of a register

3) Assign waitrequest like this: assign waitrequest = read | busy;

What this does is make sure you cover the cycle when the read is being issued.
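
To make the timing argument above easier to follow, here is a small cycle-level model in C. It is only a sketch under simplifying assumptions: one read, a result that becomes valid at a given cycle, and a transfer that completes in the first cycle where read is high and waitrequest is low. The function and constant names are made up for illustration, and the combinational variant simply drives waitrequest from the busy state, which is a simplification rather than the exact expression in step 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STALE_VALUE 0x00000000u  /* what the result register holds while busy */
#define FRESH_VALUE 0x41C80000u  /* bit pattern of 25.0f, the finished result */

typedef struct {
    int      accept_cycle;  /* cycle in which the read was accepted, -1 if never */
    uint32_t data;          /* value the master captured */
} read_result_t;

/* registered_wait = true : waitrequest is a register that goes high only in the
 *                          cycle after the slave sees a read while busy.
 * registered_wait = false: waitrequest follows the busy state combinationally. */
static read_result_t simulate_read(bool registered_wait, int issue_cycle, int ready_cycle)
{
    assert(issue_cycle >= 0 && ready_cycle >= 0);
    bool wait_reg = false;               /* registered waitrequest, reset low */
    read_result_t res = { -1, 0u };

    for (int cycle = 0; cycle < issue_cycle + ready_cycle + 4; ++cycle) {
        bool read = cycle >= issue_cycle;    /* master holds read until accepted */
        bool busy = cycle < ready_cycle;     /* computation still running */
        bool waitrequest = registered_wait ? wait_reg : busy;

        if (read && !waitrequest) {          /* transfer completes in this cycle */
            res.accept_cycle = cycle;
            res.data = busy ? STALE_VALUE : FRESH_VALUE;
            return res;
        }
        wait_reg = read && busy;             /* clocked update, visible next cycle */
    }
    return res;
}
```

With a read presented at cycle 3 and a result ready at cycle 13, the registered variant accepts the read at cycle 3 and returns the stale value, while the combinational variant holds the master off until cycle 13.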

All that said, I wouldn't implement the hardware like this. What I would do instead is add a readdatavalid signal and just pipeline that with the result, and not use waitrequest to throttle the transaction. That way, if your hardware is fully pipelined, you can have this logic handling multiple reads in flight.

Another alternative that should work if you want to stick to using waitrequest is to keep waitrequest high when there are no transactions in flight and only strobe it low when a result is ready. I'm told the Qsys fabric should not care if your component asserts waitrequest while your logic is not being accessed, so that would also get around the problem of your waitrequest transitioning too late. Still, I would go with a more pipelined approach since that's a closer fit to a streaming approach where data is constantly pouring in and out of the logic.

*Just read your last post: if the hardware is always going to have a fixed latency, then you can declare this in the .tcl file, and all you have to do is make sure the result shows up on readdata 'x' cycles after the read is accepted, without bothering to backpressure with waitrequest. This assumes your logic can handle back-to-back read requests; if your logic really is fixed latency and pipelined, then this should be the case.
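
As a rough illustration of what "pipeline readdatavalid with the result" means, here is a software model in C, not RTL. It assumes a 13-stage pipeline (the latency mentioned earlier in the thread), one accepted read per cycle, and a valid bit that travels alongside the data; all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PIPE_DEPTH 13  /* assumed latency, taken from the 13 clocks in the thread */

typedef struct {
    bool     valid[PIPE_DEPTH];  /* readdatavalid travelling with the data */
    uint32_t data[PIPE_DEPTH];
} pipe_t;

static void pipe_init(pipe_t *p) { memset(p, 0, sizeof *p); }

/* Advance one clock. If `read` is set, a request carrying `data_in` enters the
 * first stage. Returns readdatavalid and writes the last stage to *readdata. */
static bool pipe_clock(pipe_t *p, bool read, uint32_t data_in, uint32_t *readdata)
{
    assert(p != NULL && readdata != NULL);
    bool out_valid = p->valid[PIPE_DEPTH - 1];
    *readdata = p->data[PIPE_DEPTH - 1];
    for (int i = PIPE_DEPTH - 1; i > 0; --i) {
        p->valid[i] = p->valid[i - 1];
        p->data[i]  = p->data[i - 1];
    }
    p->valid[0] = read;
    p->data[0]  = data_in;
    return out_valid;
}

/* Issue n back-to-back reads from cycle 0 and record when each response
 * returns. Returns the number of responses seen within max_cycles. */
static int run_back_to_back(int n, int resp_cycle[], uint32_t resp_data[], int max_cycles)
{
    pipe_t p;
    int seen = 0;
    pipe_init(&p);
    for (int cycle = 0; cycle < max_cycles && seen < n; ++cycle) {
        uint32_t rd;
        bool issue = cycle < n;  /* one new read per cycle, never stalled */
        if (pipe_clock(&p, issue, 100u + (uint32_t)cycle, &rd)) {
            resp_cycle[seen] = cycle;
            resp_data[seen]  = rd;
            ++seen;
        }
    }
    return seen;
}
```

Three reads issued at cycles 0, 1 and 2 come back at cycles 13, 14 and 15 in order, so the master never waits for one read to finish before issuing the next.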

Altera_Forum · 12-17-2014

Thanks again, BadOmen. I am a little busy with other matters; as soon as I'm finished I'll test your suggestions. This is the first time I'm dealing with Avalon. The waitrequest approach seemed to be the way to achieve the kind of synchronization that I need, but it seems that your suggestion to pipeline the operations is even better. In fact, it should fit the requirements of the final module. Could you point me to some reference on how to do the pipelining you suggested? The final design will have variable latency, so it would be good for me to learn how to fine-tune the transaction.

Best regards and happy holidays.

Altera_Forum · 12-17-2014

For fixed latency, if you generate something like an on-chip RAM or PIO you'll find that they have fixed latency. They might lack a waitrequest line because they never backpressure, though. Make sure to specify the latency in component editor if it's fixed, so that Qsys knows how much buffering the fabric needs to support the readdata returning. If your component doesn't have a fixed latency, then I recommend adding readdatavalid to make it a variable latency slave.

For variable latency I don't recall any simple ones, and the only IP that comes to mind are the high speed memory controllers. Those would be too complicated to learn from, so I recommend looking at the Avalon specification, since it has timing diagrams showing how each slave type should react to an incoming Avalon transaction. With variable latency, be sure to specify the maximum pending reads for your component when you feed it through component editor. This value is how many read transactions can be in flight in your IP before it starts issuing waitrequest; it lets Qsys know how much buffering it should put into the fabric to handle the readdata that returns.
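
To show what the maximum pending reads setting implies for the slave's behaviour, here is a small C model (again a sketch, not RTL). It assumes a limit of 4 reads in flight, in-order responses, and a per-read latency supplied by the caller; the names are illustrative and do not come from any Altera IP.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PENDING 4  /* assumed value of the "maximum pending reads" setting */

typedef struct {
    int done_cycle[MAX_PENDING];  /* ring buffer: cycle at which each read completes */
    int head, count;
} slave_t;

/* waitrequest is high once no further read can be tracked. */
static bool slave_waitrequest(const slave_t *s) { return s->count == MAX_PENDING; }

/* One clock: retire the oldest read if it is due (responses stay in order),
 * then accept a new read if one is offered and waitrequest was low.
 * Returns readdatavalid; *accepted reports whether the read was taken. */
static bool slave_clock(slave_t *s, int cycle, bool read, int latency, bool *accepted)
{
    assert(latency >= 1);
    bool wait  = slave_waitrequest(s);  /* state as seen before this clock edge */
    bool valid = false;

    if (s->count > 0 && s->done_cycle[s->head] <= cycle) {
        s->head = (s->head + 1) % MAX_PENDING;
        s->count--;
        valid = true;
    }
    *accepted = read && !wait;
    if (*accepted) {
        int tail = (s->head + s->count) % MAX_PENDING;
        s->done_cycle[tail] = cycle + latency;
        s->count++;
    }
    return valid;
}

/* Issue n_reads back to back; returns how many cycles the master was stalled. */
static int count_stalls(int n_reads, int latency, int *responses)
{
    slave_t s = { {0}, 0, 0 };
    int issued = 0, stalls = 0;
    *responses = 0;
    for (int cycle = 0; *responses < n_reads && cycle < 1000; ++cycle) {
        bool want = issued < n_reads;
        bool accepted;
        if (slave_clock(&s, cycle, want, latency, &accepted)) ++*responses;
        if (accepted) ++issued;
        else if (want) ++stalls;
    }
    return stalls;
}
```

In this model, four reads with a latency of 10 cycles go through without any stall, while six reads stall the master for 7 cycles until the first response frees a slot.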