Is it possible to make the code faster via pipelining?

Altera_Forum · ‎08-16-2011

Hi there everyone. I'm trying to implement both the TEA and XTEA algorithms to compare them. I have a complete, working VHDL implementation of TEA, as follows:

Library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity TEA_en is
port(
    clock: in std_logic;                          --clock input
    input_data: in std_logic_vector (63 downto 0);    --input data
    key : in std_logic_vector (127 downto 0);    --secret key 127 downto 0--
    encrypted_data: out std_logic_vector (63 downto 0)    --output/encrypted data
);
end entity TEA_en;
architecture behave of TEA_en is
--declare signals
signal Key0, Key1, Key2, Key3 : std_logic_vector (31 downto 0);
signal Z, Y : std_logic_vector (31 downto 0);
signal count :integer :=0;
begin
--separate key into four parts
Key0<=key(127 downto 96);
Key1<=key(95 downto 64);
Key2<=key(63 downto 32);
Key3<=key(31 downto 0);
Process(Input_data, clock)
--declare and initialize variable
    Variable delta: std_logic_vector (31 downto 0):=x"9e3779b9";
    Variable sum: std_logic_vector (31 downto 0):=x"00000000";
    Variable Zeq,Yeq,Z,Y: std_logic_vector (31 downto 0);
    Begin
        If(rising_edge(clock)) then
            If (count<1) then            --separate input data into two parts
            Z:=input_data(63 downto 32);    --part 1 (32bits)
            Y:=input_data(31 downto 0);        --part 2 (32bits)
            Else --null;
            End if;
            If (count<32) then
            --Encryption routine algorithms
            sum:=sum+delta;
            --Calculate Y
            Zeq:=( (Z(27 downto 0) & "0000")+Key0) xor    --left shift 4 bits and sum to secret key1
                 (Z+ sum) xor                              --Z add to sum
                 (("00000" & Z(31 downto 5))+Key1);        --right shift 5 bits and sum to key2
            Y:=Y+Zeq;
            --Calculate Z
            Yeq:=( (Y(27 downto 0) & "0000")+Key2) xor    --left shift 4 bits and sum to secret key1
                 (Y+ sum) xor                              --Z add to sum
                 (("00000" & Y(31 downto 5))+Key3);        --right shift 5 bits and sum to key2
            Z:=Z+Yeq;
            --Output encrypted data
            Encrypted_data<=Y&Z;
            else
            end if;
        count<=count+1;    --increase value of count
        End if;
    End process;
end architecture behave;

The code above runs at a clock period of 15 ns and gives the correct results. Is there any way to make it run at a shorter clock period? How can I implement pipelining, or other methods, to improve the speed of operation? (I tried breaking up the Zeq and Yeq calculations using signals as pipeline stages, but it does not give me the correct results.) The code is:


signal pipeline_0,pipeline_1,pipeline_2,pipeline_3,pipeline_4,pipeline_5: std_logic_vector(31 downto 0);
pipeline_0<=((Z(27 downto 0) & "0000")+Key0);
pipeline_1<=(Z+ sum);
pipeline_2<=(("00000" &  Z(31 downto 5))+Key1);
pipeline_3<=( (Y(27 downto 0) & "0000")+Key2);
pipeline_4<=(Y+ sum);
pipeline_5<=(("00000" &  Y(31 downto 5))+Key3);
Zeq:= pipeline_0 xor pipeline_1 xor pipeline_2;
Yeq:= pipeline_3 xor pipeline_4 xor pipeline_5;

Any suggestions? Is there any way to improve resource usage and speed of operation, or am I doing anything redundant? Any experts out there, please do help! Thanks in advance!

Altera_Forum · ‎08-16-2011

You need a major redesign if you want to achieve high speed.

You have a long combinatorial path, e.g. from input to Z to Zeq to Y to Yeq.

Your pipelining is ok but only partial.

The change of function (what you call wrong results) comes from the extra delay a pipeline register adds to one path; the other paths need to be balanced to match after inserting it.

Other notes:

Your counter seems unconstrained, counting up from 0 indefinitely.

Your + operator is applied to std_logic_vector (how was it accepted by the compiler?).
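On the + operator: the posted code compiles because the non-standard package ieee.std_logic_unsigned defines "+" for std_logic_vector. The standard package ieee.numeric_std defines it only for the unsigned and signed types, so the operands have to be converted. A minimal sketch of that conversion follows; the entity name add32 and its ports are made up for illustration and are not part of the original design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper entity, for illustration only.
entity add32 is
  port (
    a, b : in  std_logic_vector(31 downto 0);
    s    : out std_logic_vector(31 downto 0)
  );
end entity add32;

architecture rtl of add32 is
begin
  -- numeric_std provides "+" for unsigned/signed, not std_logic_vector:
  -- convert the operands, add, and convert the result back.
  -- The result has the width of the operands, so it wraps modulo 2**32.
  s <= std_logic_vector(unsigned(a) + unsigned(b));
end architecture rtl;
```

An alternative with the same effect would be to declare the internal signals and variables as unsigned and convert only at the ports.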

Altera_Forum · ‎08-16-2011

Hi kaz, thank you so much for the comment, I appreciate it! You mentioned that I have to balance the pipeline. Sorry, but I don't quite understand how I can do so. I'll redesign the long combinational path as you suggested.

On the other hand, will constraining the counter help achieve a faster speed? Also, it seems the code works when I use "use ieee.std_logic_unsigned.all;" but not when I use "use ieee.numeric_std.all". With the latter package the following error appears:

Error (10327): VHDL error at TEA_en.vhd(46): can't determine definition of operator ""+"" -- found 0 possible definitions

You also mentioned that I only have partial pipelining. How do I achieve full pipelining? The system block diagram can be seen at the following link (in fact I used the C code there to model my system):

http://en.wikipedia.org/wiki/tiny_encryption_algorithm

Again, thank you for your ideas and comments! They are a great help! :)

Altera_Forum · ‎08-16-2011

First, the counter needs to be constrained; otherwise it defaults to the full 32-bit integer range and won't come back to zero until it has counted that far, which is not what you want.

For pipeline balance you need to match the delay caused by each register, so that you add or xor the data, etc., as originally designed but with a delay. For example:

if A = A1+A2 and

C = A + B

then if you delay A you should delay B equally so that A matches B to get correct C.
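The A/B/C example above might be sketched as follows. This is only an illustration of the balancing idea: the entity name balance_demo, the port names, and the 32-bit width are assumptions, and numeric_std's unsigned type is used for the arithmetic.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: computes C = (A1 + A2) + B in two stages.
entity balance_demo is
  port (
    clock     : in  std_logic;
    a1, a2, b : in  unsigned(31 downto 0);
    c         : out unsigned(31 downto 0)
  );
end entity balance_demo;

architecture rtl of balance_demo is
  signal a_reg : unsigned(31 downto 0);  -- A = A1 + A2, registered
  signal b_reg : unsigned(31 downto 0);  -- B delayed by one clock to match A
begin
  process (clock)
  begin
    if rising_edge(clock) then
      -- Stage 1: register the partial sum and delay B by the same amount.
      a_reg <= a1 + a2;
      b_reg <= b;
      -- Stage 2: a_reg and b_reg now belong to the same input sample.
      c <= a_reg + b_reg;
    end if;
  end process;
end architecture rtl;
```

Without b_reg, stage 2 would add the A of one sample to the B of the next sample, which is the kind of "wrong result" an unbalanced pipeline produces. With it, C is correct but appears two clocks after its inputs.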

I said your pipeline is incomplete because the path from input to Z to Zeq to Y is all combinatorial. You only pipelined the computation of Z.

Maybe the best way for you is to use signals instead of variables, as that forces a pipeline register per assignment, and then all you need to do is balance the delays.

Altera_Forum · ‎08-16-2011

Hi kaz! Thank you for your prompt reply! After constraining the counter as you suggested, the clock period is now 14 ns, an improvement of 1 ns! You're a genius! hahaha :) As for the delay balancing: does Quartus support this? How can I balance the delay when I don't know the delay of the components? Does Quartus support the "after 3ns" command? Is there any way you know of to implement the delay balancing in code? Thanks for suggesting the use of signals. I'm looking into it, and thanks again for all the advice!

Altera_Forum · ‎08-16-2011

I don't expect constraining the counter to improve speed much, as the main bottleneck is your long combinatorial path.

By delay I mean one clock period per register, so you don't need Quartus to tell you. If you register a node and it is a signal, it is updated one clock later. If it is a variable, it is updated without any clock delay.

Use of variables is a bit tricky. Normally a variable implies no register, i.e. a combinatorial section within a clocked process. But there is one exception: if a variable is read before it is assigned, the compiler understands that you want to keep its value from the end of the previous pass through the process, so it creates a register and the variable becomes equivalent to a signal... rather confusing.
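A small sketch of the three cases described above may help. Everything in it (the entity var_vs_sig, the 8-bit ports, the names s, v and acc) is invented for illustration; the comments describe the usual simulation behaviour and what a synthesis tool would typically infer.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity contrasting signals and variables.
entity var_vs_sig is
  port (
    clock : in  std_logic;
    d     : in  unsigned(7 downto 0);
    q_sig : out unsigned(7 downto 0);
    q_var : out unsigned(7 downto 0);
    q_acc : out unsigned(7 downto 0)
  );
end entity var_vs_sig;

architecture rtl of var_vs_sig is
  signal s : unsigned(7 downto 0) := (others => '0');
begin
  process (clock)
    variable v   : unsigned(7 downto 0);
    variable acc : unsigned(7 downto 0) := (others => '0');
  begin
    if rising_edge(clock) then
      -- Signal: s is a register, and q_sig reads the OLD value of s,
      -- so there are two registers between d and q_sig.
      s     <= d + 1;
      q_sig <= s;

      -- Variable assigned before it is read: v is just a wire,
      -- so there is one register between d and q_var.
      v     := d + 1;
      q_var <= v;

      -- Variable read before it is assigned: acc has to remember its
      -- value from the previous clock, so it needs storage of its own.
      acc   := acc + d;
      q_acc <= acc;
    end if;
  end process;
end architecture rtl;
```

In the TEA code, sum, Z and Y fall into the third case (they carry state from one round to the next), while Zeq and Yeq fall into the second.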

Altera_Forum · ‎08-16-2011

Hi again kaz, thanks for your previous reply. I tried to pipeline the bottleneck using the following code, but it doesn't give me the correct results even after I perform pipeline balancing:

Library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity TEA_en is 
port(
    clock: in std_logic;                          --clock input
    input_data: in std_logic_vector (63 downto 0);    --input data
    key : in std_logic_vector (127 downto 0);    --secret key 127 downto 0--
    encrypted_data: out std_logic_vector (63 downto 0)    --output/encrypted data
);
end entity TEA_en;
architecture behave of TEA_en is
--declare signals
signal Key0, Key1, Key2, Key3 : std_logic_vector (31 downto 0);
signal Z, Y : std_logic_vector (31 downto 0);
signal count :integer range 0 to 32:=0;            --constrained counter (counts from 0 to 32) improves throughput by 1 ns!
signal pipeline_0,pipeline_1,pipeline_2,pipeline_3,pipeline_4,pipeline_5,pipeline_6,pipeline_7:std_logic_vector(31 downto 0);
begin
--separate key into four parts
Key0<=key(127 downto 96);
Key1<=key(95 downto 64);
Key2<=key(63 downto 32);
Key3<=key(31 downto 0);
Process(Input_data, clock)
--declare and initialize variable
    Variable delta: std_logic_vector (31 downto 0):=x"9e3779b9";
    Variable sum: std_logic_vector (31 downto 0):=x"00000000";
    Variable Zeq,Yeq,Z,Y: std_logic_vector (31 downto 0);
    Begin
        If(rising_edge(clock)) then
            If (count<1) then            --separate input data into two parts
            Z:=input_data(63 downto 32);    --part 1 (32bits)
            Y:=input_data(31 downto 0);        --part 2 (32bits)
            Else --null;
            End if;
            If (count<32) then
            --Encryption routine algorithms
            sum:=sum+delta;
            
            for i in 1 to 8 loop
                case i is
                when 1=>        pipeline_0    <=( (Z(27 downto 0) & "0000")+Key0);        
                when 2=>        pipeline_1    <=(Z+ sum)    ;
                when 3=>        pipeline_2    <=(("00000" &  Z(31 downto 5))+Key1);
                when 4=>        pipeline_3    <=( (Y(27 downto 0) & "0000")+Key2)    ;
                when 5=>        pipeline_4    <=(Y+ sum)    ;
                when 6=>        pipeline_5    <=(("00000" &  Y(31 downto 5))+Key3)    ;    
                when 7=>        pipeline_6    <=pipeline_0 xor pipeline_1 xor pipeline_2;
                when 8=>        pipeline_7    <=pipeline_3 xor pipeline_4 xor pipeline_5;
                when others=>     null;
                end case;                        
            end loop;
        
            
            
            Y:=Y+pipeline_6;
            Z:=Z+pipeline_7;
            --Output encrypted data
            Encrypted_data<=Y&Z;
            else null;
            end if;
        count<=count+1;    --increase value of count     
        End if;
        
    End process;
end architecture behave;

Did I do anything wrong in the process? Many thanks again for the help! ^^

Altera_Forum · ‎08-16-2011

This is where a good testbench, and time spent sitting down and debugging your code, will help.

Altera_Forum · ‎08-16-2011

Indeed, as Tricky said, you need time and testing to finalise it.

But I note that your counter, though constrained in its declaration, is not constrained in the actual logic; you must define, in the counter logic itself, the value at which it returns to zero.
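As a sketch of what constraining the counter in the logic could look like, here is a stand-alone counter with an explicit wrap. The entity round_counter and its first output are hypothetical, and wrapping from 32 back to 0 is an assumption about the intended behaviour (a new block every 33 clocks), not something stated in the posted code.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical stand-alone round counter, for illustration only.
entity round_counter is
  port (
    clock : in  std_logic;
    first : out std_logic   -- high while count = 0, when a new block may be loaded
  );
end entity round_counter;

architecture rtl of round_counter is
  signal count : integer range 0 to 32 := 0;
begin
  process (clock)
  begin
    if rising_edge(clock) then
      if count = 32 then
        count <= 0;          -- explicit wrap: count never leaves its declared range
      else
        count <= count + 1;
      end if;
    end if;
  end process;

  first <= '1' when count = 0 else '0';
end architecture rtl;
```

With only "count<=count+1" and a range of 0 to 32, a simulator would normally stop with a range error once the counter tries to go past 32. If the counter wraps like this in the TEA design, the sum variable would presumably also have to be returned to zero at the same point.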

You don't need the loop statement, as it does nothing: each assignment inside the loop is made once per clock anyway.

Maybe you can do your testing this way:

Instantiate your first working VHDL module and then the new one. Give them the same inputs and compare the outputs until they are the same, with only the delay being different.
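A testbench along those lines might look like the sketch below. It assumes the new design has been renamed to TEA_en_pipe (both versions posted in this thread are called TEA_en, so one of them has to change), that both share the same ports, and that 100 clocks is enough for both outputs to settle. The test vector, the key and the 20 ns clock period are arbitrary values chosen for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity tea_compare_tb is
end entity tea_compare_tb;

architecture sim of tea_compare_tb is
  constant CLK_PERIOD  : time    := 20 ns;  -- arbitrary simulation clock
  constant WAIT_CLOCKS : integer := 100;    -- assumed to cover both latencies

  signal clock   : std_logic := '0';
  signal done    : boolean   := false;
  signal data_in : std_logic_vector(63 downto 0)  := x"0123456789ABCDEF";
  signal key     : std_logic_vector(127 downto 0) := x"00112233445566778899AABBCCDDEEFF";
  signal out_ref : std_logic_vector(63 downto 0);
  signal out_new : std_logic_vector(63 downto 0);
begin
  -- Free-running clock that stops once the check has finished.
  clock <= not clock after CLK_PERIOD / 2 when not done else '0';

  -- Reference: the original, working design.
  ref : entity work.TEA_en
    port map (clock => clock, input_data => data_in,
              key => key, encrypted_data => out_ref);

  -- Design under test: hypothetical renamed copy of the pipelined version.
  dut : entity work.TEA_en_pipe
    port map (clock => clock, input_data => data_in,
              key => key, encrypted_data => out_new);

  check : process
  begin
    -- Let both designs run until their outputs have stopped changing.
    for i in 1 to WAIT_CLOCKS loop
      wait until rising_edge(clock);
    end loop;
    assert out_new = out_ref
      report "pipelined output differs from reference output"
      severity error;
    report "comparison finished";
    done <= true;
    wait;
  end process;
end architecture sim;
```

Comparing only the final values sidesteps the latency difference. To find where the two designs diverge, the intermediate values of Y and Z in both could be inspected round by round in the waveform viewer.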