Let's Read hbird_e203 CPU Together

EleCannonic

Copyright Notice:

This article is licensed under CC BY-NC-SA 4.0.

Licensing Info:

Commercial use of this content is strictly prohibited. For more details on licensing policy, please visit the About page.

Note: This article is merely a thought process. Please do not use it as a handout, as it does not provide enough information for systematic learning. Using it as a reference is OK, though.

As a starting point for diving deep into the world of CPUs,
hbird_e203 is a good project.

The following is the project link: Hummingbird E203

1. Instruction Set Architecture (ISA)

hbird_e203 implements the RISC-V ISA,
which has a deliberately simple design.
Instructions in RISC-V follow a small set of strictly defined formats.

Format of RISC-V Instructions

RISC-V only supports little-endian.
What are little-endian and big-endian?
Well, let’s use a figure to illustrate:

Interpretation of Little-Endian and Big-Endian

For the 32-bit value 0x12345678 in the figure,
a little-endian machine places byte 0x78 at the lowest address,
so reading the bytes in address order gives 78 56 34 12.
A big-endian machine stores them as 12 34 56 78.

2. Standard DFF Registers

hbird_e203 constructs its registers from a modularized, standard DFF module,
instead of writing raw always blocks.

wire flg_r; // output signal
wire flg_nxt = ~flg_r; // input signal
wire flg_ena = (ptr == (`E203_OITF_DEPTH-1) & ptr_ena);

// instantiating standard DFF modules
sirv_gnrl_dfflr #(1) flg_dfflrs(flg_ena, flg_nxt, flg_r, clk, rst_n);

The sirv_gnrl_dfflr module itself is defined in another file as follows:

module sirv_gnrl_dfflr # (
  parameter DW = 32 // default width is 32 bits; the width can be varied by overriding parameter DW
) (

  input           lden,
  input  [DW-1:0] dnxt,
  output [DW-1:0] qout,

  input           clk,  // clock
  input           rst_n // active-low asynchronous reset
);

reg [DW-1:0] qout_r;

always @(posedge clk or negedge rst_n)
begin : DFFLR_PROC // DFFLR_PROC is just a label of this always block; it has no effect on the function.
  if (rst_n == 1'b0)
    qout_r <= {DW{1'b0}}; // reset: clear the output to all zeros
  else if (lden == 1'b1)
    qout_r <= #1 dnxt;
    // #1: a delay of one time unit. It has no effect in synthesis; it only eases debugging in simulation.
    // dnxt: the data to be loaded
end

assign qout = qout_r;

// introduce a checker under conditional compilation.
// It is only used in simulation and produces no hardware circuit.
`ifndef FPGA_SOURCE//{
`ifndef DISABLE_SV_ASSERTION//{
//synopsys translate_off
sirv_gnrl_xchecker # (
  .DW(1)
) sirv_gnrl_xchecker(
  .i_dat(lden),
  .clk  (clk)
);
//synopsys translate_on
`endif//}
`endif//}

endmodule

With standard modules,
it is convenient to replace the register type or insert delays globally.
The xchecker module checks the load-enable for the undefined X state.
Once an X is detected, it reports an error and aborts the simulation.
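
As an illustration of how such a standard DFF is used, here is a minimal sketch of a 4-bit counter register built with sirv_gnrl_dfflr instead of a hand-written always block. The signal names (fetch_go, clk, rst_n) are assumed for this example and are not taken from the E203 source.

// a minimal usage sketch, hypothetical signal names
wire [3:0] ptr_r;                  // registered counter value (DFF output)
wire [3:0] ptr_nxt = ptr_r + 4'd1; // next value: simple increment (DFF input)
wire       ptr_ena = fetch_go;     // load enable, assumed to come from some control logic

// a 4-bit wide instance of the standard DFF with load enable
sirv_gnrl_dfflr #(4) ptr_dfflr(ptr_ena, ptr_nxt, ptr_r, clk, rst_n);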

3. if-else and assign

This project recommends replacing if-else with assign,
because if-else has two major drawbacks:

  • if-else cannot propagate the undefined state X.

    if (flg)
        out = in1;
    else
        out = in2;

    If flg is X, Verilog simulation treats the condition as false,
    so the final output is out = in2,
    and the X state is not propagated.

    However, if assign is used:

    assign out = flg ? in1 : in2;

    The X state will be propagated to out,
    which makes debugging easier.

  • if-else is synthesised as a priority MUX,
    which costs more area and hurts timing.
    Let’s take the MUX below as an example:

    if (sel1)
        out = in1[3:0];
    else if (sel2)
        out = in2[3:0];
    else if (sel3)
        out = in3[3:0];
    else
        out = 4'b0;

    After synthesis, this code turns into:

    Priority MUX

Three cascaded MUXes obviously occupy a larger area.
But if we use assign:

assign out = ({4{sel1}} & in1[3:0])
           | ({4{sel2}} & in2[3:0])
           | ({4{sel3}} & in3[3:0]);

This is a parallel, AND-OR (gating) MUX.
The sel signals act as gating controls applied in parallel to the three in signals.
Note that this form assumes the sel signals are one-hot:
if more than one is asserted, the gated results are ORed together.
It will be synthesised as

Parallel MUX

4. Data Hazard

  • RAW (Read After Write)
    Suppose instruction j needs an operand
    that is produced by instruction i.
    Hence, the write-back (WB) of i must complete before j reads the register.

    For example:

    i: ADD x1, x2, x3 ; (x2 + x3 -> x1)
    j: SUB x4, x1, x5 ; (x1 - x5 -> x4)

    In a pipeline,
    when j is in the ID stage,
    i may still be in the EX stage,
    so its result has not been written into the register file yet.
    In this situation, j would read a stale operand.

    To solve the problem,
    the pipeline can stall the following instructions
    until the WB of i completes.
    But the more common method is data forwarding:
    the CPU sends the result of i directly from the EX or MEM stage to j,
    instead of waiting for WB.
    This is more efficient than stalling
    (a rough sketch of such a forwarding mux follows this list).

  • WAR (Write After Read)
    Instruction j tries to write a register,
    but an earlier instruction i still needs to read the operand in that register.
    The read of i must complete before the write of j.

    Example:

    i: SUB x4, x1, x5  ; read x1
    j: ADD x1, x2, x3 ; write x1
    k: MUL x6, x1, x7 ; read x1

    If the pipeline is in-order,
    there is no problem.
    However, in an out-of-order pipeline,
    if x2 and x3 are ready early,
    j may finish before i.
    Then i reads the wrong value and produces a wrong result.

    To solve this, the CPU renames the registers:

    i: SUB x4, P1, x5   ; // P1 for old x1 value
    j: ADD P2, x2, x3 ; // P2 for new x1 value, unrelated with P1
    k: MUL x6, P2, x7 ; // Use new value

    To implement renaming,
    the CPU maintains a mapping from the architectural (ISA) registers to a larger set of internal physical registers.
    The write and the read then no longer interfere with each other.

  • WAW (Write After Write)
    Two instructions, i and j,
    both write a result into the same register.
    The correct order is i first and j second.
    Like WAR, WAW is only a problem in an out-of-order pipeline:
    if j finishes first, i's write lands last,
    so the register ends up holding the result of i, which is wrong.

    The solution is again register renaming.
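
As a rough illustration of data forwarding (this is not E203's actual implementation; all signal names here are invented), the operand read in the decode stage can be replaced by a newer in-flight result when the destination index of that result matches the source index being read:

// a minimal sketch of a forwarding (bypass) mux; names are invented for illustration
module fwd_mux_sketch(
  input         ex_wb_valid,  // the instruction in EX has a valid result
  input  [4:0]  ex_wb_idx,    // destination register index of that result
  input  [31:0] ex_wb_dat,    // the result itself
  input  [4:0]  rs1_idx,      // source register index needed in decode
  input  [31:0] rf_rs1_dat,   // value read from the register file (possibly stale)
  output [31:0] rs1_dat       // operand actually used by the decode stage
);
  // forward when the in-flight result targets the register about to be read (x0 never forwards)
  wire fwd_hit = ex_wb_valid & (ex_wb_idx == rs1_idx) & (rs1_idx != 5'd0);
  assign rs1_dat = fwd_hit ? ex_wb_dat : rf_rs1_dat;
endmodule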

5. Instruction Fetch (IF)

The ultimate goal of IF is to fetch instructions ‘fast’ and ‘continuously’.

ITCM

To make IF faster,
we need to reduce the read latency of the memory.
A general-purpose memory may have a latency of dozens of clock cycles,
which is far from meeting the requirement.

In general, a modern CPU adds a small memory (dozens of KB)
dedicated to storing instructions,
placed physically close to the core.
This memory is called the ITCM (Instruction Tightly Coupled Memory).

The ITCM is neither DDR nor a cache.
It is just a small memory mapped to a specific address range,
and its latency is predictable compared with a cache.
Hence, in situations with high performance requirements,
engineers prefer to use an ITCM.

Non-aligned Instructions

RISC-V supports compressed instructions (the C extension),
so the CPU has to deal with a mixture of 32-bit and 16-bit instructions.
So how does the CPU know whether an instruction is 32-bit or 16-bit?

The least significant two bits of the opcode for a 32-bit RISC-V instruction must be 0b11.

The CPU distinguishes the instructions by the least significant two bits (LS2B below).
If the LS2B is 0b11, the instruction is 32-bit;
if not, it is 16-bit.
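
In hardware this check is just a comparison on the two lowest bits; a minimal sketch (the signal name instr is assumed here):

// a minimal sketch: detect a 32-bit instruction from its lowest two bits
wire is_rv32 = (instr[1:0] == 2'b11); // 2'b11 -> 32-bit, anything else -> 16-bit compressed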

So how does the CPU deal with them?
Let’s clarify the flow in detail.

  • Components

    • Fetch Width: for efficiency, the CPU fetches more than one halfword from the ITCM at a time,
      usually 32 bits or more.
    • Instruction Prefetch Queue (IPQ):
      a FIFO between the IFU and the decoder.
    • RISC-V rule:
      if LS2B = 0b11, it is a 32-bit instruction; otherwise it is a 16-bit one.
  • Working Flow (a minimal sketch of the consume logic follows this list)

    • According to the PC value, the IFU reads a word (32 bits) from the ITCM and pushes it into the tail of the IPQ.

    • ID takes a halfword (16 bits) from the head of the IPQ, then judges whether it is a compressed instruction.

      • Situation A: it is a 16-bit compressed instruction.

        ID consumes the first 16 bits in the IPQ and sends them to the following stages as a complete instruction.
        The IPQ read pointer advances by 2 bytes.

      • Situation B: it is the first half of a 32-bit instruction.

        ID needs more data. It consumes the first 32 bits in the IPQ and sends them to the following stages.
        The IPQ read pointer advances by 4 bytes.

    • These steps repeat.
      When fewer than 32 bits remain in the IPQ,
      the IFU performs the next 32-bit read and appends the data to the tail of the IPQ.
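
Here is the minimal sketch of the consume step referred to above; the queue storage itself is omitted and all names are invented for illustration:

// a minimal sketch of the IPQ consume logic, hypothetical signal names
module ipq_consume_sketch(
  input  [15:0] ipq_head,   // the halfword currently at the head of the IPQ
  input  [3:0]  rd_ptr_r,   // current read pointer (in bytes)
  output [3:0]  rd_ptr_nxt  // read pointer after consuming one instruction
);
  wire is_rv32 = (ipq_head[1:0] == 2'b11);                 // 0b11 means a 32-bit instruction
  assign rd_ptr_nxt = rd_ptr_r + (is_rv32 ? 4'd4 : 4'd2);  // pop 4 bytes or 2 bytes
endmodule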

Branch Instructions

There are two types of branch instructions in RISC-V.

  • Unconditional Jump: no judgement condition is needed.
    There are also two types of unconditional jump.

    • Direct: the target address can be computed directly from the imm in the instruction.

      Example: jal x5, imm. imm is 20 bits; jump to address PC + 2*imm (the immediate encodes a halfword offset).

    • Indirect: the target address must be computed from data in the register file.

      Example: jalr x1, x6, imm. imm is 12 bits; jump to address x6 + imm.

  • Conditional Jump: jump only when a condition holds.
    In principle there are also direct and indirect forms, but RISC-V defines no indirect conditional jumps.

Branch Prediction

Branch prediction has to solve two problems:

  • Whether to jump (Direction)
  • What’s the target address (Address)

Static prediction: always predict the same outcome or follow a fixed pattern, e.g. BTFN (Backward Taken, Forward Not-taken).

Jump direction: if the target PC < the current PC, the jump is called backward; otherwise it is called forward.

Dynamic prediction:

  • 1-bit predictor: use the last direction as the prediction,
    and flip it whenever a misprediction occurs.

  • 2-bit saturating counter:

    Look at the state machine (a minimal counter sketch follows the chart):

2 bit Saturating Counter FSM Chart
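
A minimal Verilog sketch of one such 2-bit saturating counter, with an assumed encoding of 00 = strongly not-taken up to 11 = strongly taken (the chart above may label the states differently):

// a minimal sketch of a single 2-bit saturating counter; encoding assumed
module sat2_sketch(
  input  [1:0] cnt_r,        // current counter state
  input        branch_taken, // actual outcome of the branch
  output       prdt_taken,   // prediction derived from the counter
  output [1:0] cnt_nxt       // next counter state
);
  assign prdt_taken = cnt_r[1]; // predict taken in the two "taken" states
  wire [1:0] cnt_inc = (cnt_r == 2'b11) ? 2'b11 : cnt_r + 2'b01; // saturate at 11
  wire [1:0] cnt_dec = (cnt_r == 2'b00) ? 2'b00 : cnt_r - 2'b01; // saturate at 00
  assign cnt_nxt = branch_taken ? cnt_inc : cnt_dec;             // update with the real outcome
endmodule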

A single 2-bit saturating counter is effective for predicting one branch instruction,
but not for many branches at different PC addresses
(they would conflict).
Ideally each branch instruction would have its own predictor,
which would bring unacceptable hardware cost.
So in practice there is only a finite number of predictors, organized into a table (the Branch Prediction Table, BPT).

The prediction procedure works by indexing:

  • An instruction enters the pipeline with PC = 0x12345678.
  • The CPU takes its least significant several bits (12, for example), giving index 0x678 = 1656 in decimal.
  • The CPU accesses the BPT at index 1656 and finds a 2-bit saturating predictor.
  • The predictor makes a prediction, its state is updated afterwards, and so on.

In practice the number of branch instructions is far larger than the number of predictors.
Hence many different instructions have to share the same predictor.
This problem is called aliasing.

There is a more complicated method with better performance,
called a correlation-based branch predictor.

  • Why it is needed

Consider the following code:

if (a > 10) { // branch A
    Hajimi nameiluduo axigaaxi;
}

Dingdongji Dingdongji, Dagou Dagou Jiaojiaojiao;

if (b > 20 && a > 10) { // branch B
    Axiga Yakunalu Hajimi Haji;
}

Whether B is taken depends both on b > 20 and on the outcome of branch A:
if A was not taken, B cannot be taken either.
A single per-branch predictor table cannot capture such a correlation.

Two components:

  • Global History Register (GHR): an N-bit shift register recording the outcomes of the most recent N branches.
  • Pattern History Table (PHT): an array of 2-bit saturating counters.

Index method: PC ^ GHR.
Each 2-bit counter records “when the global history is in a certain pattern, how branch B behaves”,
instead of the raw history of branch B itself.

Procedure:

Suppose the GHR is 2 bits wide, with initial state 00.

  • Execution 1: suppose a = 5. A is not taken, recorded as 0: the GHR shifts left and 0 fills its LSB,
    so GHR = 00.
    B is not taken either, recorded as 0: the GHR shifts left again with 0 in the LSB, and GHR remains 00.

  • Execution 2: suppose a = 15, b = 25. A is taken, so GHR = 01.
    Before executing B, the index idx = Hash(PC_B, 01) is produced,
    which selects a 2-bit counter (suppose its current value is 11,
    meaning that when the previous branch was taken, B tends not to be taken).

    Prediction: B will not be taken.

    Actual result: B is taken, so the prediction fails!
    Counter: 11 -> 10.
    GHR: 01 -> 11.

These steps repeat.
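
A minimal sketch of the GHR update and PHT index generation described above; the widths and names are illustrative only (note that E203 itself only uses the simple static predictor described later):

// a minimal sketch of GHR update and PHT indexing, hypothetical widths and names
module ghr_idx_sketch(
  input  [1:0]  ghr_r,        // 2-bit global history register
  input         branch_taken, // outcome of the branch just resolved
  input  [31:0] pc,           // PC of the branch being predicted
  output [1:0]  ghr_nxt,      // updated history
  output [1:0]  pht_idx       // index into the pattern history table
);
  assign ghr_nxt = {ghr_r[0], branch_taken}; // shift in the newest outcome at the LSB
  assign pht_idx = pc[2:1] ^ ghr_r;          // index = low PC bits XOR global history
endmodule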

6. E200 IFU Implementation

RISC-V puts the length indicator in the lowest-order bits of an instruction.
Hence, the IF logic can recognize the length as soon as it has fetched those bits.
What’s more,
since the compressed instruction set is optional,
a CPU that is not designed to support it
can simply ignore (and not store) those lowest two bits,
which saves about 6.25% of I-cache cost.

Overall Design Concept

The IFU module has the following micro-architecture:

IFU Micro-Architecture

It tries to fetch instructions “fast” and “continuously”.
E203 assumes that most instructions are stored in the ITCM:
since it is designed for ultra-low-power, embedded cases,
it will rarely run long programs,
and usually the whole program fits in the ITCM.

The IF module can fetch an instruction from the ITCM in a single cycle,
which satisfies the requirement of “fast”.
When it needs to fetch instructions through the BIU,
there is extra latency, but such cases are much rarer than ITCM accesses,
so E203 made no optimization for them
(for higher performance, such optimization might be necessary, though).

For “continuously”,
the IF must predict the next PC value every cycle.
It partially decodes the fetched instruction and judges whether it is a branch.
If so,
the branch predictor runs in the same cycle,
and IF uses the prediction together with the decoded information to generate the next PC.

Mini-decode

This module only needs to judge whether an instruction is an ordinary instruction or a branch instruction.
To simplify the design process,
it is implemented by instantiating the complete decode module
with the unrelated inputs tied to ground and the unrelated outputs left unconnected.
The synthesis tool optimizes away the redundant logic, yielding a mini-decoder.

`include "e203_defines.v"

module e203_ifu_minidec(

  // The IR stage to Decoder
  input  [`E203_INSTR_SIZE-1:0] instr,

  // The Decoded Info-Bus
  output dec_rs1en,
  output dec_rs2en,
  output [`E203_RFIDX_WIDTH-1:0] dec_rs1idx,
  output [`E203_RFIDX_WIDTH-1:0] dec_rs2idx,

  output dec_mulhsu,
  output dec_mul,
  output dec_div,
  output dec_rem,
  output dec_divu,
  output dec_remu,

  output dec_rv32, // indicates whether the instruction is 32-bit (1) or 16-bit (0)
  output dec_bjp,  // indicates whether it is a branch/jump instruction
  output dec_jal,  // whether it is jal
  output dec_jalr, // whether it is jalr
  output dec_bxx,  // whether it is a conditional branch instruction
  output [`E203_RFIDX_WIDTH-1:0] dec_jalr_rs1idx, // index of the rs1 register for jalr
  output [`E203_XLEN-1:0] dec_bjp_imm             // immediate of the branch/jump instruction

);

// instantiate the complete decode module
e203_exu_decode u_e203_exu_decode(

  .i_instr(instr),
  .i_pc(`E203_PC_SIZE'b0),
  // ......
);

endmodule

We will study the decode module in detail
in the following sections, not here.

Ready/Valid Handshake

The ready/valid handshake is a protocol that ensures correct transfer of data
between two devices.

Sketch of the Handshake Protocol

The rules are straightforward:
a data transfer happens only when both ready and valid are ‘1’ in the same clock cycle.
The handshake is a stateless protocol.
Neither party needs to remember what happened in previous clock cycles
to determine whether a data transfer occurs in the current cycle.
Furthermore, both parties must operate synchronously and sample the control signals on the same clock edge.
Because of that, ready/valid by itself isn’t appropriate for clock domain crossing (CDC).
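
The whole rule boils down to one condition; a minimal sketch with assumed signal names i_valid, i_ready, i_data:

// a minimal sketch: a transfer happens only in a cycle where valid and ready overlap
wire handshaked = i_valid & i_ready;
// the receiver may capture the data on that same clock edge, e.g.:
// always @(posedge clk) if (handshaked) data_r <= i_data;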

Simple BPU Branch Predictor

To achieve low power, E203 applies the simplest static prediction.
For conditional direct branch instructions,
a backward branch is predicted as taken;
otherwise, it is predicted as not-taken.
Meanwhile the BPU generates the next PC as PC + offset.

The code is in the file e203_ifu_litebpu.v:

`include "e203_defines.v"

module e203_ifu_litebpu(

  // Current PC
  input [`E203_PC_SIZE-1:0] pc,

  // The mini-decoded info
  input dec_jal,  // whether it is jal
  input dec_jalr, // whether it is jalr
  input dec_bxx,  // whether it is a conditional branch
  input [`E203_XLEN-1:0] dec_bjp_imm,            // branch/jump immediate; the target address is PC + imm
  input [`E203_RFIDX_WIDTH-1:0] dec_jalr_rs1idx, // index of rs1 for jalr

  // The IR index and OITF status to be used for checking dependency
  input oitf_empty, // whether the OITF is empty
  input ir_empty,   // whether the IR is empty
  input ir_rs1en,   // whether the instruction in IR uses rs1
  input jalr_rs1idx_cam_irrdidx, // whether the rs1 index of jalr matches the write-back index of IR
  /*
  If the instruction in IR and the upcoming jalr use the same register,
  jalr_rs1idx_cam_irrdidx = 1.
  This signal is used to detect a RAW hazard.
  */

  // The add op to next-pc adder
  output bpu_wait,
  /*
  If a dependency is detected, bpu_wait = 1:
  pause the pipeline and wait until the instruction in IR completes execution.
  */

  output prdt_taken, // prediction result; high means predicted taken
  output [`E203_PC_SIZE-1:0] prdt_pc_add_op1,
  output [`E203_PC_SIZE-1:0] prdt_pc_add_op2,
  // predicted_pc = prdt_pc_add_op1 + prdt_pc_add_op2

  input dec_i_valid, // the decoded instruction is valid

  // The RS1 to read regfile
  output bpu2rf_rs1_ena, // enable the register file read port for rs1
  input ir_valid_clr,    // the valid bit of IR is being cleared
  // when the instruction in IR is cleared, the dependency can be removed.

  input [`E203_XLEN-1:0] rf2bpu_x1,  // value read from the x1 register
  input [`E203_XLEN-1:0] rf2bpu_rs1, // value read from other registers for jalr

  // clock and reset (active low)
  input clk,
  input rst_n
);

// JAL and JALR are always taken; a backward bxx is predicted as taken
assign prdt_taken = (dec_jal | dec_jalr | (dec_bxx & dec_bjp_imm[`E203_XLEN-1]));
/*
`E203_XLEN is the register width.
The BPU predicts taken if the instruction is jal or jalr.
For conditional branches, the MSB of dec_bjp_imm gives the sign of the offset:
1 means a negative offset, i.e. a backward branch, predicted as taken;
0 means a positive offset, i.e. a forward branch, predicted as not-taken.
*/

// A JALR with rs1 == x1 may have a dependency; rs1 == xN may also have a dependency
wire dec_jalr_rs1x0 = (dec_jalr_rs1idx == `E203_RFIDX_WIDTH'd0);
// assigned as 1 when jalr uses x0 (whose value is always 0), a special unconditional jump

wire dec_jalr_rs1x1 = (dec_jalr_rs1idx == `E203_RFIDX_WIDTH'd1);
// assigned as 1 when jalr uses x1 (used as the return address register)

wire dec_jalr_rs1xn = (~dec_jalr_rs1x0) & (~dec_jalr_rs1x1);
// assigned as 1 when jalr uses any other (general) register

// check dependency
wire jalr_rs1x1_dep = dec_i_valid & dec_jalr & dec_jalr_rs1x1 & ((~oitf_empty) | (jalr_rs1idx_cam_irrdidx));
/*
A dependency on register x1 exists when:
decode is valid, the instruction is jalr, its rs1 is x1,
and either the OITF is not empty or the instruction in IR writes x1.
*/
wire jalr_rs1xn_dep = dec_i_valid & dec_jalr & dec_jalr_rs1xn & ((~oitf_empty) | (~ir_empty));
/*
For a general register xn, a dependency is assumed whenever the OITF or the IR is not empty,
instead of comparing register indices.
This is in fact a conservative approach:
it simplifies the hardware logic and reduces area and power,
at the cost of some performance.
Since E203 is designed for ultra-low power,
this small performance loss is acceptable.
*/

wire jalr_rs1xn_dep_ir_clr = (jalr_rs1xn_dep & oitf_empty & (~ir_empty)) & (ir_valid_clr | (~ir_rs1en));
/*
If the dependency is only on the IR stage (the OITF is empty), then when the IR is being cleared,
or it does not use rs1 at all, we can also treat it as no dependency.
*/

// a small FSM that decides when to read rs1 from the regfile to provide the correct value for jalr
wire rs1xn_rdrf_r; // indicates whether the CPU needs to read rs1 from the regfile
wire rs1xn_rdrf_set = (~rs1xn_rdrf_r) & dec_i_valid & dec_jalr & dec_jalr_rs1xn & ((~jalr_rs1xn_dep) | jalr_rs1xn_dep_ir_clr);
wire rs1xn_rdrf_clr = rs1xn_rdrf_r;
wire rs1xn_rdrf_ena = rs1xn_rdrf_set | rs1xn_rdrf_clr;
wire rs1xn_rdrf_nxt = rs1xn_rdrf_set | (~rs1xn_rdrf_clr);

sirv_gnrl_dfflr #(1) rs1xn_rdrf_dfflrs(rs1xn_rdrf_ena, rs1xn_rdrf_nxt, rs1xn_rdrf_r, clk, rst_n);

assign bpu2rf_rs1_ena = rs1xn_rdrf_set;

assign bpu_wait = jalr_rs1x1_dep | jalr_rs1xn_dep | rs1xn_rdrf_set;

// generate operand 1
assign prdt_pc_add_op1 = (dec_bxx | dec_jal) ? pc[`E203_PC_SIZE-1:0]
                       // if jalr uses x0, use constant 0
                       : (dec_jalr & dec_jalr_rs1x0) ? `E203_PC_SIZE'b0
                       : (dec_jalr & dec_jalr_rs1x1) ? rf2bpu_x1[`E203_PC_SIZE-1:0]
                       : rf2bpu_rs1[`E203_PC_SIZE-1:0];

// generate operand 2: the immediate offset
assign prdt_pc_add_op2 = dec_bjp_imm[`E203_PC_SIZE-1:0];

endmodule

In the RISC-V architecture,
x1 is used as the return-address register by default.
In most cases,
jal and jalr write the address of the next instruction into x1 unless another destination is specified.
Hence, in most cases the return address is stored in x1.
To improve performance,
E203 implements a special fast path for x1.

wire dec_jalr_rs1x1 = (dec_jalr_rs1idx == `E203_RFIDX_WIDTH'd1);
// assigned as 1 when jalr uses x1 (used as return address register)

This line judges whether jalr uses x1.
Moreover, the BPU needs to judge whether there is a RAW dependency on x1.
A RAW dependency exists when

  • the OITF is not empty, meaning a long (multi-cycle) instruction is still executing whose result may be written back to x1
    (it might in fact target another register, but a conservative estimate is applied here to reduce area, and the performance loss is ignored); or
  • the instruction in the IR register will write its result back to x1.

So

wire jalr_rs1x1_dep = dec_i_valid & dec_jalr & dec_jalr_rs1x1 & ((~oitf_empty) | (jalr_rs1idx_cam_irrdidx));

indicates such a dependency. The following line

assign bpu_wait = jalr_rs1x1_dep | jalr_rs1xn_dep | rs1xn_rdrf_set;

pulls bpu_wait high when a dependency is detected.
This signal stops the next-PC generation of the IFU until the RAW hazard disappears.
In general, such a stall costs at least one cycle of performance.

If jalr uses a register other than x0 and x1,
E203 applies no special fast path.
Reading xn requires the first read port of the regfile,
so xn can only be read when that port is free.
Meanwhile, the IR must be empty to avoid a RAW hazard
(again, the performance loss is ignored).
If there is no RAW hazard and the read port is free,
the enable of the port is pulled high and the port is occupied:

wire rs1xn_rdrf_set = (~rs1xn_rdrf_r) & dec_i_valid & dec_jalr & dec_jalr_rs1xn & ((~jalr_rs1xn_dep) | jalr_rs1xn_dep_ir_clr);
wire rs1xn_rdrf_clr = rs1xn_rdrf_r;
wire rs1xn_rdrf_ena = rs1xn_rdrf_set | rs1xn_rdrf_clr;
wire rs1xn_rdrf_nxt = rs1xn_rdrf_set | (~rs1xn_rdrf_clr);

sirv_gnrl_dfflr #(1) rs1xn_rdrf_dfflrs(rs1xn_rdrf_ena, rs1xn_rdrf_nxt, rs1xn_rdrf_r, clk, rst_n);

assign bpu2rf_rs1_ena = rs1xn_rdrf_set;

Access Memory

To improve code density,
E203 supports the 16-bit compressed RISC-V instruction set.
Therefore, 32-bit instructions are mixed with 16-bit ones,
which introduces misaligned 32-bit instructions.

Misaligned 32bit Instructions

To deal with this,
E203 uses a leftover buffer.
The IFU fetches 32 bits from the ITCM or the BIU each time.
When the IFU accesses the ITCM,
which is built from SRAM,
the value at the read port is held (remains unchanged) after a read until the next read.
This property saves a 64-bit register.

The ITCM in E203 is 64 bits wide.
One read operation returns a 64-bit word from the port, called a lane.
When fetching at incrementing addresses,
the IFU can fetch from the same lane multiple times, since RISC-V instructions in E203 are at most 32 bits long.
This reduces the number of SRAM reads, because the IFU can keep reading from the held port value until the whole lane is consumed.

If a 32-bit instruction crosses a 64-bit boundary,
the leftover 16 bits are stored in the leftover buffer
and a new SRAM access is issued.
The lowest 16 bits of the new 64-bit word from the SRAM and the 16 bits in the leftover buffer
are concatenated into a complete 32-bit instruction.
This is equivalent to fetching the 32-bit instruction in one cycle,
without any performance loss.
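
A minimal sketch of that concatenation step, with invented signal names:

// a minimal sketch of stitching a boundary-crossing 32-bit instruction
module leftover_concat_sketch(
  input  [63:0] lane_new,   // freshly read 64-bit lane from SRAM
  input  [15:0] leftover_r, // 16 bits saved from the end of the previous lane
  output [31:0] instr       // the reassembled 32-bit instruction
);
  // the earlier-fetched halfword is the low half of the instruction (little-endian layout)
  assign instr = {lane_new[15:0], leftover_r};
endmodule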

If a branch or a pipeline flush is encountered,
and the target instruction crosses a 64-bit boundary,
then two consecutive SRAM reads are necessary.
This means the fetch costs two cycles,
introducing a one-cycle loss.
E203 chooses to forgo optimizing this case because it would bring too much extra area and power cost.


References:

[1] 胡振波, RISC-V架构与嵌入式开发快速入门, 第1版. 北京: 人民邮电出版社, 2019.
[2] 胡振波, 手把手教你设计CPU——RISC-V处理器, 第1版. 北京: 人民邮电出版社, 2018.
