file:: Computer_Organization_and_Design_1681729306797_0.pdf file-path:: ../../../../assets/Computer_Organization_and_Design_1681729306797_0.pdf
-
Computer Abstractions and Technology
ls-type:: annotation hl-page:: 25 hl-color:: yellow id:: 643d2848-6edf-4c05-92c7-4a7de1b9cd22 - Classes of Computing Applications and Their Characteristics ls-type:: annotation hl-page:: 28 hl-color:: yellow id:: 643e2b9c-0bc2-4b02-b2b1-33e25539d5b9
- Below Your Program
ls-type:: annotation
hl-page:: 36
hl-color:: yellow
id:: 643ea1cf-af0e-45ba-97b3-376fd21ee1e3
collapsed:: true
- From a High-Level Language to the Language of Hardware ls-type:: annotation hl-page:: 37 hl-color:: yellow id:: 643ea1d7-cd7d-4e81-8d6f-c268aab04f68
- Under the Covers
ls-type:: annotation
hl-page:: 39
hl-color:: yellow
id:: 643ea295-e170-403a-a43d-71777bb41d9b
collapsed:: true
- The five classic components of a computer are input, output, memory, datapath, and control ls-type:: annotation hl-page:: 40 hl-color:: yellow id:: 643ea2f7-1fa7-4c12-ad2d-34a90d6968b7
- liquid crystal displays (LCDs)
hl-page:: 41
ls-type:: annotation
id:: 643ea91c-7643-4563-9341-f85096313a3b
hl-color:: yellow
- The LCD is not the source of light; it controls the transmission of light. A backlight sits behind the panel, and the LCD contains many rod-shaped molecules that bend the light so that it can pass through. When a current is applied, the rods no longer bend the light, which blocks it and thereby controls the pixel.
- an active matrix that has a tiny transistor switch at each pixel to precisely control current and make sharper images ls-type:: annotation hl-page:: 41 hl-color:: yellow id:: 643ead73-e1a6-4f10-82c5-2760a4ce839f
- instruction set architecture
hl-page:: 45
ls-type:: annotation
id:: 643eb029-9fe9-4013-a4a2-1365e195333b
hl-color:: yellow
- interface between the hardware and low-level software, distinguish architecture from implementation
- Technologies for Building Processors and Memory
ls-type:: annotation
hl-page:: 47
hl-color:: yellow
id:: 643eb311-6b10-4fa3-9aa3-dfd5a59acf2c
- Semiconductor, silicon: add materials to silicon that allow tiny areas to transform into one of three devices: Excellent conductor, Excellent insulator and Transistor (conduct/insulate at some conditions) hl-page:: 48 ls-type:: annotation id:: 643eb66d-0294-46ab-a182-20021b2495c5 hl-color:: yellow
- Silicon ingot sliced into Blank wafers, processed into Patterned wafers, and then Tested wafer, diced into Tested dies, bonded to package, finally Tested packaged dies
- die: Rectangular sections cut from a wafer (actually chip)
- yield: Percentage of good dies from the total dies on the wafer
- Performance
ls-type:: annotation
hl-page:: 51
hl-color:: yellow
id:: 643ec3be-027e-48b5-90e0-cd4a5e901691
- response/execution time: time between the start and completion of a task
hl-page:: 52
ls-type:: annotation
id:: 643fb234-d566-4913-a03b-c574e6a623c4
hl-color:: yellow
- $\text{Performance}_X = \frac{1}{\text{Execution time}_X}$
- Relative Performance: A is ==n times as fast as== B, which means the same program takes 1/n as long on A as on B hl-page:: 54 ls-type:: annotation id:: 643fe045-80bb-4c47-b601-3fdc9175581d hl-color:: yellow
- throughput: the total amount of work done in a given time hl-page:: 53 ls-type:: annotation id:: 643fb242-9401-42fa-bddc-d93e571b6e99 hl-color:: yellow
- Measuring Performance
ls-type:: annotation
hl-page:: 55
hl-color:: yellow
id:: 6440d1fd-a2c1-4d4c-a493-3b5c7d91448e
- Elapsed time: total time to complete a task, including RAM access, IO and other overhead.
- CPU time: time the CPU spends computing for this task; it does not include time spent on IO or waiting to be scheduled
- user CPU time: CPU time spent in the program itself
- system CPU time: CPU time the OS spends performing tasks on behalf of the program (e.g., system calls)
- CPU Performance and Its Factors
ls-type:: annotation
hl-page:: 56
hl-color:: yellow
id:: 6440d4c6-9dc2-4712-9d25-6386325581e5
- clock cycles: discrete time intervals
- clock period: length of a clock cycle
- clock rate: inverse of the clock period
- For a specific program, $\text{CPU time} = \text{CPU clock cycles} \times \text{Clock cycle time} = \frac{\text{CPU clock cycles}}{\text{Clock rate}}$
- Instruction Performance
ls-type:: annotation
hl-page:: 58
hl-color:: yellow
id:: 6440d70f-cde0-41ca-af1b-82d5a777a7a8
- $\text{CPU clock cycles} = \text{Instruction count} \times \text{CPI}$
- CPI (clock cycles per instruction): average number of clock cycles each instruction takes to execute (for one program)
- CPI is a way to compare two different implementations of the same ISA, since the instruction count of a program is then the same on both
- The Classic CPU Performance Equation
ls-type:: annotation
hl-page:: 59
hl-color:: yellow
id:: 6440d99c-e080-4352-8c2d-7d35e897a2ee
- $\text{CPU time} = \frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}}$
- The formula separates the 3 key factors that affect performance
- The only complete and reliable measure of computer performance is time. hl-page:: 61 ls-type:: annotation id:: 6440dc59-a80f-4185-9692-8e4122cad4b4 hl-color:: yellow
- CPI depends on a wide variety of design details in the computer hl-page:: 61 ls-type:: annotation id:: 6440dcd2-ae75-4a59-a2c5-aff6e3bf7953 hl-color:: yellow
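- A minimal C sketch of the classic CPU performance equation. The function name `cpu_time_seconds` and the numbers in the usage note are illustrative assumptions, not taken from the textbook.

```c
#include <stdint.h>

/* CPU time (seconds) = Instruction count x CPI / Clock rate (Hz). */
static double cpu_time_seconds(uint64_t instruction_count, double cpi,
                               double clock_rate_hz) {
    return (double)instruction_count * cpi / clock_rate_hz;
}
```

- For instance, `cpu_time_seconds(10000000000ULL, 2.0, 4e9)` models 10^10 instructions at a CPI of 2.0 on a 4 GHz clock, which works out to 5 seconds.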
- The Power Wall
ls-type:: annotation
hl-page:: 63
hl-color:: yellow
id:: 6440df39-4ac7-4efc-b59a-64db6354ca0f
collapsed:: true
- dynamic energy: The energy consumed when transistors switch states, primary source of energy consumption for CMOS.
- The energy of a single transition: $\text{Energy} \propto \frac12 \times \text{Capacitive load} \times \text{Voltage}^2$
- The power required per transistor: $\text{Power} \propto \frac12 \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$
- Frequency switched is a function of the clock rate
- Capacitive load is a function of fanout (number of transistors connected to an output) and the technology (capacitance of wires and transistors).
- Main way to reduce power is to lower the voltage.
- Lowering the voltage further has a problem: it makes the transistors too leaky, so leakage current (static energy) increases.
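- A small C sketch of how the dynamic power relation scales when a design changes. Each argument is the ratio new/old of one factor; the constant $\frac12$ cancels in the ratio. The function name and the sample ratios are assumptions chosen for illustration.

```c
/* Relative dynamic power of a new design versus an old one.
 * Power is proportional to Capacitive load x Voltage^2 x Frequency switched. */
static double relative_dynamic_power(double cap_ratio, double voltage_ratio,
                                     double freq_ratio) {
    return cap_ratio * voltage_ratio * voltage_ratio * freq_ratio;
}
```

- With a capacitive load of 85% of the old one and both voltage and frequency reduced by 15%, `relative_dynamic_power(0.85, 0.85, 0.85)` is 0.85^4, roughly 0.52, so the new design uses about half the power.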
- The Sea Change: The Switch from Uniprocessors to Multiprocessors
ls-type:: annotation
hl-page:: 66
hl-color:: yellow
id:: 6440e9ca-16a4-48ed-9648-adafb74ce097
collapsed:: true
- This section is about the difficulty of parallel programming and related material.
- Fallacies and Pitfalls
ls-type:: annotation
hl-page:: 72
hl-color:: yellow
id:: 6440ea56-fdd5-426f-9a76-d5b5a7465c55
collapsed:: true
- Amdahl's Law: $\text{Execution time after improvement} = \frac{\text{Execution time affected by improvement} }{\text{Amount of improvement}} + \text{Execution time unaffected}$
hl-page:: 72
ls-type:: annotation
id:: 6441179d-ae59-49b4-8903-874cb5b7c9cd
hl-color:: yellow
- Thus, we CANNOT expect ==improvement of one aspect== of a computer to ==increase overall performance by an amount proportional== to the size of improvement.
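- Amdahl's Law as a one-line C function, as a sketch. The function name and the 100-second example in the note are assumptions used only to make the formula concrete.

```c
/* Execution time after improvement =
 *   Execution time affected / Amount of improvement + Execution time unaffected */
static double time_after_improvement(double affected, double unaffected,
                                     double improvement) {
    return affected / improvement + unaffected;
}
```

- If 80 s of a 100 s program is affected, a 4x improvement gives `time_after_improvement(80, 20, 4)`, which is 40 s: an overall speedup of only 2.5x. No improvement factor can push the total below the 20 s that is unaffected.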
- Computers at low utilization don't necessarily use little power, or in other words, power consumption is not proportional to the system's load. hl-page:: 73 ls-type:: annotation id:: 6441212c-596a-463c-a47a-04478b16268b hl-color:: yellow
- MIPS (million instructions per second) = $\frac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$
hl-page:: 74
ls-type:: annotation
id:: 64412369-dd7a-4a26-9979-be7179f38df6
hl-color:: yellow
- Problem 1: it specifies the instruction execution rate but ignores the capability of each instruction. Computers with different ISAs cannot be compared using MIPS, since their instruction counts differ.
- Problem 2: MIPS varies between programs even on the same computer.
- Problem 3: MIPS can vary independently from performance.
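- A hedged C sketch of Problem 3, showing a MIPS rating and the execution time pointing in opposite directions. The two machines and their numbers are hypothetical, and the helper names are assumptions.

```c
/* MIPS = Clock rate / (CPI x 10^6) */
static double mips_rating(double clock_rate_hz, double cpi) {
    return clock_rate_hz / (cpi * 1e6);
}

/* Execution time = Instruction count x CPI / Clock rate */
static double execution_time(double instruction_count, double cpi,
                             double clock_rate_hz) {
    return instruction_count * cpi / clock_rate_hz;
}
```

- Take two hypothetical 4 GHz machines running the same program: A executes 10 billion instructions at CPI 1.0, B executes 8 billion at CPI 1.1. A has the higher MIPS rating (4000 versus about 3636), yet B finishes first (2.2 s versus 2.5 s).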
- Word List 1
collapsed:: true
- omnipresent: present everywhere; ubiquitous hl-page:: 27 ls-type:: annotation id:: 643e2b82-f5a5-411e-9571-d494858c175a hl-color:: green
- credo: creed, doctrine ls-type:: annotation hl-page:: 30 hl-color:: green id:: 643e473a-2f03-419b-ad3a-8309c33dff15
- unraveling: untangling; clarifying hl-page:: 31 ls-type:: annotation id:: 643e47b3-cc6c-4fd1-83a9-0510b16a5e9c hl-color:: green
- acronyms: abbreviations formed from initial letters ls-type:: annotation hl-page:: 32 hl-color:: green id:: 643e485f-8de8-41bf-86ac-812ba202f4c8
- leverage: influence; the action of a lever hl-page:: 33 ls-type:: annotation id:: 643e4871-3ebb-4578-9227-b40a534adeac hl-color:: green
- intrinsic: inherent, internal, essential ls-type:: annotation hl-page:: 33 hl-color:: green id:: 643e4882-a5ea-4bff-9b5f-17f585313142
- weave: to interlace threads; to make up (a story) hl-page:: 34 ls-type:: annotation id:: 643e492d-5e63-4b9b-93f7-4f44bf50158e hl-color:: green
- rod: pole; stick; bar ls-type:: annotation hl-page:: 41 hl-color:: green id:: 643ea931-ff7e-4bd7-96d1-b7eab4dcc563
- helix n.: spiral hl-page:: 41 ls-type:: annotation id:: 643ea93a-fa50-486a-b74a-d96f2a4df9aa hl-color:: green
- raster: grid of scan lines ls-type:: annotation hl-page:: 41 hl-color:: green id:: 643ea8f8-7e5f-42e3-a04a-01cd91f25d13
- brawn: physical strength; well-developed muscles ls-type:: annotation hl-page:: 42 hl-color:: green id:: 643eaede-f413-4717-9136-e28363909bb3
- quadruple: four times as much; fourfold hl-page:: 48 ls-type:: annotation id:: 643eb37e-2927-4100-b8b3-c76bbe5450f4 hl-color:: green
- slam: to shut (a door or window) with a bang; to criticize harshly hl-page:: 65 ls-type:: annotation id:: 6440e306-7625-4883-b3f0-fbdca42d92e3 hl-color:: green
- faucet: water tap ls-type:: annotation hl-page:: 65 hl-color:: green id:: 6440e5e4-e02f-43e3-9311-64f8b6d67f75
- unwieldy ls-type:: annotation hl-page:: 65 hl-color:: green id:: 6440e73c-8061-47b2-bc37-522db24f1707
- startling ls-type:: annotation hl-page:: 66 hl-color:: green id:: 6440e8c9-0024-4751-8233-43e8cea16699
- stiffer ls-type:: annotation hl-page:: 68 hl-color:: green id:: 6440e99d-0c7d-4acb-9ba7-9172e1d383d8
- ensnared ls-type:: annotation hl-page:: 72 hl-color:: green id:: 6440ec21-aef2-4494-8b50-d16a83c0d9bb
- corollary ls-type:: annotation hl-page:: 72 hl-color:: green id:: 6441170f-b51b-453a-b612-0b71b2b6032d
- demoralize ls-type:: annotation hl-page:: 72 hl-color:: green id:: 64411718-be78-469d-9b83-0dfd9c83338b
- plague ls-type:: annotation hl-page:: 72 hl-color:: green id:: 64411720-4c6a-49ce-8a72-4aa19e7b8482
- preclude ls-type:: annotation hl-page:: 75 hl-color:: green id:: 6440ebb6-a98d-465f-8345-3da49486f653
- constituent ls-type:: annotation hl-page:: 75 hl-color:: green id:: 6440ebc6-59b1-4b7e-ae67-75559989873b
- impeachable ls-type:: annotation hl-page:: 75 hl-color:: green id:: 6440ebd7-8c0c-4a18-9d86-bd95714f58ac
-
Instructions: Language of the Computer
ls-type:: annotation hl-page:: 83 hl-color:: yellow id:: 64412821-6b54-47a0-9317-a4b042989fdf - Operations of the Computer Hardware
ls-type:: annotation
hl-page:: 86
hl-color:: yellow
id:: 64412ca1-c9b5-4d6b-ba19-d353992dd2f1
collapsed:: true
- Three-operand arithmetic instructions
- Operands of the Computer Hardware
ls-type:: annotation
hl-page:: 89
hl-color:: yellow
id:: 64412cc0-59a8-4a38-9094-1a7bd916a41f
collapsed:: true
- Registers, where operands of arithmetic instructions must reside
- Register size is a word (32 bit)
- 32 registers in MIPS.
- fewer registers to keep clock cycles fast (though 31 registers may not be faster than 32)
- instruction format (5-bit field for register number)
- data transfer instructions
ls-type:: annotation
hl-page:: 91
hl-color:: yellow
id:: 64412fc0-7ad1-4c8a-9c03-c198e741605b
- memory to register or inverse
- alignment restriction: words must start at addresses that are multiples of 4. As a result, there are restrictions on the addresses that `lw`/`sw` can use hl-page:: 92 ls-type:: annotation id:: 6441425c-134d-449d-ae39-4db48a67054c hl-color:: yellow
- memory is byte-addressed; remember this especially when dealing with array indices, because the size of the array element type determines the offset.
- MIPS is in the big-endian camp (though the textbook says so, the latest MIPS32 by default is little endian) hl-page:: 93 ls-type:: annotation id:: 64414940-7d06-4339-af0b-974b1b34dbc5 hl-color:: yellow
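- A short C sketch of byte addressing and the alignment restriction noted above. The helper names `byte_offset` and `is_word_aligned` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset of element `index` in an array of `elem_size`-byte elements. */
static size_t byte_offset(size_t index, size_t elem_size) {
    return index * elem_size;
}

/* A word access (lw/sw) needs an address that is a multiple of 4. */
static bool is_word_aligned(uint32_t address) {
    return (address & 0x3u) == 0;
}
```

- For an array of 4-byte words, element 8 sits at byte offset 32, which is why the load is written `lw $t0, 32($s3)` rather than with an offset of 8.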
- Constant or Immediate Operands
ls-type:: annotation
hl-page:: 95
hl-color:: yellow
id:: 64414a51-2d38-48ff-b0c0-bc53f9c5fadb
- Constant operands occur frequently, and by ==including constants inside arithmetic instructions==, operations are much ==faster== and use ==less energy== than if constants were ==loaded from memory==. hl-page:: 95 ls-type:: annotation id:: 64414af2-05ea-4a88-baf6-a19462b4c3a9 hl-color:: yellow
- Since MIPS supports ==negative constants==, there is no need for subtract immediate in MIPS. ls-type:: annotation hl-page:: 96 hl-color:: yellow id:: 64414b4e-cf31-4e7f-8320-2f1bbcbf9b32
- Signed and Unsigned Numbers
ls-type:: annotation
hl-page:: 96
hl-color:: yellow
id:: 64414b5f-de73-4bbc-812d-8ebd0f082ea0
collapsed:: true
- binary digits
hl-page:: 96
ls-type:: annotation
id:: 64414bb1-c764-493b-b555-4e241a31f255
hl-color:: yellow
- value of the $i$th digit: $d \times \text{Base}^i$
- LSB and MSB
- Numbers have an infinite number of digits; binary bit patterns are simply representatives of numbers. Thus, there are various ways of handling overflow. hl-page:: 97 ls-type:: annotation id:: 64414c34-4dc9-4127-9938-faf0374b6c29 hl-color:: yellow
- Signed numbers
- sign and magnitude: add a separate sign bit. Problems with this approach: an extra step is needed to set the sign during calculation, and there are both a negative and a positive zero hl-page:: 98 ls-type:: annotation id:: 64414d4d-71a4-44df-9475-d710bfee40d3 hl-color:: yellow
- two's complement
- the value of this form can be written as $(d_{31} \times -2^{31}) + d_{30} \times 2^{30} + \dots + d_{0} \times 2^{0}$; note the first term, $-2^{31}$ id:: 64414f1b-142c-4301-a5a2-6dc0ad3b102b
- one's complement: the negate operation simply inverts each bit
- sign extension: copy the sign repeatedly to fill the rest of the register when loading from memory
hl-page:: 99
ls-type:: annotation
id:: 64414fbb-4773-4332-bf21-f533847d0bde
hl-color:: yellow
- This trick works because positive 2's complement numbers really have an infinite number of 0s on the left and negative 2's complement numbers have an infinite number of 1s. The binary bit pattern representing a number hides leading bits to fit the width of the hardware; sign extension simply restores some of them. hl-page:: 101 ls-type:: annotation id:: 64415085-a9e5-4193-a5d1-9c90f5d63ea8 hl-color:: yellow
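- The two's complement weighting and sign extension can be sketched in C as follows. Both function names are hypothetical; the second function evaluates a bit pattern term by term using the weights given above, with $-2^{31}$ for the top bit.

```c
#include <stdint.h>

/* Sign-extend a 16-bit pattern to 32 bits by replicating bit 15. */
static uint32_t sign_extend16(uint16_t half) {
    uint32_t wide = half;
    if (half & 0x8000u)
        wide |= 0xFFFF0000u;   /* restore the hidden leading 1s */
    return wide;
}

/* Value of a 32-bit two's complement pattern: bit 31 weighs -2^31,
 * every other bit i weighs +2^i. */
static int64_t twos_complement_value(uint32_t bits) {
    int64_t value = 0;
    for (int i = 0; i < 31; ++i)
        if ((bits >> i) & 1u)
            value += (int64_t)1 << i;
    if ((bits >> 31) & 1u)
        value -= (int64_t)1 << 31;
    return value;
}
```

- The 16-bit pattern `0xFFFE` (that is, -2) extends to `0xFFFFFFFE`, which still evaluates to -2; a positive pattern such as `0x0002` is unchanged.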
- Representing Instructions in the Computer
ls-type:: annotation
hl-page:: 103
hl-color:: yellow
id:: 64414d2f-3ea8-45a6-9a7a-b84f74a554cf
collapsed:: true
- MIPS Fields
ls-type:: annotation
hl-page:: 105
hl-color:: yellow
id:: 64415179-3df0-431e-9e53-8608796931dd
- In order to keep all instructions the same length (one word), MIPS uses different field layouts for different types of instructions.
- R-type: `op | rs | rt | rd | shamt | funct`
- I-type: `op | rs | rt | constant/address`
- The 16-bit address means a `lw` can only load from a region of $\pm 2^{15}$ bytes around the address in the base register.
- here `rt` serves as the destination register
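- A sketch of packing the R-type and I-type fields into a 32-bit word. It assumes the standard MIPS field widths (6/5/5/5/5/6 bits for R-type, 6/5/5/16 for I-type), which the notes above do not spell out; the function names are hypothetical.

```c
#include <stdint.h>

/* R-type: op(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6) */
static uint32_t encode_r(uint32_t op, uint32_t rs, uint32_t rt, uint32_t rd,
                         uint32_t shamt, uint32_t funct) {
    return ((op & 0x3Fu) << 26) | ((rs & 0x1Fu) << 21) | ((rt & 0x1Fu) << 16) |
           ((rd & 0x1Fu) << 11) | ((shamt & 0x1Fu) << 6) | (funct & 0x3Fu);
}

/* I-type: op(6) | rs(5) | rt(5) | constant/address(16) */
static uint32_t encode_i(uint32_t op, uint32_t rs, uint32_t rt, uint16_t imm) {
    return ((op & 0x3Fu) << 26) | ((rs & 0x1Fu) << 21) | ((rt & 0x1Fu) << 16) |
           imm;
}
```

- With `$t0` = 8 and `$s1`..`$s3` = 17..19, `add $t0, $s1, $s2` is `encode_r(0, 17, 18, 8, 0, 32)`, giving `0x02324020`, and `lw $t0, 32($s3)` is `encode_i(35, 19, 8, 32)`, giving `0x8E680020`.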
- Logical Operations
ls-type:: annotation
hl-page:: 110
hl-color:: yellow
id:: 64415118-595d-4125-b641-333d82a58006
collapsed:: true
- `sll` and `srl` use the `shamt` (shift amount) field
- `andi` and `ori` extend their 16-bit constant field by filling with 0s
- there is no dedicated instruction for bitwise NOT, but there is a `nor` (not or, `a NOR b = NOT(a OR b)`) instruction (perhaps in order to keep the 3-operand format)
- Instructions for Making Decisions
ls-type:: annotation
hl-page:: 113
hl-color:: yellow
id:: 644154b4-a07e-46fd-aa88-178297b61434
collapsed:: true
- conditional branches: `bne` and `beq` hl-page:: 113 ls-type:: annotation id:: 644156a7-b2b5-4010-97e3-a432f077cd33 hl-color:: yellow
- Loops ls-type:: annotation hl-page:: 115 hl-color:: yellow id:: 64415778-92af-4d30-b3b4-0b3dddd397f4
- `slt` and `slti`: if `rs < rt` / `rs < imm` then the destination register is set to 1, else to 0
- MIPS assemblers use the combination of `slt`/`slti`, `beq`/`bne` and `$zero` to create all relative conditions
- `sltu`/`sltiu`: signed and unsigned comparisons are different, thus unsigned versions are provided
- Case/Switch Statement: jump address table and `jr` instruction (the runtime destination address is stored in a register) hl-page:: 118 ls-type:: annotation id:: 644159a7-3b96-4924-8260-0cb300307c86 hl-color:: yellow
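- To illustrate why both signed and unsigned comparisons exist, here is a C model of `slt` and `sltu` acting on the same 32-bit register contents. The C functions are stand-ins for the instructions, not real MIPS code, and the cast to `int32_t` assumes the usual two's complement conversion.

```c
#include <stdint.h>

/* slt: compare the register contents as signed numbers. */
static uint32_t slt(uint32_t rs, uint32_t rt) {
    return (int32_t)rs < (int32_t)rt;
}

/* sltu: compare the same bit patterns as unsigned numbers. */
static uint32_t sltu(uint32_t rs, uint32_t rt) {
    return rs < rt;
}
```

- With `rs = 0xFFFFFFFF` and `rt = 1`, `slt` yields 1 (since -1 < 1) while `sltu` yields 0 (since 4294967295 > 1). A side effect is that one `sltu` of an index against a length checks both `index >= 0` and `index < length`, because a negative index looks like a huge unsigned number.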
- Supporting Procedures in Computer Hardware
ls-type:: annotation
hl-page:: 119
hl-color:: yellow
id:: 644156f5-e485-4140-be11-6ef87a585383
collapsed:: true
- `jal` jumps to an address and simultaneously saves the address of the following instruction in `$ra`
- `jr` jumps to the address specified in a register
- Calling convention for registers:
- `$a0`–`$a3`: four argument registers in which to pass parameters
- `$v0`–`$v1`: two value registers in which to return values
- `$ra`: one return address register to return to the point of origin
- `$t0`–`$t9`: temporary registers that are not preserved by the callee on a procedure call id:: 644163d9-8d4b-43e8-acec-57a835c4ce48
- `$s0`–`$s7`: saved registers that must be preserved on a procedure call (if used, the callee saves and restores them)
- `$sp`: stack pointer to the most recently allocated address; `push` subtracts from `$sp` and `pop` adds to `$sp`
- `$fp`: frame pointer to the first word of the frame of a procedure
- `$gp`: pointer to global static data
- FIGURE 2.11 What is and what is not preserved across a procedure call. ls-type:: annotation hl-page:: 125 hl-color:: yellow id:: 64416615-2966-4adb-a524-845337e588d3
- Allocating Space for New Data on the Stack
ls-type:: annotation
hl-page:: 126
hl-color:: yellow
id:: 6441667f-2639-458b-91fd-8bf6b5a2c6ae
- stack is also used to store variables that are local to the procedure but do not fit in registers
- procedure frame or activation record ls-type:: annotation hl-page:: 126 hl-color:: yellow id:: 64416688-3f6e-444e-b265-3e4a36ec51b8
- a frame pointer offers a stable base register within a procedure for local memory references, since the stack pointer may change during the procedure
- ASCII and String
hl-page:: 129
ls-type:: annotation
id:: 644167c0-1ac2-42df-b01d-4fff03a393e7
hl-color:: yellow
collapsed:: true
- `lb`/`lbu` and `sb` load/store the rightmost byte; `lh`/`lhu` and `sh` load/store the lower halfword
- Some notes about how to organize a string
- character size
- length or end mark
- MIPS Addressing for 32-bit Immediates and Addresses
ls-type:: annotation
hl-page:: 134
hl-color:: yellow
id:: 6441f1a7-4442-466f-9edc-776f6e0e6ecb
collapsed:: true
- 32-Bit Immediate Operands: `lui` loads a halfword into the upper 16 bits of a register, and then an `ori` sets the lower 16 bits, thus loading a 32-bit immediate hl-page:: 135 ls-type:: annotation id:: 6441fb16-91e4-42d8-94bf-5718dd9fc91b hl-color:: yellow
- Addressing in Branches and Jumps
ls-type:: annotation
hl-page:: 136
hl-color:: yellow
id:: 64427722-3a05-43f1-b00e-7dcc30cc74ff
- J-type instruction: `op | address (26 bits)`
- Since MIPS instructions are all 4-byte aligned, the unit of the address in PC-relative addressing is actually a word. For example, the 16-bit address in a branch instruction actually represents an 18-bit byte offset.
- MIPS Addressing Mode
ls-type:: annotation
hl-page:: 139
hl-color:: yellow
id:: 64427a40-988f-4430-aaa5-84d3926c6234
-
- Immediate addressing: the operand is a constant within the instruction itself (e.g., `addi $rd, $rs, 4`)
- Register addressing: the operand is a register (e.g., `add $rd, $rs, $rt`)
- Base (displacement) addressing: the operand is at the memory location whose address is the sum of a register and a constant in the instruction (e.g., `lw $rd, 4($rs)`)
- PC-relative addressing: the branch address is the sum of the PC and a constant in the instruction (e.g., `beq $rs, $rt, #addr`)
- Pseudo-direct addressing: the jump address is the 26 bits of the instruction concatenated with the upper bits of the PC (e.g., `j #addr`)
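- The address arithmetic of this section, sketched in C. The function names are hypothetical, and the sketch assumes the usual MIPS convention that branch and jump targets are computed from PC + 4 (the address of the following instruction).

```c
#include <stdint.h>

/* lui + ori: build a 32-bit constant from two 16-bit halves. */
static uint32_t load_imm32(uint16_t upper, uint16_t lower) {
    return ((uint32_t)upper << 16) | lower;
}

/* PC-relative: the 16-bit field counts words, so it is sign-extended
 * and shifted left by 2 before being added to PC + 4. */
static uint32_t branch_target(uint32_t pc, int16_t word_offset) {
    return pc + 4 + ((uint32_t)(int32_t)word_offset << 2);
}

/* Pseudo-direct: upper 4 bits of PC + 4, then the 26-bit field, then 00. */
static uint32_t jump_target(uint32_t pc, uint32_t address26) {
    return ((pc + 4) & 0xF0000000u) | ((address26 & 0x03FFFFFFu) << 2);
}
```

- For example, `load_imm32(61, 2304)` yields 4000000; a branch at address 80012 with a field value of 2 targets 80024; and a jump field of 20000 in low memory targets byte address 80000.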
- Parallelism and Instructions: Synchronization
ls-type:: annotation
hl-page:: 144
hl-color:: yellow
id:: 644284b0-dd95-46e5-829f-4509200e5f8d
collapsed:: true
- a set of hardware primitives with the ability to atomically read and modify a memory location hl-page:: 144 ls-type:: annotation id:: 64428547-07fb-49e2-9e3f-a7ac95b7e8e0 hl-color:: yellow
- atomic exchange: interchange a value in a register for a value in memory
- Introduces some challenges in the processor design
- `while (xchg(&lock, 1) == 1) ;`
- MIPS `ll`/`sc`: a pair of instructions in which the second instruction returns a value showing whether the pair executed as if it were one atomic instruction.
- `while (ll(&lock) == 1 || sc(&lock, 1) == 0) ;` (keep retrying while the lock is held or the store conditional fails)
- `sc` will fail after either another attempted store to the `ll`ed address or ==any exception==. It is possible to create a deadlock where `sc` can never complete due to repeated page faults.
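- The atomic-exchange spin lock can be sketched with C11 `<stdatomic.h>`, where `atomic_exchange` plays the role of the `xchg` primitive. The variable and function names are hypothetical, and how the exchange is implemented (for instance with an `ll`/`sc` loop) is left to the compiler.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int lock_word = 0;   /* 0 = free, 1 = held */

/* One attempt: succeeds only if the old value was 0 (lock was free). */
static bool spin_trylock(void) {
    return atomic_exchange(&lock_word, 1) == 0;
}

/* Spin until the exchange returns 0, i.e. until we took a free lock. */
static void spin_lock(void) {
    while (atomic_exchange(&lock_word, 1) == 1) {
        /* busy-wait */
    }
}

static void spin_unlock(void) {
    atomic_store(&lock_word, 0);
}
```

- Writing 1 while reading the old value in one indivisible step is what makes this safe: two contenders cannot both observe 0.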
- Translating and Starting a Program
ls-type:: annotation
hl-page:: 146
hl-color:: yellow
id:: 64428b6e-1dba-4bd4-ab91-d18ae9cb9cf6
collapsed:: true
- Assembler
ls-type:: annotation
hl-page:: 147
hl-color:: yellow
id:: 64428b72-2291-4380-a2ad-dc5cdc82315e
- pseudoinstructions: the assembler translates these instructions into equivalent machine instructions. Register `$at` is reserved for such translations. hl-page:: 147 ls-type:: annotation id:: 64428b7b-0678-4fc3-a105-71fdc6185144 hl-color:: yellow
- Example 1: `move $t0, $t1` -> `add $t0, $t1, $zero`
- Example 2: `blt $t0, $t1, LABEL1` -> `slt $at, $t0, $t1; bne $at, $zero, LABEL1`
- The assembler turns the assembly language program into an object file, which is a combination of ==machine language instructions==, ==data==, and information needed to place instructions properly in memory (==symbol table==, ==relocation information==). hl-page:: 148 ls-type:: annotation id:: 64428ce2-cd44-4d6d-a311-dc7aa3f656ab hl-color:: yellow
- The object file for UNIX systems typically contains six distinct pieces: object file header, text segment, static data segment, relocation information, symbol table, and debug information hl-page:: 148 ls-type:: annotation id:: 64428f56-bd42-4cc4-a4dd-bf8b05e76fc3 hl-color:: yellow
- Linker
ls-type:: annotation
hl-page:: 149
hl-color:: yellow
id:: 64428f9a-632e-4bdb-80d5-56febc2bb977
- Recompiling the whole program after each change to a single procedure is a huge waste, so procedures are compiled/assembled independently and finally linked together.
- 3 steps for the linker:
- hl-page:: 149
ls-type:: annotation
id:: 6442909f-b0b5-4706-8454-f89466e3ff7e
hl-color:: yellow
- Place code and data modules symbolically in memory.
- Determine the addresses of data and instruction ==labels==.
- Patch both the internal and external ==references==.
- Example Problem: Linking Object Files hl-page:: 150 ls-type:: annotation id:: 6442957a-fb16-4069-8e74-d470c01ba18c hl-color:: yellow
- Dynamically Linked Libraries
ls-type:: annotation
hl-page:: 152
hl-color:: yellow
id:: 644292fb-c10c-43f6-8152-941b009e14c2
- Library routines are not linked and loaded until the program is run. Keep extra info on the location and name of non-local procedures. hl-page:: 152 ls-type:: annotation id:: 644295c2-95e3-4da8-ba89-5470c9049fce hl-color:: yellow
- The program loader uses the extra information to find the proper libraries and ==update all external references==.
- Lazy procedure linkage: instead of linking all library routines that might be called, link only those that are actually called at runtime.
- Assume there is a table of entries for external routines. At the static linkage stage, all entries are set to the dummy address of a dynamic linker/loader. At runtime, the program jumps to this dummy address and executes the linker/loader, which finds the desired routine, remaps it, and changes the address in the indirect jump location. The next time this routine is called, the indirect jump goes directly to the desired routine.
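- A toy C model of lazy procedure linkage using a function pointer as the indirect jump location. It is only an analogy for the mechanism described above: every name is hypothetical, and a real dynamic linker locates and maps the routine from a library instead of referring to a local function.

```c
static int resolve_count = 0;   /* how many times the "linker" ran */

/* The routine that lives in the "library". */
static int real_square(int x) { return x * x; }

static int square_stub(int x);

/* Indirect jump location: initially points at the dummy stub. */
static int (*square_entry)(int) = square_stub;

/* Stub standing in for the dynamic linker/loader: it "finds" the routine,
 * patches the indirect jump location, then completes the call. */
static int square_stub(int x) {
    ++resolve_count;
    square_entry = real_square;
    return square_entry(x);
}

/* Call site: always jumps through the table entry. */
static int call_square(int x) { return square_entry(x); }
```

- The first `call_square` pays for the lookup; later calls go straight to `real_square`, so `resolve_count` stays at 1.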
- A C Sort Example to Put It All Together
ls-type:: annotation
hl-page:: 155
hl-color:: yellow
id:: 6442a0c7-e518-46f3-979d-9c104093343b
collapsed:: true
- Skipped, since it is easy
- Arrays versus Pointers
ls-type:: annotation
hl-page:: 164
hl-color:: yellow
id:: 6442a0b8-5961-423d-bbcf-96cf55fd55cf
collapsed:: true
- An example piece of code that iterates over an array using both a pointer and an index
- Advanced Material: Compiling C and Interpreting Java
ls-type:: annotation
hl-page:: 168
hl-color:: yellow
id:: 6442a09e-3edb-4593-833a-626506177900
collapsed:: true
- Skipped, since it is compiler's job (Control Flow Graph???)
- Real Stuff: ARMv7 (32-bit) Instructions ls-type:: annotation hl-page:: 194 hl-color:: yellow id:: 6442a15d-dfe5-49b0-9b78-5f109e674e2f
- Real Stuff: x86 Instructions ls-type:: annotation hl-page:: 198 hl-color:: yellow id:: 6442a152-baff-4ccb-aee2-2f548e14903e
- Real Stuff: ARMv8 (64-bit) Instructions
ls-type:: annotation
hl-page:: 207
hl-color:: yellow
id:: 6442a1ad-0491-48d7-84b7-7feba159d9dd
collapsed:: true
- The philosophy of ARMv8 is much closer to MIPS than to ARMv7. For example, it has a counterpart of `$zero`, and counterparts of `beq`/`bne` instead of the condition bits
- Design Principles
collapsed:: true
- Design Principle 1: Simplicity favors regularity. ls-type:: annotation hl-page:: 88 hl-color:: yellow id:: 644152b5-ee31-4da9-86e4-33d7472f04c3
- Design Principle 2: Smaller is faster. ls-type:: annotation hl-page:: 90 hl-color:: yellow id:: 644152aa-00c2-4542-abaf-7048e6d37904
- Design Principle 3: Good design demands good compromises. ls-type:: annotation hl-page:: 106 hl-color:: yellow id:: 64415292-0727-4366-8717-ecca11267baf
- Word List 2
collapsed:: true
- palatable: tasty; pleasant to eat ls-type:: annotation hl-page:: 86 hl-color:: green id:: 64412a38-84ed-4f97-b83d-911772eb7158
- rationale: underlying principle; fundamental reason hl-page:: 86 ls-type:: annotation id:: 64412bbd-513c-4e00-8576-7ef88749e552 hl-color:: green
- moot: not worth considering; of no practical significance ls-type:: annotation hl-page:: 99 hl-color:: green id:: 64414e6e-e6f7-4d05-865f-a18455c509ba
- dichotomy: division into two; duality (the separation between two opposite groups) hl-page:: 117 ls-type:: annotation id:: 6441591a-02ed-4556-8fd7-5fdb310063e7 hl-color:: green
- spill: to (cause to) pour out, splash out, or overflow ls-type:: annotation hl-page:: 121 hl-color:: green id:: 64416330-597c-4245-8d53-a5dc643ea05f
- wax and wane: (of the moon) to grow fuller / to shrink hl-page:: 127 ls-type:: annotation id:: 64416671-9318-4296-9588-c0421c02cdd2 hl-color:: green
- interpose: to place between (two things); to interject hl-page:: 144 ls-type:: annotation id:: 64428502-8d37-4bad-a24f-6edfa9796740 hl-color:: green
- succinct: brief and clear; concise hl-page:: 148 ls-type:: annotation id:: 64428c72-5e4f-4f33-a1ac-cadd4610f04a hl-color:: green
- stitch: to sew ls-type:: annotation hl-page:: 149 hl-color:: green id:: 6442903b-ae86-4608-8d83-60771782b088
- anatomy: the study of body structure ls-type:: annotation hl-page:: 168 hl-color:: green id:: 64429b75-4433-4e39-8085-ad2e791dbf33
- headstart: an early lead ls-type:: annotation hl-page:: 207 hl-color:: green id:: 6442a2ee-8c2e-4ed2-a67f-23db50972a71
- toil: to work hard (for a long time); to labor hl-page:: 209 ls-type:: annotation id:: 6442a05d-69ba-4161-9ffa-bb305a17fcf1 hl-color:: green
-
Arithmetic for Computers
hl-page:: 225 ls-type:: annotation id:: 6442a5d0-e073-4cd4-a6c6-1bb664ee952a hl-color:: yellow - Addition and Subtraction
ls-type:: annotation
hl-page:: 227
hl-color:: yellow
id:: 64433f1c-c023-429e-88d0-46cad778477c
collapsed:: true
- Addition is to add digits bit by bit from right to left with carries passed to the left digit.
- Subtraction uses addition: the second operand is negated before adding.
- Overflow
collapsed:: true
- The result cannot be represented with the hardware.
- ==No overflow== can occur when ==adding operands with different signs== or ==subtracting operands with the same sign==.
- Overflow occurs when adding two positives yields a negative sum, or vice versa; and when subtracting a negative from a positive yields a negative result, or vice versa.
- For software detection, you can use `xor` to detect a sign difference.
- For overflow (carry) of unsigned numbers, though often ignored, use the inequality $2^{32}-1 \lt A + B \rightarrow 2^{32}-1-A \lt B \rightarrow \overline{A} \lt B$, where $2^{32}-1$ is MAXUINT
- A supplement from the 408 exam material (the Chinese graduate entrance exam for computer science), which gives 3 detection methods (essentially the same). Let $A + B = S$.
- Single sign bit: the method in the English textbook, suitable for software detection (because there is neither a carry signal nor a double sign bit available): $\text{OF} = A_sB_s\overline{S_s}+\overline{A_s}\overline{B_s}S_s$
- Double sign bits: this just prepends 2 sign bits to the MSB. The double sign bits $S_{s1}S_{s2}$ of the result have 4 combinations, indicating no overflow or positive/negative overflow: $\text{OF} = S_{s1}\oplus S_{s2}$
- Carry out of the sign bit and carry out of the most significant magnitude bit: $\text{OF} = C_{n} \oplus C_{n-1}$
- In MIPS, `add`/`addi`/`sub` cause exceptions on overflow, while `addu`/`addiu`/`subu` do not.
- Since C ignores overflow, MIPS C compilers always generate the `*u` instructions.
- saturating operation: on overflow, set the result to the MAX/MIN value rather than wrapping modulo $2^{32}$ hl-page:: 230 ls-type:: annotation id:: 644347f7-9833-4208-99f7-655f94b5a7b5 hl-color:: yellow
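- The overflow checks of this section, sketched in C under the assumption of 32-bit operands. The function names are hypothetical. The signed check uses `xor` on the sign bits: overflow happened exactly when the sum's sign differs from the sign of both operands.

```c
#include <stdbool.h>
#include <stdint.h>

/* Signed overflow of a + b, detected from the sign bits only. */
static bool add_overflows(int32_t a, int32_t b) {
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t sum = ua + ub;              /* wraps modulo 2^32 */
    return (((ua ^ sum) & (ub ^ sum)) >> 31) != 0;
}

/* Unsigned carry: 2^32 - 1 < A + B is equivalent to ~A < B. */
static bool add_carries(uint32_t a, uint32_t b) {
    return (uint32_t)~a < b;
}

/* Saturating add: clamp to INT32_MAX / INT32_MIN instead of wrapping. */
static int32_t add_saturating(int32_t a, int32_t b) {
    if (add_overflows(a, b))
        return a < 0 ? INT32_MIN : INT32_MAX;
    return a + b;                        /* safe: no overflow here */
}
```

- The wrapped sum is computed on unsigned values because signed overflow is undefined behavior in C, whereas unsigned arithmetic wraps.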
- Multiplication
ls-type:: annotation
hl-page:: 232
hl-color:: yellow
id:: 64434f2d-d63b-4840-96da-918f7a04cb97
collapsed:: true
- Names of the operands: `product = multiplicand * multiplier`
- Observation
- An n-bit multiplicand and an m-bit multiplier produce an (m+n)-bit product, so overflow is possible when the result must fit in fewer bits
- The manual multiplication method in essence is a ==shift-and-add== process.
- Sequential Version of the Multiplication Algorithm and Hardware
ls-type:: annotation
hl-page:: 233
hl-color:: yellow
id:: 64435253-b5bf-4832-ad91-025cede3bafd
collapsed:: true
- Naive version
- Three registers, namely 64-bit multiplicand, 32-bit multiplier and 64-bit product.
- Pseudo code for the algorithm
```c
#include <stdint.h>

uint64_t multiply(uint32_t A, uint32_t B) {
    uint64_t multiplicand = A;   // 32-bit value held in a 64-bit register
    uint32_t multiplier = B;
    uint64_t product = 0;
    for (int i = 0; i < 32; ++i) {
        if (multiplier & 0x1)    // 1. test multiplier[0] and add to product
            product += multiplicand;
        // else do nothing, or add 0
        multiplicand <<= 1;      // 2. left shift multiplicand
        multiplier >>= 1;        // 3. right shift multiplier
    }
    return product;
}
```
- Though the textbook says that each iteration takes 3 clock cycles, I think all of these can be done in 1 cycle (though the timing would be rather poor). The refined version undoubtedly needs only 1 cycle per iteration.
- Refined version
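- Since the refined version is not written out in these notes, here is a sketch of it as I understand the textbook's refined hardware: the multiplier starts in the right half of the 64-bit product register, the 32-bit adder only touches the left half, and the whole register shifts right each step. The function name and the explicit `carry` variable (modeling the adder's carry-out bit) are assumptions.

```c
#include <stdint.h>

static uint64_t multiply_refined(uint32_t multiplicand, uint32_t multiplier) {
    uint64_t product = multiplier;       /* multiplier in the right half */
    for (int i = 0; i < 32; ++i) {
        uint64_t carry = 0;
        if (product & 1u) {              /* test the current multiplier bit */
            uint64_t high = (product >> 32) + multiplicand;   /* 33-bit sum */
            carry = high >> 32;
            product = (high << 32) | (product & 0xFFFFFFFFu);
        }
        /* shift the register right, pulling the carry into bit 63 */
        product = (product >> 1) | (carry << 63);
    }
    return product;
}
```

- The add and the shift use disjoint hardware and the multiplicand never moves, which is why one cycle per iteration is plausible for this version.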
- Signed Multiplication
ls-type:: annotation
hl-page:: 236
hl-color:: yellow
id:: 64435d8f-d70f-4968-be40-ddbf8ef5a19e
- The easiest solution is to first convert all operands to positive and calculate the sign separately; after the multiplication, convert the product to its correct sign.
- The refined version is ready to deal with signed multiplication by the following 2 steps:
- Enable sign extension on right shift of product register.
- Subtract rather than add on the last partial product. This operation originates from ((64414f1b-142c-4301-a5a2-6dc0ad3b102b))
- Then we can get a 32-bit product in the lower word of the product register.
- Faster Multiplication
ls-type:: annotation
hl-page:: 236
hl-color:: yellow
id:: 6443dd4d-c534-4911-9e81-3b4b0ef396d9
collapsed:: true
- A balance between resource and speed
- FIGURE 3.7 Fast multiplication hardware. hl-page:: 237 ls-type:: annotation id:: 6443e016-d955-4433-9956-46ad682890ae hl-color:: yellow collapsed:: true
- There are many other ways to implement a multiplier circuit, such as Array Multiplier using Carry-Save Addition, or pipeline it, or booth.
- Principle of Booth's algorithm
- The simplest Radix-2 Booth multiplier is based on the following observation (again), with $A_{-1} = 0$:
$$\begin{aligned}
A &= A_{n-1}A_{n-2}\dots A_{1}A_{0} \\
&= -A_{n-1}\times 2^{n-1} + \sum_{i=0}^{n-2} A_{i}\times 2^{i} \\
&= -A_{n-1}\times 2^{n-1} + (2-1)\sum_{i=0}^{n-2} A_{i}\times 2^{i} \\
&= (A_{n-2}-A_{n-1})\cdot 2^{n-1} + (A_{n-3}-A_{n-2})\cdot 2^{n-2} + \cdots + (A_{0}-A_{1})\cdot 2^{1} + (A_{-1}-A_{0})\cdot 2^{0}
\end{aligned}$$
- When $A_{i-1} = A_{i}$, the term is 0. Thus, the Radix-2 Booth algorithm examines the 2 LSBs $A_{i}A_{i-1}$ and decides which operation to perform (shift only (`00`/`11`), add (`01`), or subtract (`10`)).
- Extending to Radix-4 (merging the terms for $2^{2k+1}$ and $2^{2k}$), each term looks like this:
$$(-2A_{2k+1}+A_{2k}+A_{2k-1})\times 2^{2k}$$
And we will have a more complicated operation table since the algorithm examines 3 bits.
- The Radix-4 Booth algorithm halves the number of partial products, thus improving the performance.
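A minimal C sketch of the Radix-2 recoding, to check the formula rather than to model real hardware. The function name `booth_radix2` and the variables are my own; each step looks at the bit pair (current bit, previous bit) and adds, subtracts, or skips the shifted multiplicand.

```c
#include <stdint.h>

/* Radix-2 Booth multiplication: sums (A[i-1] - A[i]) * 2^i * multiplicand. */
int64_t booth_radix2(int32_t multiplicand, int32_t multiplier) {
    uint32_t bits = (uint32_t)multiplier;  /* two's-complement bit pattern A[31..0] */
    int64_t product = 0;
    int previous = 0;                      /* A[-1] = 0 */
    for (int i = 0; i < 32; ++i) {
        int current = (int)((bits >> i) & 1u);
        int64_t term = (int64_t)multiplicand * ((int64_t)1 << i);
        if (current == 0 && previous == 1)
            product += term;               /* pair 01: add */
        else if (current == 1 && previous == 0)
            product -= term;               /* pair 10: subtract */
        /* pairs 00 and 11: shift only */
        previous = current;
    }
    return product;
}
```

For the multiplier 7 (`0111`) this performs one subtraction at bit 0 and one addition at bit 3, i.e. $8 - 1$, instead of three additions.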
- Names of the operands:
- Division
ls-type:: annotation
hl-page:: 238
hl-color:: yellow
id:: 6443e24f-a691-45aa-809c-2e01aca20e0b
collapsed:: true
- $\text{dividend} = \text{quotient} \times \text{divisor} + \text{remainder}, \text{divisor} \gt \text{remainder}$
collapsed:: true
- As for signed division, watch out for the remainder. There may be more than one seemingly reasonable pair of (quotient, remainder). One general rule for this is that, remainder has the same sign as the dividend.
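A small sketch of that rule in C, assuming a non-zero divisor and a quotient that fits in 32 bits. The type `divmod_t` and the helper names are hypothetical: the magnitudes are divided as unsigned numbers and the signs are fixed afterwards, with the remainder taking the sign of the dividend.

```c
#include <stdint.h>

typedef struct {
    int32_t quotient;
    int32_t remainder;
} divmod_t;

/* Absolute value as an unsigned number (also safe for INT32_MIN). */
static uint32_t magnitude(int32_t x) {
    return x < 0 ? 0u - (uint32_t)x : (uint32_t)x;
}

/* Assumes divisor != 0 and that the quotient fits in int32_t. */
divmod_t signed_divide(int32_t dividend, int32_t divisor) {
    int64_t q = magnitude(dividend) / magnitude(divisor);
    int64_t r = magnitude(dividend) % magnitude(divisor);
    divmod_t out;
    /* Quotient is negative when the operand signs differ. */
    out.quotient = (int32_t)(((dividend < 0) != (divisor < 0)) ? -q : q);
    /* Remainder takes the sign of the dividend. */
    out.remainder = (int32_t)((dividend < 0) ? -r : r);
    return out;
}
```

For example, $-7 \div 2$ gives quotient $-3$ and remainder $-1$, so that $-7 = (-3) \times 2 + (-1)$.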
- A Division Algorithm and Hardware (Unsigned)
hl-page:: 238
ls-type:: annotation
id:: 6443e2eb-9d81-471b-9033-0e325140b4f2
hl-color:: yellow
collapsed:: true
- Naive version

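- Pseudo code
```c
#include <stdint.h>

// Restoring division of A by B (assumes B != 0).
// Results are returned through the Quotient and Remainder pointers.
void div(uint32_t A, uint32_t B, uint32_t *Quotient, uint32_t *Remainder) {
    uint64_t divisor = (uint64_t)B << 32;  // widen first, then place in the upper half
    uint64_t remainder = A;
    uint32_t quotient = 0;
    for (int i = 0; i < 33; ++i) {
        uint64_t trial = remainder - divisor;  // 1. try subtract (wraps around on borrow)
        if (remainder >= divisor) {            // no borrow: the result is non-negative
            remainder = trial;
            quotient = (quotient << 1) | 1;    // 2.a. it fits: shift in a 1
        } else {
            quotient = quotient << 1;          // 2.b. cannot subtract: shift in a 0, remainder is restored (kept)
        }
        divisor >>= 1;                         // 3. next bit
    }
    *Quotient = quotient;
    *Remainder = (uint32_t)remainder;
}
```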
- Refined version
collapsed:: true

- Uses fewer resources: only one 64-bit register is needed, which holds `0 | Dividend` at initialization and `Remainder | Quotient` after 32 cycles (32 left shifts).
- A working SystemVerilog implementation
```systemverilog
module divider(
    input  logic        clk,
    input  logic        rst,
    input  logic        en,
    input  logic [31:0] operandA,
    input  logic [31:0] operandB,
    output logic        operation_valid,
    output logic        busy,
    output logic [63:0] result
);
    // unsigned divider
    parameter COUNT = 6'd32;

    logic [5:0]  count;
    logic [31:0] divisor;
    logic [31:0] alu_result;
    logic        restore;
    logic [63:0] remainder;

    always @(posedge clk, negedge rst)
        if (!rst) divisor <= 32'b0;
        else if (en && count == 0) divisor <= operandB;

    always @(posedge clk, negedge rst)
        if (!rst) remainder <= 64'b0;
        else if (en && count == 0) remainder <= {32'b0, operandA};
        else if (busy) begin
            if (restore) remainder <= remainder << 1;
            else remainder <= {alu_result[30:0], remainder[31:0], 1'b1};
        end

    always @(posedge clk, negedge rst)
        if (!rst || !en) count <= 0;
        else if (count == 0) count <= 1;
        else if (count < COUNT) count <= count + 1;
        else count <= 0;

    assign operation_valid = (operandB == 32'b0 && en) ? 1'b0 : 1'b1;
    assign busy = en && count;
    assign alu_result = remainder[63:32] - divisor;
    assign restore = remainder[63:32] < divisor;
    assign result = {alu_result[31:0], remainder[30:0], ~restore};
    // due to implementation issues, the remainder part will be over-shifted in the end
    // here is a workaround
endmodule
```
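- Unsigned division can also use the add/subtract alternating method (Non-restoring Division), a simple improvement. The computer organization textbooks used in China all explain it with fixed-point fractions; to apply it to integers, the Divisor must first be shifted left by N bits. Apparently non-restoring division is a special case of the SRT method. I feel I still have not really understood it; this topic could fill a separate course, but never mind, being able to solve the exercises is enough.
- My recommendation: read this. COMPUTER ARITHMETIC : Algorithms and Hardware Designs
- Two's-complement division (courtesy of the 408 exam)
- Add/subtract alternating method: the sign bit and the magnitude bits take part in the computation together (everything is in two's complement), and the sign of the quotient forms naturally.
- First perform one addition or subtraction: if Dividend and Divisor have the same sign, subtract; otherwise add.
- Then repeat N times: if Remainder and Divisor have the same sign, the quotient bit is 1, shift Remainder left and subtract Divisor; otherwise the quotient bit is 0, shift Remainder left and add Divisor
- In the last step, the final Quotient bit is always set to 1
- To be honest, I have not understood this, and working it out by hand does not seem right either; ==I do not know where it goes wrong==, so I will look at it again another day. The design of this thing is quite fun, though.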
- Faster Division
ls-type:: annotation
hl-page:: 243
hl-color:: yellow
id:: 6444043e-8c74-474e-a4f0-b2011ebb9b10
- As with multipliers, there are many ways to build a divider. However, unlike a multiplier, a divider cannot simply use an array of adders, since it cannot be known ahead of time whether each subtraction will succeed. There is a method based on a lookup table and prediction, called SRT division.
- Floating Point
ls-type:: annotation
hl-page:: 245
hl-color:: yellow
id:: 644410ff-f158-431d-ab11-22f32e63a6da
collapsed:: true
- Normalized number: a number in scientific notation without leading 0s.
- binary point: the point, but in base 2 hl-page:: 245 ls-type:: annotation id:: 644494f0-4e7e-4356-bf49-5d7f61999a78 hl-color:: yellow
- floating point normalized form: $1.xxxxxxxx_{\text{two}} \times 2^{yyyy}$. Since there are no leading 0s, the only bit to the left of the binary point is 1.
- Floating-Point Representation
ls-type:: annotation
hl-page:: 246
hl-color:: yellow
id:: 644495bf-6828-4439-a332-f5a9a176adaf
- A single-precision floating point has 32 bits, `1 | 8 | 23` bit(s) for the 3 components `s | exponent | fraction`
- A double-precision floating point has 64 bits, `1 | 11 | 52` bits for the 3 components.
- General form of floating-point numbers: $(-1)^{\text{s}} \times \text{F} \times 2^{\text{E}}$
- overflow: the exponent is too large hl-page:: 247 ls-type:: annotation id:: 6444979f-bc17-4cf6-9a68-0a7098db96b6 hl-color:: yellow
- underflow: the ==negative== exponent is too large hl-page:: 247 ls-type:: annotation id:: 644497a1-f169-4386-8e44-cf46566f6d47 hl-color:: yellow
- significand: the 24-bit or 53-bit number composed of the implicit leading 1 and the fraction.
- IEEE 754 encoding of floating-point numbers.
hl-page:: 248
ls-type:: annotation
id:: 6444991d-e3c9-4161-863e-2c65db02a575
hl-color:: yellow
- Represent 0: Since 0 has no leading 1, a reserved exponent `0` is there to represent the number.
- Represent Infinity/NaN: Two unusual cases are given another reserved exponent `255/2047`, representing infinity (fraction = 0) and NaN (fraction != 0)
- Biased notation: To simplify the sorting of floating-point numbers, the exponent field is designed to be an unsigned integer. But we also have to represent negative exponents, thus the exponent field is biased by `127/1023`. In other words, the real value of the exponent is the exponent field minus the bias.
- The real value of an IEEE-754 floating-point (==normalized==) number could be expressed as: $(-1)^{\text{s}}\times (1+\text{Fraction})\times 2^{\text{Exponent-Bias}}$
- Ranges from $\pm 1.00\dots00_{\text{two}}\times 2^{-126}$ to $\pm 1.11\dots11_{\text{two}}\times 2^{+127}$
- de-normalized numbers: The exponent field is `0`, but the actual exponent is `-126/-1022`. And there is no implicit leading 1. This form can represent a number as small as $0.00\dots01_{\text{two}} \times 2^{-126} = 1.0_{\text{two}}\times 2^{-149}$ hl-page:: 271 ls-type:: annotation id:: 6444a660-b04c-4dd5-9f03-1df86f5feaf2 hl-color:: yellow collapsed:: true
- However, this prevents FPUs from getting faster, so some architects raise exceptions for de-normalized IEEE-754 numbers (they just don't implement such support)
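To see the encoding on a real value, here is a small C sketch that splits a `float` into its three fields. It assumes that `float` is an IEEE 754 single-precision number on the target machine; the type `float_fields_t` and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8-bit biased exponent field */
    uint32_t fraction;  /* 23-bit fraction field */
} float_fields_t;

/* Assumes float is IEEE 754 single precision (32 bits). */
float_fields_t decode_float(float value) {
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);  /* copy the bit pattern without aliasing problems */
    float_fields_t f;
    f.sign = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}

/* Real exponent: field minus bias 127, or -126 for zero and de-normalized numbers. */
int real_exponent(float_fields_t f) {
    return f.exponent == 0 ? -126 : (int)f.exponent - 127;
}
```

For instance, $-0.75 = -1.1_{\text{two}} \times 2^{-1}$ decodes to sign 1, exponent field 126 and fraction `0x400000`.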
- Floating-Point Addition
ls-type:: annotation
hl-page:: 252
hl-color:: yellow
id:: 64449f20-2e2b-4aba-a152-fb5013cee9df
- FIGURE 3.14 Floating-point addition. ls-type:: annotation hl-page:: 254 hl-color:: yellow id:: 6444a190-55ce-4d72-b776-f65bd79c402e
- (1) Compare the exponents of the 2 numbers. Shift the smaller number to the right until its exponent would match the larger one
- (2) Add the significands
- (3) Normalize the sum, either `rsh` and `exp++` or `lsh` and `exp--`
- Check overflow/underflow
- (4) Round the significand
- Check if the result is normalized, in case rounding adds to the MSB. If not, go to (3).
- FIGURE 3.15 Block diagram of an arithmetic unit dedicated to floating-point addition. ls-type:: annotation hl-page:: 256 hl-color:: yellow id:: 6444a33a-1be3-46cf-88d7-bda594ff889b
- Floating-Point Multiplication
ls-type:: annotation
hl-page:: 255
hl-color:: yellow
id:: 6444a351-91c2-4d55-b74d-e9b3bf0b1f3f
- FIGURE 3.16 Floating-point multiplication. ls-type:: annotation hl-page:: 258 hl-color:: yellow id:: 6444a815-7f89-4638-9bc0-a2539cf80a27
- (1) Add the biased exponents of the 2 numbers (and subtract one bias since it is added twice) to get the new exponent field
- (2) Multiply the significands
- Different from addition, ==exponent alignment is not needed==. Directly multiply the significands.
- (3) Normalize and check over/underflow
- (4) Round the significand to the appropriate number of bits, and check normalized (or go to (3))
- (5) Set the sign of the product
- Floating-Point Instructions in MIPS
ls-type:: annotation
hl-page:: 260
hl-color:: yellow
id:: 6444a4ea-ad13-4a57-a881-cb7cbae7f65f
- Special instructions: arithmetic (single/double) `add.s/d`, comparison `c.eq.s/d`, branch `bc1t/bc1f`, data transfer `lwc1/swc1`
- Floating-point registers: `$f0` to `$f31`, each 32-bit. A double-precision register is actually an even-odd pair of single-precision registers (e.g., double register `$f2` = {`$f2`, `$f3`})
- Accurate Arithmetic
ls-type:: annotation
hl-page:: 267
hl-color:: yellow
id:: 6444ac21-f605-497c-b86e-4d36484f6e3a
- Keep 2 extra bits on the right during ==intermediate additions==, since hardware cannot hold infinitely many bits for intermediate results. They are called guard and round.
- sticky bit: a third bit which indicates whether there are any non-zero bits to the right of the round bit
- units in the last place (ulp): The number of bits in error in the LSBs (right-most bits) of the significand between the actual number and the rounded number. ==Measure of accuracy==. hl-page:: 268 ls-type:: annotation id:: 6444af4d-4e28-41ab-97b9-a4442cbe9d9a hl-color:: yellow
- IEEE 754 has 4 rounding modes: always round up (toward $+\infty$), always round down (toward $-\infty$), truncate, and round to nearest even. hl-page:: 268 ls-type:: annotation id:: 6444b015-6430-4fce-b780-988a0caa63f5 hl-color:: yellow
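How guard, round and sticky drive round-to-nearest-even can be sketched in C. The layout is my own assumption: `extended` holds the significand followed by the guard, round and sticky bits in its 3 lowest positions, and `round_nearest_even` is a made-up name.

```c
#include <stdint.h>

/* `extended` = significand bits followed by guard, round, sticky (the 3 LSBs). */
uint32_t round_nearest_even(uint32_t extended) {
    uint32_t kept   = extended >> 3;         /* bits that survive rounding */
    uint32_t guard  = (extended >> 2) & 1u;  /* first dropped bit: worth half an ulp */
    uint32_t round  = (extended >> 1) & 1u;
    uint32_t sticky = extended & 1u;         /* OR of everything further right */
    /* Round up when the dropped part is more than half an ulp,
     * or exactly half and the kept value is odd (tie goes to even). */
    if (guard && (round || sticky || (kept & 1u)))
        kept += 1;  /* may carry into a new MSB; the caller would renormalize */
    return kept;
}
```

With `1010 100` (an exact tie) the result stays `1010`, while `1011 100` rounds up to `1100`.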
- Parallelism and Computer Arithmetic: Subword Parallelism
ls-type:: annotation
hl-page:: 271
hl-color:: yellow
id:: 6444bd6c-1170-4f97-932b-af66246ff486
collapsed:: true
- Many multimedia applications use 8-bit or 16-bit data units, thus the processor can perform simultaneous operations on short vectors of these smaller operands (which are stored in a single word-size register)
- subword parallelism, data level parallelism, SIMD
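Even without SIMD instructions, the idea can be imitated in plain C. The sketch below adds four 8-bit lanes packed in one 32-bit word while keeping carries from crossing lane boundaries; the function name `add_u8x4` is hypothetical, and a real SIMD unit does this partitioning in hardware.

```c
#include <stdint.h>

/* Lane-wise addition of four packed 8-bit values (each lane wraps modulo 256). */
uint32_t add_u8x4(uint32_t a, uint32_t b) {
    const uint32_t low7 = 0x7F7F7F7Fu;  /* low 7 bits of every lane */
    const uint32_t msb  = 0x80808080u;  /* top bit of every lane */
    /* Add the low 7 bits; any carry stops at bit 7 of its own lane. */
    uint32_t partial = (a & low7) + (b & low7);
    /* Add the top bits with XOR so no carry leaks into the next lane. */
    return partial ^ ((a ^ b) & msb);
}
```

For example, `0x01FF7F10 + 0x01010101` gives `0x02008011`: the lane `0xFF + 0x01` wraps to `0x00` without disturbing its neighbour.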
- Real Stuff: Streaming SIMD Extensions and Advanced Vector Extensions in x86
ls-type:: annotation
hl-page:: 273
hl-color:: yellow
id:: 6444bda7-e1ae-495b-86eb-5507878df5d8
collapsed:: true
- multiple floating-point operands packed into a single 128-bit SSE2 register hl-page:: 273 ls-type:: annotation id:: 6444bfc1-7314-4629-ae26-defc51e37a96 hl-color:: yellow
- load and store multiple operands per instruction, perform arithmetic operations on multiple operands
- Going Faster: Subword Parallelism and Matrix Multiply
ls-type:: annotation
hl-page:: 274
hl-color:: yellow
id:: 6444bdac-2bc0-490a-ac86-88c702b83c65
collapsed:: true
- DGEMM: Double precision GEneral Matrix Multiply. A commonly used program for demonstration. hl-page:: 274 ls-type:: annotation id:: 6444c0b0-7629-423c-b600-03286f22c6bd hl-color:: yellow
- An interesting example of how to use SIMD to speed up matrix multiply.
```c
#include <x86intrin.h>
#define UNROLL (4)  /* unroll factor */

/* The AVX versions need n to be a multiple of 4 (or UNROLL*4)
   and the matrices to be 32-byte aligned. */
void dgemm(int n, double* A, double* B, double* C) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double cij = C[i+j*n];               /* cij = C[i][j] */
            for (int k = 0; k < n; k++)
                cij += A[i+k*n] * B[k+j*n];      /* cij += A[i][k]*B[k][j] */
            C[i+j*n] = cij;                      /* C[i][j] = cij */
        }
}

void dgemm_AVX(int n, double* A, double* B, double* C) {
    for (int i = 0; i < n; i += 4)
        for (int j = 0; j < n; j++) {
            __m256d c0 = _mm256_load_pd(C+i+j*n);    /* c0 = C[i][j] */
            for (int k = 0; k < n; k++) {
                __m256d b = _mm256_broadcast_sd(B+k+j*n);
                c0 = _mm256_add_pd(c0, _mm256_mul_pd(_mm256_load_pd(A+i+k*n), b));
            }                                        /* c0 += A[i][k]*B[k][j] */
            _mm256_store_pd(C+i+j*n, c0);            /* C[i][j] = c0 */
        }
}

void dgemm_AVX_UNROLL(int n, double* A, double* B, double* C) {
    for (int i = 0; i < n; i += UNROLL*4)
        for (int j = 0; j < n; j++) {
            __m256d c[UNROLL];
            for (int x = 0; x < UNROLL; x++)
                c[x] = _mm256_load_pd(C+i+x*4+j*n);
            for (int k = 0; k < n; k++) {
                __m256d b = _mm256_broadcast_sd(B+k+j*n);
                for (int x = 0; x < UNROLL; x++)
                    c[x] = _mm256_add_pd(c[x], _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
            }
            for (int x = 0; x < UNROLL; x++)
                _mm256_store_pd(C+i+x*4+j*n, c[x]);
        }
}
```
- Fallacies and Pitfalls
ls-type:: annotation
hl-page:: 278
hl-color:: yellow
id:: 6444bdb2-6eb9-44b2-afcf-0a844265f162
collapsed:: true
- Pitfall: ==Floating-point addition is not associative==. ls-type:: annotation hl-page:: 278 hl-color:: yellow id:: 6444c121-8df1-44b0-8077-5258bb7e907d
- Parallel execution strategies that work for integer data types do ==NOT always work for floating-point== data types. hl-page:: 279 ls-type:: annotation id:: 6444c1f0-95bb-4cb4-b82c-b68e52f8b53b hl-color:: yellow
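A tiny C sketch of the pitfall, assuming the compiler does not reassociate floating-point operations (e.g. no fast-math style flags). The function names are made up; the operands $-1.5 \times 10^{38}$, $1.5 \times 10^{38}$ and $1.0$ are chosen so that the small term is absorbed when it is added to a huge one first.

```c
/* Same three operands, two different groupings. */
float sum_left_first(float a, float b, float c) {
    return (a + b) + c;
}

float sum_right_first(float a, float b, float c) {
    return a + (b + c);
}
```

With `a = -1.5e38f`, `b = 1.5e38f`, `c = 1.0f`, the left-first grouping yields `1.0` while the right-first grouping yields `0.0`, because `b + c` rounds back to `b`.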
- Pitfall: The MIPS instruction add immediate unsigned (addiu) ==sign-extends== its 16-bit immediate field. ls-type:: annotation hl-page:: 279 hl-color:: yellow id:: 6444c1c5-55d6-4ee4-a522-7775b374d047
- Word List 3
collapsed:: true
- quirk: a peculiar trait (or behavior); an oddity ls-type:: annotation hl-page:: 227 hl-color:: green id:: 64433ef2-0591-499f-b056-b2664d81156e
- vex: to annoy; to irritate; to worry hl-page:: 232 ls-type:: annotation id:: 64434f3c-f512-4f58-a6f2-05e7aa355ec6 hl-color:: green
- vague: unclear; imprecise; indistinct ls-type:: annotation hl-page:: 267 hl-color:: green id:: 6444aca4-859a-4303-b126-5b4fb95801ea
- equitable: fair, reasonable hl-page:: 268 ls-type:: annotation id:: 6444ae4a-b788-41ca-bf5a-f3e65eae3f37 hl-color:: green
- quandary: a state of perplexity; a predicament - dilemma hl-page:: 279 ls-type:: annotation id:: 6444c22c-5633-40b1-9a04-dc9ac66b5cbd hl-color:: green
- glitch: a small fault; a minor error hl-page:: 281 ls-type:: annotation id:: 6444c19b-acea-4882-994b-06a1c34e86fb hl-color:: green
-
The Processor
ls-type:: annotation hl-page:: 291 hl-color:: yellow id:: 6444be95-7530-4e08-bdf8-c05ea9cc5b9d - Introduction
ls-type:: annotation
hl-page:: 293
hl-color:: yellow
id:: 6444e82b-27ec-4755-9dd9-e1d543ab66d3
collapsed:: true
- A Basic MIPS Implementation
ls-type:: annotation
hl-page:: 293
hl-color:: yellow
id:: 6444e833-44e0-4dd0-82f9-cdcdac95dc04
- A subset of MIPS ISA: Memory-reference (`lw` `sw`), Arithmetic (`add` `sub` `and` `or` `slt`), and Branch (`beq` `j`)
- Several common steps:
- Send the PC to memory and fetch the instruction
- Read 1 or 2 registers, using the fields of the instruction
- Except `j`, all instruction classes use the ALU after reading the registers, though for different purposes (arithmetic, address calculation, comparison)
- After the ALU, the actions required to complete the various classes of instructions differ, such as loading from or storing to memory, writing to a register, or changing the PC.
- FIGURE Abstract view of the MIPS subset's implementation hl-page:: 295 ls-type:: annotation id:: 64477e93-0b18-4efa-83c0-2c8e71b90457 hl-color:: yellow
- FIGURE Basic implementation with multiplexors/control.
hl-page:: 296
ls-type:: annotation
id:: 64477f1f-1cb6-4a55-bfad-148b4a3fb467
hl-color:: yellow
- Multiplexor: One destination may have multiple sources, and thus we need to select from these sources according to the type of the instruction.
- Control unit: accepts the instruction as input, and generates signals to control other functional units (e.g., ALU, Memory) and the multiplexors.
- Logic Design Conventions
ls-type:: annotation
hl-page:: 297
hl-color:: yellow
id:: 64477d00-f17e-40fd-8178-6e21175025a0
collapsed:: true
- Combinational elements and State elements
collapsed:: true
- For combinational, outputs depend only on the current inputs hl-page:: 297 ls-type:: annotation id:: 6447833a-b83a-40a0-98fa-59b6fa78c15c hl-color:: yellow
- State elements completely characterize the state of the computer. A state element has (at least) 2 inputs (the data value to be written and the clock) and 1 output. The clock is used to determine when to write, and a state element can be read at any time. hl-page:: 297 ls-type:: annotation id:: 644a2196-f210-49e3-bf47-5439713057aa hl-color:: yellow
- Clocking Methodology
hl-page:: 298
ls-type:: annotation
id:: 644a2259-9a11-4a2d-bbb8-bc71ef75b9ab
hl-color:: yellow
collapsed:: true
- Edge-triggered clocking: state elements are only updated on a clock edge.
- Combinational logic must have its inputs come from a set of state elements and its outputs written into a set of state elements. These inputs are values written in a previous cycle, while the outputs are values that can be used in a following clock cycle. hl-page:: 298 ls-type:: annotation id:: 644a24c1-45e7-4763-8284-25b7d875d2b6 hl-color:: yellow
- Building a Datapath
ls-type:: annotation
hl-page:: 300
hl-color:: yellow
id:: 644a22b7-e481-4b81-b97d-717955fa09f6
collapsed:: true
- Program Counter and Instruction Memory
- PC register, Adder, Instruction memory's address input and data output
- FIGURE 4.6 ls-type:: annotation hl-page:: 302 hl-color:: yellow id:: 644a5b33-a765-4cfb-a990-55668110f7cc
- R-Format
- register file: Each register can be read/written by specifying the register number
hl-page:: 301
ls-type:: annotation
id:: 644a5b95-b5ad-4568-b41c-783f59940a6c
hl-color:: yellow
- We need to read 2 registers and write 1 register, and this gives an intuition about the interface of the register file: 2 read address, 2 read output, and 1 write address, 1 write data, and an additional control signal `RegWrite` controlling whether to write.
- Write to register file is edge-triggered and with an explicit signal, while reads are combinational
- ALU: 2 32-bit inputs and a 32-bit result (as well as a 1-bit signal for zero flag). Additionally, there is a control signal `ALU Operation` hl-page:: 301 ls-type:: annotation id:: 644a5d36-fabb-4719-831d-fbcfe797b925 hl-color:: yellow
- FIGURE 4.7 ls-type:: annotation hl-page:: 302 hl-color:: yellow id:: 644a603d-78ee-4284-ab39-8ae5dfb19818
- Memory Reference
hl-page:: 303
ls-type:: annotation
id:: 644a610a-9f03-4213-a7f0-87a0d519909b
hl-color:: yellow
- Need register file and ALU to compute the target address
- Sign-Extend Unit: sign-extend the 16-bit immediate field in the instruction hl-page:: 303 ls-type:: annotation id:: 644a6150-f971-4fb1-a2be-e2fd0d068e1c hl-color:: yellow
- Data Memory: Besides the read address and read output, since it is writable, a write address, write data and a write control signal are needed as well.
- FIGURE 4.8 ls-type:: annotation hl-page:: 304 hl-color:: yellow id:: 644a6212-7e80-4d72-9242-2e719b8a43d8
- Branch
hl-page:: 303
ls-type:: annotation
id:: 644a627d-01be-4bb7-b84d-c7b66c1a48a0
hl-color:: yellow
- Comparison between 2 operand registers: re-use the ALU
- branch taken and not taken: Replace the PC with either the branch target address (taken) or the incremented PC (not taken)
- Compute branch target address: Sign-Extension and Adder
hl-page:: 303
ls-type:: annotation
id:: 644a6329-781d-4c60-b194-ac5aafba89f6
hl-color:: yellow
- Sign-extend the constant(offset) field of the instruction
- The relative base of this computation is `PC + 4`
- The offset field needs to be left-shifted by 2
- FIGURE 4.9 ls-type:: annotation hl-page:: 305 hl-color:: yellow id:: 644a64b3-3c73-48cb-b5b6-428a3cf7e961
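The branch target computation above (sign-extend, shift left by 2, add to `PC + 4`) can be written out as a short C sketch. The function names are my own, and the PC is modelled as a plain 32-bit unsigned value.

```c
#include <stdint.h>

/* Sign-extend a 16-bit immediate field to 32 bits. */
int32_t sign_extend16(uint16_t immediate) {
    int32_t value = immediate;
    if (value & 0x8000)
        value -= 0x10000;  /* the top bit of the field carries weight -2^15 */
    return value;
}

/* Branch target = (PC + 4) + (sign-extended offset << 2). */
uint32_t branch_target(uint32_t pc, uint16_t offset_field) {
    int32_t byte_offset = sign_extend16(offset_field) * 4;  /* word offset to byte offset */
    return pc + 4u + (uint32_t)byte_offset;                 /* unsigned wrap-around handles negative offsets */
}
```

An offset field of `0xFFFF` (that is, $-1$) at PC `0x1000` therefore branches back to `0x1000` itself.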
- Creating a Single Datapath
hl-page:: 305
ls-type:: annotation
id:: 644a62f9-c49b-4e03-9ef3-deafe6f9f37d
hl-color:: yellow
- execute all instructions in a Single clock cycle, so no datapath resource can be used more than once per instruction hl-page:: 305 ls-type:: annotation id:: 644a6592-a327-499f-81ae-1b4a6e7be71a hl-color:: yellow
- To ==share a datapath element== between two different instruction classes, we may need to allow multiple connections to the input of an element, using a ==multiplexor and control signal== to select among the multiple inputs.
hl-page:: 305
ls-type:: annotation
id:: 644a6f81-bceb-4923-afcb-437fcbc26b91
hl-color:: yellow
- Share an ALU for Memory-Reference and Arithmetic instructions
- Share the write-back path between `lw` and Arithmetic
- Share sign-extend unit between Branch and Memory
- A separate Add unit to calculate branch target
- FIGURE 4.11 The simple datapath for the core MIPS architecture combines the elements required by different instruction classes. ls-type:: annotation hl-page:: 307 hl-color:: yellow id:: 644a6fb3-921e-4395-b132-f4be1a542268
- A Simple Implementation Scheme
ls-type:: annotation
hl-page:: 308
hl-color:: yellow
id:: 644a6306-3290-4abd-b506-7234340a9977
collapsed:: true
- With the datapath construction above, we add a control function to complete the implementation.
- The ALU Control
ls-type:: annotation
hl-page:: 308
hl-color:: yellow
id:: 644a7005-84e0-47ae-8f10-0bc02cd041f3
- multiple levels of decoding
hl-page:: 309
ls-type:: annotation
id:: 644a72c6-9a8f-407f-95df-031a6598d422
hl-color:: yellow
- Main Control generates `ALUOp` which indicates the instruction class (Memory, Branch, Arithmetic). The `ALUOp` together with the `funct` field then generates the actual signals that control the ALU
- This technique leads to a smaller control unit, which is potentially faster
- A truth table that maps Instructions to the ALU control input
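The second decoding level can be sketched as a C function. The encodings follow the textbook's MIPS subset as far as I recall them (ALU control `0010` add, `0110` subtract, `0000` AND, `0001` OR, `0111` set on less than); the function name and the `0xF` value returned for unsupported inputs are my own placeholders.

```c
#include <stdint.h>

/* alu_op: 2-bit ALUOp from the main control (00 = lw/sw, 01 = beq, 10 = R-type).
 * funct:  6-bit funct field of the instruction (only used for R-type). */
uint8_t alu_control(uint8_t alu_op, uint8_t funct) {
    if (alu_op == 0x0) return 0x2;  /* lw/sw: add to form the address */
    if (alu_op == 0x1) return 0x6;  /* beq: subtract to compare */
    switch (funct & 0x3F) {         /* R-type: decided by funct */
        case 0x20: return 0x2;      /* add */
        case 0x22: return 0x6;      /* sub */
        case 0x24: return 0x0;      /* and */
        case 0x25: return 0x1;      /* or  */
        case 0x2A: return 0x7;      /* slt */
        default:   return 0xF;      /* hypothetical marker for "not supported" */
    }
}
```

Note that for `lw`/`sw` and `beq` the `funct` bits are don't-cares, which is what keeps the main control small.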
- Designing the Main Control Unit
ls-type:: annotation
hl-page:: 310
hl-color:: yellow
id:: 644a71a0-c19d-40aa-bef1-eefdb8b311df
- The input of this "function" is the 6-bit `OP` field of Instruction, and the outputs are the control signals, except for `ALUOp` (explained above) and `PCSrc`
- The `PCSrc` signal selects the next PC, which cannot be decided from the Instruction only. Comparison result of the 2 operands is needed in combination with the `OP` field to control the multiplexor connected with PC.
- Why a Single-Cycle Implementation Is Not Used Today
ls-type:: annotation
hl-page:: 320
hl-color:: yellow
id:: 644a7930-5ef2-44e8-a468-9315b8cad591
- We must assume that the clock cycle is equal to the ==worst-case delay== for all instructions, which violates the principle of ==making the common case fast==. hl-page:: 321 ls-type:: annotation id:: 644a795f-d067-4a11-ba2d-8e37d43e9b1d hl-color:: yellow
- An Overview of Pipelining
ls-type:: annotation
hl-page:: 321
hl-color:: yellow
id:: 644a7223-c882-446f-adf5-3b819386c6c2
collapsed:: true
- Speedup from pipelining
- Under ideal conditions (e.g., the stages are perfectly balanced) and with a large number of instructions, the speed-up from pipelining is approximately equal to the ==number of pipe stages== hl-page:: 324 ls-type:: annotation id:: 644a7f5b-f357-4235-b3cf-034877a5946d hl-color:: yellow
- Pipelining improves performance by ==increasing instruction throughput==, as opposed to decreasing the execution time of an individual instruction, but instruction throughput is the important metric because ==real programs execute billions of instructions==. hl-page:: 326 ls-type:: annotation id:: 644a817b-8d05-418d-829b-8952b2fa2f5c hl-color:: yellow
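A back-of-the-envelope model of the speed-up in C, under idealized assumptions of my own: every stage takes exactly one cycle, there are no hazards, and the unpipelined machine needs `stages` cycles per instruction. The function name is hypothetical.

```c
/* Ideal speed-up of a pipeline with `stages` balanced stages over `instructions` instructions. */
double pipeline_speedup(unsigned stages, unsigned long long instructions) {
    /* Unpipelined: each instruction occupies the whole datapath for `stages` cycles. */
    double unpipelined_cycles = (double)stages * (double)instructions;
    /* Pipelined: `stages` cycles to fill, then one instruction completes per cycle. */
    double pipelined_cycles = (double)stages + (double)instructions - 1.0;
    return unpipelined_cycles / pipelined_cycles;
}
```

With 5 stages a single instruction gains nothing (speed-up 1), while a million instructions approach the ideal factor of 5.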
- Designing Instruction Sets for Pipelining
ls-type:: annotation
hl-page:: 326
hl-color:: yellow
id:: 644a8196-0384-44b9-b141-c2e78451cb7e
- Aligned instructions; Regular instruction formats; Restricted memory operands (only load/store); Aligned operands (data address)
- Pipeline Hazards
ls-type:: annotation
hl-page:: 326
hl-color:: yellow
id:: 644a80a5-e321-4d67-a16a-98fd02db800c
- hazards: There are situations in pipelining when the ==next instruction cannot execute in the following clock cycle==. hl-page:: 326 ls-type:: annotation id:: 644a85c9-d4a3-40e1-abca-0bf71ae67004 hl-color:: yellow
- Structural Hazard
hl-page:: 326
ls-type:: annotation
id:: 644a8737-93a8-474b-8029-1eff257568e0
hl-color:: yellow
- Hardware cannot support the combination of instructions (that we want to execute in the same clock cycle) hl-page:: 326 ls-type:: annotation id:: 644a8932-4413-4859-866c-c008e57947bd hl-color:: yellow
- Example: Assume that we do not have separate instruction and data memory, then a `lw` will monopolize the memory bus and thus prevent an instruction fetch in the same cycle (resulting in a bubble in the Fetch stage).
- Data Hazards
ls-type:: annotation
hl-page:: 327
hl-color:: yellow
id:: 644a8a16-8e43-4d26-a968-8af915459447
collapsed:: true
- The pipeline must be stalled because one step must wait for another to complete. More specifically, the dependence of one instruction on an earlier one that is still in the pipeline. hl-page:: 327 ls-type:: annotation id:: 644a8a51-e14d-4e22-91d9-6f4382e713c2 hl-color:: yellow
- Example:
```
add $s0, $t0, $t1
sub $s1, $s0, $t3
; The add is in EX, and sub is in ID.
; The result of add isn't yet written back to reg file($s0),
; while sub needs that result
```
- Solution: forwarding(bypassing). Directly feed the missing result to the next instruction rather than wait for the result being written back to reg file.
hl-page:: 327
ls-type:: annotation
id:: 644a8c19-dd9f-4025-9fc7-f7e6881f1ae6
hl-color:: yellow
collapsed:: true
- Forwarding cannot solve all data hazards, e.g., in a load-use case, the following instruction has to wait for data being fetched from memory.
- pipeline stall ls-type:: annotation hl-page:: 329 hl-color:: yellow id:: 644a8e47-9fcf-4944-8036-1c831ec18918
- Another solution: re-ordering the instructions.
- Control(Branch) Hazards
hl-page:: 330
ls-type:: annotation
id:: 644a8dc0-abb5-4c17-bece-30484d124bc7
hl-color:: yellow
- Make a decision based on the results of one instruction while others are executing.
hl-page:: 330
ls-type:: annotation
id:: 644a8f6f-732d-41f9-ab03-66062a7b157b
hl-color:: yellow
- The branch class instruction. The pipeline cannot know what the next instruction should be until the branch is resolved.
- In the case of the classical MIPS 5-stage pipeline, a branch leads to a 1-cycle stall (C0: Fetch branch; C1: Decode branch, and the ALU combinational circuit already resolved the branch; C2: Fetch new instruction according to the result of ALU, and this result is written to EX stage Flip-Flop).
- Solution: prediction
hl-page:: 332
ls-type:: annotation
id:: 644a92cc-7301-491e-961a-64310ef0252b
hl-color:: yellow
- The simplest policy is to predict that each branch is not taken or taken.
- Dynamic hardware predictors make guesses depending on the behavior of each branch. For example, keep a history for each branch and use the recent past behavior to predict.
- When the prediction fails, the pipeline needs to neutralize the wrongly fetched instruction(s) and restart from the correct address.
- Another solution: delayed decision
hl-page:: 333
ls-type:: annotation
id:: 644a9505-08fd-47ae-b202-3f2f2534dd90
hl-color:: yellow
- Place an instruction that is not affected by the branch immediately after the branch instruction, and always execute it.
- Pipelined Datapath and Control
ls-type:: annotation
hl-page:: 335
hl-color:: yellow
id:: 644a8636-5eb3-45d2-ab01-a0acd29747b0
collapsed:: true
- Five stages: IF, ID, EX, MEM, WB
- Pipeline Registers: Instruction Memory can be used only in one of these stages. To retain the value of an individual instruction for its other 4 stages, the value must be saved in a register. Therefore, there are 4 registers between the 5 stages.
hl-page:: 337
ls-type:: annotation
id:: 644bcbf6-66f0-4f82-b542-bf07ceb22716
hl-color:: yellow
- PC can be regarded as the register before IF stage, and the RF can be regarded as the register after WB
- Any information needed in a later stage must be ==passed to that stage via a pipeline register== hl-page:: 339 ls-type:: annotation id:: 644bcd54-fb9c-4d04-a133-9397d29534b8 hl-color:: yellow
- Each logical component (ALU, RF, etc.) can be used only within a single pipeline stage. Otherwise, structural hazard. This naturally divides the pipeline into 5 stages. hl-page:: 343 ls-type:: annotation id:: 644bd29e-d65e-4a73-89cc-124f907838ef hl-color:: yellow
- Graphically Representing Pipelines
ls-type:: annotation
hl-page:: 345
hl-color:: yellow
id:: 644bd357-ff90-42bc-9342-dc73102adee0
- Well, nothing special, as long as you can read the figures.
- Pipelined Control
ls-type:: annotation
hl-page:: 349
hl-color:: yellow
id:: 644bd45a-f008-440b-b367-2dad2a7267fc
- Within this section, the control signals are basically identical to the Single-Cycle version, since the data-path is unchanged.
- Control signals are generated in ID stage and passed through the pipeline via intermediate registers.
- Data Hazards: Forwarding versus Stalling
ls-type:: annotation
hl-page:: 352
hl-color:: yellow
id:: 644bd711-ebfb-429c-a340-6c17191eda7d
collapsed:: true
- ID-WB Hazard: Read and write to the same register in the same cycle.
- Solution is to add a bypass inside the RF, and the hazard is eliminated.
- Dependencies with a distance of ==3 or more cycles== do not lead to a hazard.
- Forwarding to EX (ALU/Branch depends on previous result)
- Condition
- `EX/MEM.RegisterRd = ID/EX.RegisterRs/Rt`
- `MEM/WB.RegisterRd = ID/EX.RegisterRs/Rt` and `!(EX/MM hazard)`
- Here is a special case: when 2 sources of forwarding are available, we choose the latest one (`EX/MM.Result`) rather than the outdated one. A code example is as follows:
```
add $1, $1, $2
add $1, $1, $2
add $1, $1, $2
```
- Additionally check `RegWrite` and that `RegisterRd` is non-zero. We don't want to forward from an empty instruction nor from one that does not write a register.
- Three sources of input into ALU:
- ID/EX pipe-reg, normally read from RF
- EX/MM pipe-reg, forwarded from the prior ALU result
- MM/WB pipe-reg, forwarded from DM or earlier ALU result
- Forwarding to MM
- Store after Load
- `MM/WB.RegisterRd = EX/MM.RegisterRt`, since only the store-data operand (`Rt`) is still needed in MM; the base register `Rs` of the store was already consumed in EX
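The forwarding conditions for one ALU operand can be sketched as a C function. Everything here is illustrative: the function name, the flattened parameter list and the `FWD_*` constants are my own, with the values 2 and 1 mirroring the usual `10`/`01` multiplexor select encoding.

```c
#include <stdbool.h>
#include <stdint.h>

enum { FWD_NONE = 0, FWD_FROM_MEM_WB = 1, FWD_FROM_EX_MEM = 2 };

/* Selects the source of one ALU operand whose register number is `source_reg`
 * (ID/EX.RegisterRs or ID/EX.RegisterRt). */
int forward_select(uint8_t source_reg,
                   bool ex_mem_reg_write, uint8_t ex_mem_rd,
                   bool mem_wb_reg_write, uint8_t mem_wb_rd) {
    /* The newest result wins, so EX/MEM is checked first. */
    if (ex_mem_reg_write && ex_mem_rd != 0 && ex_mem_rd == source_reg)
        return FWD_FROM_EX_MEM;
    /* Reached only when there is no EX/MEM hazard for this operand. */
    if (mem_wb_reg_write && mem_wb_rd != 0 && mem_wb_rd == source_reg)
        return FWD_FROM_MEM_WB;
    return FWD_NONE;  /* use the value read from the register file */
}
```

In the triple `add $1, $1, $2` sequence both pipeline registers hold a write to `$1`, and the function picks the EX/MEM value, which is the latest one.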
- Data Hazards and Stalls
ls-type:: annotation
hl-page:: 362
hl-color:: yellow
id:: 644be99c-0fd5-44e6-9e17-769890c257a0
- ALU (immediately) after load cannot be resolved by forwarding. When the ALU needs the data, it has not yet been read from DM. In other words, the data we want only appears in `MM/WB`, while in the forwardable case, the desired data is available both in `MM/WB` and `EX/MM`.
- Hazard detection: `ID/EX.MemRead` and `ID/EX.RegisterRt = IF/ID.RegisterRs/Rt` (the destination `Rt` of the load matches a source register of the instruction being decoded)
- nops: To stall the pipeline, insert an empty instruction before the to-be-stalled instruction.
hl-page:: 363
ls-type:: annotation
id:: 644be9a8-d6e8-45d7-a345-bb57935e4ad6
hl-color:: yellow
- `load` and previous instructions flow through, we do nothing to `EXMM` and `MMWB` and RF.
- Neutralize the `arith` by multiplexor, we write 0 to `IDEX` (indeed, only control-write signals need to be cleared)
- Pause fetching instruction for 1 cycle, we disable write to `IFID` and PC.
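The load-use check can be sketched as a one-line predicate in C. The parameter names are my own shorthand for the pipeline register fields; the condition is the textbook-style test that the load in EX writes a register which the instruction in ID reads.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the instruction in ID must be stalled for one cycle. */
bool load_use_stall(bool id_ex_mem_read, uint8_t id_ex_rt,
                    uint8_t if_id_rs, uint8_t if_id_rt) {
    /* ID/EX.MemRead: the instruction in EX is a load writing register Rt. */
    return id_ex_mem_read && (id_ex_rt == if_id_rs || id_ex_rt == if_id_rt);
}
```

When it returns true, the control would clear the control signals written to `IDEX` and disable the writes to `IFID` and the PC for that cycle.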
- Control Hazards
ls-type:: annotation
hl-page:: 365
hl-color:: yellow
id:: 644bd891-f33c-43a9-87ba-b071ec181876
collapsed:: true
- Assume Branch Not Taken
ls-type:: annotation
hl-page:: 367
hl-color:: yellow
id:: 644bd898-be2d-4c8e-bec3-50823b9a7b28
- Assume the branch target is written to PC in the MM stage (as the branch instruction reaches `MM/WB`). Only `MMWB` is let through. Write 0 to `EXMM` (if we ignore the delay slot), `IDEX` and `IFID`. Since there is no control signal in `IFID`, a special signal `flush` is added.
- Moving branch resolution to ID stage
hl-page:: 367
ls-type:: annotation
id:: 644bef12-3410-4f93-a0a7-cdc121cba36d
hl-color:: yellow
- As branch resolution is brought forward, the penalty decreases to 1 instruction. Only flush `IFID`, which forms a bubble in the pipeline.
- Add a separate add-subtract unit and a compare unit in the ID stage, since they do not need many resources
- Difficulties:
- New forwarding logic in ID stage
- More stalls: 1 cycle when the branch depends on an ALU result; 2 cycles when it depends on a `load`.
- Dynamic Branch Prediction
ls-type:: annotation
hl-page:: 370
hl-color:: yellow
id:: 644bee58-04af-4372-9825-73ca79d86ba3
- Branch history table (BHT): A small memory, indexed by the lower bits of the branch instruction's address, with a bit indicating whether the branch was taken last time.
hl-page:: 370
ls-type:: annotation
id:: 644bf9bb-3d27-429c-a8f2-904365f0820e
hl-color:: yellow
- For a loop branch, the exit branch is inevitably mispredicted. When the loop is entered again, a 1-bit predictor also mispredicts the first iteration, since the last exit set its state to not taken. Therefore, the 1-bit scheme is not ideal.
- 2-bit prediction schemes
- a prediction must be wrong twice before it is changed
- For each branch, the BHT keeps an FSM with 4 states: Strong-take, Weak-take, Weak-not-take and Strong-not-take.
- branch target buffer
hl-page:: 373
ls-type:: annotation
id:: 644bfe25-cb9d-4a2c-baf2-6fdcee9c09f4
hl-color:: yellow
- Use a memory indexed by the branch instruction's address to store previous branch targets, saving the cycle otherwise spent calculating the branch target address
- correlating predictors
ls-type:: annotation
hl-page:: 373
hl-color:: yellow
id:: 644bfee0-90fc-478c-826d-24ab4df5a0ea
- Combine local behavior of a particular branch (as described above) with global information observed from the behavior of some recent number of executed branches.
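- To compare the 1-bit and 2-bit schemes from the branch history table notes, here is a small simulation of a single BHT entry. The trace (a loop branch taken 9 times and then not taken, run 3 times) and the initial "taken" state are assumptions made for illustration.
```python
def mispredictions_1bit(outcomes, state=True):
    """1-bit predictor: predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if taken != state:
            misses += 1
        state = taken  # remember only the latest outcome
    return misses


def mispredictions_2bit(outcomes, counter=3):
    """2-bit saturating counter: 0 and 1 predict not taken, 2 and 3 predict taken."""
    misses = 0
    for taken in outcomes:
        if taken != (counter >= 2):
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


trace = ([True] * 9 + [False]) * 3
print(mispredictions_1bit(trace), mispredictions_2bit(trace))  # 5 3
```
- In this trace the 2-bit counter only misses each loop exit, while the 1-bit scheme also misses the first iteration every time the loop is re-entered.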
- Exceptions
ls-type:: annotation
hl-page:: 374
hl-color:: yellow
id:: 644bfe19-3e1d-4c8b-9308-afa36426b843
collapsed:: true
- Exception Program Counter (EPC): save the address of the offending instruction hl-page:: 375 ls-type:: annotation id:: 644c8287-e87a-4b27-aa58-a48b8bbbedf9 hl-color:: yellow
- Reason for exception
- Cause register: holds a field which indicates the cause, and a single entry point is enough hl-page:: 376 ls-type:: annotation id:: 644c821f-5e4e-496e-900c-579241acdf55 hl-color:: yellow
- Vectored interrupts: different cause, different handler address hl-page:: 376 ls-type:: annotation id:: 644c8153-402f-49e9-a33e-ddfb342f09ca hl-color:: yellow
- Exceptions in a Pipelined Implementation
ls-type:: annotation
hl-page:: 376
hl-color:: yellow
id:: 644c81bd-46b7-455c-82a5-300e283528bd
- In addition to `IF.Flush`, which clears the `IF/ID` register, `ID.Flush` and `EX.Flush` are added (as well as muxes) to flush the offending instruction and the instructions after it (the offending instruction is included because, in most cases, it must be re-executed).
- When an exception is detected, the pipeline flushes these instructions, sets the next PC to the interrupt entry, and saves the offending instruction's address + 4 to EPC along with the Cause. All of this is done by combinational logic within the same cycle.
- Multiple exceptions can occur simultaneously in a cycle. The solution is to prioritize the exceptions.
hl-page:: 380
ls-type:: annotation
id:: 644cc90a-7230-4fef-92ba-4e4e8679589c
hl-color:: yellow
- And there are some bits in the Cause Register to indicate pending interrupts, so that the hardware can interrupt again after earlier ones get serviced.
- Parallelism via Instructions
ls-type:: annotation
hl-page:: 381
hl-color:: yellow
id:: 644cce08-6ba0-4c7f-8181-8453c74b4161
collapsed:: true
- Pipeline and Multiple-Issue
- Two primary responsibilities
hl-page:: 382
ls-type:: annotation
id:: 644cd47e-72cb-474f-9e9a-3d3078497428
hl-color:: yellow
collapsed:: true
- Packaging instructions into issue slots: choose the set of instructions to be issued this cycle hl-page:: 382 ls-type:: annotation id:: 644cd48f-6601-4c02-a0ee-b98f62d0360a hl-color:: yellow
- Dealing with data and control hazards: some kinds of hazards could be alleviated hl-page:: 382 ls-type:: annotation id:: 644cd494-d0c1-4529-8776-629f0ae97c5c hl-color:: yellow
- The Concept of Speculation
ls-type:: annotation
hl-page:: 382
hl-color:: yellow
id:: 644cd562-7690-4b77-9f18-39ebb9352846
collapsed:: true
- Recovery mechanisms: speculation may be incorrect
- Exception: what if the instruction executed by speculation raises an exception
- Static Multiple Issue
ls-type:: annotation
hl-page:: 383
hl-color:: yellow
id:: 644cd60e-9c4d-4851-bb99-6b4cbb757bb1
collapsed:: true
- issue packet: The set of instructions issued together in 1 cycle hl-page:: 383 ls-type:: annotation id:: 644cd795-5c31-4de1-841f-cb72e375cbc7 hl-color:: yellow
- Very Long Instruction Word (VLIW): a single instruction allowing several operations in certain pre-defined fields
hl-page:: 384
ls-type:: annotation
id:: 644cd983-c322-42e8-ad23-2e827defebd4
hl-color:: yellow
- Issue packet in static multiple issue can be seen as VLIW
- The compiler undertakes the work to eliminate hazards and optimize for branches.
- Extra resources are added to avoid structural hazards.
- Hazards are increased as the parallelism increases.
- use latency: Number of clock cycles between a `load` instruction and an instruction that can use the result of the load without stalling the pipeline. hl-page:: 386 ls-type:: annotation id:: 644cdb59-f65c-4c2d-acc2-d6f9d170a82c hl-color:: yellow
- In the 2-issue case, more instructions are prevented from using the result of a `load` without stalling, rather than only the next instruction as in the 1-issue case.
- Instructions within an issue packet cannot depend on each other, since forwarding cannot take effect in that case.
- register renaming and anti-dependence
hl-page:: 387
ls-type:: annotation
id:: 644ce676-1db2-4399-9079-22b6f3fd4edf
hl-color:: yellow
- an ordering forced purely by the ==reuse of a name==, rather than a ==real data dependence== hl-page:: 387 ls-type:: annotation id:: 644ce786-f08e-4cae-914b-00a472d9a40c hl-color:: yellow
- Such dependences can be eliminated simply by using more registers (renaming)
- Dynamic Multiple-Issue Processors
ls-type:: annotation
hl-page:: 388
hl-color:: yellow
id:: 644cddad-ba7d-49a8-98c6-135fd2f68e1c
collapsed:: true
- The simplest superscalar processors ==issue instructions in order==, and the ==hardware decides== whether 0, 1, or more instructions can issue in a given clock cycle.
hl-page:: 388
ls-type:: annotation
id:: 644ce000-671e-44ce-b00d-86ae035f3a14
hl-color:: yellow
- Compared with VLIW processors, superscalar processors guarantee correctness and fairly good performance for different compiler scheduling of code. In some VLIW designs, recompilation was required when moving across different processor models. hl-page:: 388 ls-type:: annotation id:: 644ce05f-7da8-4b02-b0df-b75533c76703 hl-color:: yellow
- Dynamic Pipeline Scheduling
ls-type:: annotation
hl-page:: 388
hl-color:: yellow
id:: 644ce149-5f2f-449d-8b6d-86d589aad536
- Dynamic pipeline scheduling ==chooses which instructions to execute== in a given clock cycle while trying to avoid hazards and stalls. This means that, instructions ==may not issue in order==. hl-page:: 388 ls-type:: annotation id:: 644ce15b-b779-4624-bce4-fd2ae0541d0d hl-color:: yellow
- Pipeline divided into 3 major units
hl-page:: 388
ls-type:: annotation
id:: 644ce1dc-44b6-40f1-ad39-c34e660c9b32
hl-color:: yellow
- Instruction fetch and issue unit: Deliver each instruction to a corresponding functional unit for execution
- Functional units: Each functional unit has reservation stations that hold its operands and operation. Once a reservation station has all its operands, the result is calculated. Finally, the result is sent to the commit unit, as well as to any other reservation stations waiting for it.
- Commit unit: Reorder buffer, holds the result until safe to deliver it to its destination.
- The combination of reorder buffer and reservation stations provides a form of register renaming.
- On issuing, all required data are buffered into reservation stations, from either RF or reorder buffer. After they are copied, these values can be freely overwritten.
- If an operand is not in RF or reorder buffer, then it would be directly bypassed from some Functional unit.
- Neither case requires re-use of a register in RF
- out-of-order execution ls-type:: annotation hl-page:: 390 hl-color:: yellow id:: 644ce915-4863-48fd-85f2-66e679623a7f
- in-order commit ls-type:: annotation hl-page:: 390 hl-color:: yellow id:: 644ce918-496d-430b-b631-212a7cf9b925
- Difficulty in sustaining full issue rate
ls-type:: annotation
hl-page:: 392
hl-color:: yellow
id:: 644ce9f1-d463-48b9-9ed7-98f218281b02
- In many applications, most dependencies cannot be alleviated (true data dependency).
- Memory hierarchy leads to stalls in memory reference instructions
- Energy Efficiency and Advanced Pipelining
hl-page:: 392
ls-type:: annotation
id:: 644ce2ec-c9bd-44a6-a51f-aaa9cf8a996b
hl-color:: yellow
- While the simpler processors are not as fast as their sophisticated brethren, they deliver better performance per joule hl-page:: 392 ls-type:: annotation id:: 644ce31a-b366-4d64-b5a3-1831372d21dc hl-color:: yellow
- Real Stuff: The ARM Cortex-A8 and Intel Core i7 Pipelines
ls-type:: annotation
hl-page:: 393
hl-color:: yellow
id:: 644cea96-46cb-48b6-bd1c-e3f43205e6cb
collapsed:: true
- ARM Cortex-A8 (ARMv7-A, 2005)
- Dynamic multiple issue, but without pipeline scheduling
- 3-stage fetch, 5-stage decode and 6-stage execution
- 2-level branch predictor: a 512-entry branch target buffer, a 4096-entry global history buffer, and an 8-entry return stack hl-page:: 394 ls-type:: annotation id:: 644cf0f0-aa33-425f-8a79-6b0d4a21d9ae hl-color:: yellow
- Intel Core i7 920 (1st Gen, 2008)
- microarchitecture: The internal organization of the processor hl-page:: 396 ls-type:: annotation id:: 644ced6c-a427-48bd-9ec2-31e925e0376f hl-color:: yellow
- Renames the architectural registers (specified in ISA) to a larger set of physical registers, by maintaining a map in between.
hl-page:: 396
ls-type:: annotation
id:: 644cee04-6588-4d06-becc-375612443e67
hl-color:: yellow
- Register renaming offers ==another approach to recovery from incorrect speculation==: simply undo the mappings that have occurred since the first incorrectly speculated instruction. hl-page:: 396 ls-type:: annotation id:: 644cef9a-7da3-441e-b1db-51045ca2d1d4 hl-color:: yellow
- Going Faster: Instruction-Level Parallelism and Matrix Multiply
ls-type:: annotation
hl-page:: 400
hl-color:: yellow
id:: 644cf33e-7f2f-444a-bf58-1107e02a4c84
- In addition to the AVX SIMD instructions from Chapter 3, apply loop unrolling to exploit ILP
- See ((6444bdac-2bc0-490a-ac86-88c702b83c65))
- Fallacies and Pitfalls ls-type:: annotation hl-page:: 434 hl-color:: yellow id:: 644cf620-c517-42e0-958b-d2c58bbbcd09
- Word List 4
collapsed:: true
- anatomy: the study of the structure of a body ls-type:: annotation hl-page:: 321 hl-color:: green id:: 644a7236-aa52-48a8-8f9a-faa5ca23fa99
- filthy: dirty; obscene hl-page:: 330 ls-type:: annotation id:: 644a8f8f-6a6e-400a-bc18-c48d24b3ab6e hl-color:: green
- stark: bleak (completely bare); harsh (as in harsh reality) hl-page:: 333 ls-type:: annotation id:: 644a935b-84fe-4103-8915-a1df599e9915 hl-color:: green
- nonetheless: even so hl-page:: 334 ls-type:: annotation id:: 644a95de-ca6f-4036-b50a-629008ebd1ad hl-color:: green
- percolate: to seep or leak through hl-page:: 367 ls-type:: annotation id:: 644bed0a-1ca6-4433-bc2a-78c9e2a15a33 hl-color:: green
- tournament: a championship competition ls-type:: annotation hl-page:: 373 hl-color:: green id:: 644bfdc9-5a90-451e-a203-745016f515fc
- clobber: to hit hard; to punish hl-page:: 377 ls-type:: annotation id:: 644c835e-e733-4552-9104-450e7190c367 hl-color:: green
- intermingled ls-type:: annotation hl-page:: 396 hl-color:: green id:: 644cec43-c523-4d68-b48d-c209e2223e18
-
Large and Fast: Exploiting Memory Hierarchy
ls-type:: annotation hl-page:: 451 hl-color:: yellow id:: 644cf7e2-0f30-4889-9edc-11ce3892141b
-
Introduction
hl-page:: 453 ls-type:: annotation id:: 644dd4f0-dd17-4ca2-9360-413444e36db3 hl-color:: yellow collapsed:: true
- principle of locality
ls-type:: annotation
hl-page:: 453
hl-color:: yellow
id:: 644dd503-3bcd-4ea2-a786-4083fa6b3162
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon. ls-type:: annotation hl-page:: 453 hl-color:: yellow id:: 644dd50b-6aaa-43a1-8d27-9d28861bdcaf
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon ls-type:: annotation hl-page:: 453 hl-color:: yellow id:: 644dd512-c64d-4d7d-802b-c3defdbc2712
- memory hierarchy
ls-type:: annotation
hl-page:: 454
hl-color:: yellow
id:: 644dd51d-6572-412f-befc-e033934ac377
- a level closer to the processor is generally a subset of any level further away, and all the data is stored at the lowest level ls-type:: annotation hl-page:: 454 hl-color:: yellow id:: 644dd57e-591e-46e3-b782-07db59abfde9
- Block (or Line): the minimum data unit, either present or not, in the two-level hierarchy hl-page:: 455 ls-type:: annotation id:: 644dd5b6-a024-4b08-93ee-0b8a9f0bf8ec hl-color:: yellow
- Hit and Miss: data requested by the processor is found / not found in some block in the upper level. The corresponding measures are the hit rate and the miss rate (`= 1 - hit_rate`) hl-page:: 455 ls-type:: annotation id:: 644dd61b-eaa7-4f0d-beab-103b48546ceb hl-color:: yellow
- Hit time: the time to access the upper level memory, including the time needed to determine whether it is a hit or miss hl-page:: 455 ls-type:: annotation id:: 644dd84f-f268-4600-a025-ab1372e99e01 hl-color:: yellow
- Miss penalty: the time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver the block to processor hl-page:: 455 ls-type:: annotation id:: 644dd859-a7c9-4058-b89f-6f638f885927 hl-color:: yellow
-
Memory Technologies
ls-type:: annotation hl-page:: 457 hl-color:: yellow id:: 644dd921-308b-4b4d-9440-e082bb44cfae collapsed:: true
- SRAM Technology
ls-type:: annotation
hl-page:: 458
hl-color:: yellow
id:: 644dd962-ebaf-4014-b8ec-4e24525359bb
- Single port R/W memory array; Fixed access time to any datum; Access time close to cycle time; 6-8 transistors per bit; Minimal power needed to retain data in standby
- DRAM Technology
ls-type:: annotation
hl-page:: 458
hl-color:: yellow
id:: 644dd9ad-2f6e-445e-b4e8-4e4b45eece09
- Value kept in a cell is stored as a charge in a ==capacitor==
hl-page:: 458
ls-type:: annotation
id:: 644ddaac-b1da-455d-848f-547ea2fb90cf
hl-color:: yellow
- A single transistor to access (R/W) this stored charge. Thus denser and cheaper
- The charge cannot be kept indefinitely and must periodically be refreshed, thus Dynamic RAM. hl-page:: 458 ls-type:: annotation id:: 644ddb68-1f5e-483b-a1a0-df3979e962d1 hl-color:: yellow
- To refresh the cell, read its contents and write it back. ls-type:: annotation hl-page:: 458 hl-color:: yellow id:: 644ddb9a-f0f6-4161-8105-6d74ecad0ece
- Organization of DRAM
- Two-level decoding structure: bits organized in rows, refresh ==an entire row== with a read cycle followed immediately by a write cycle. hl-page:: 458 ls-type:: annotation id:: 644ddfbe-a7cf-4048-b794-baae5d8891db hl-color:: yellow
- DRAMs buffer rows (with SRAM) for repeated access. hl-page:: 458 ls-type:: annotation id:: 644ddfe7-91ad-4c6b-9a67-075bf4eb6f33 hl-color:: yellow
- Multiple banks. Bank consists of a series of rows, each with its own row buffer, permitting R/W simultaneously with ==address interleaving==. hl-page:: 460 ls-type:: annotation id:: 644de669-6f2d-4853-83db-3b84b1f86a1e hl-color:: yellow
- DRAMs added clocks and are properly called Synchronous DRAMs. The clock makes memory and processor synchronize easily. Burst transfer. hl-page:: 458 ls-type:: annotation id:: 644de60d-55a9-44f8-b1c7-4f9848b62994 hl-color:: yellow
- Flash Memory: Refer to ((643ba369-83df-42f9-9ee9-b45d4652e8fb)) hl-page:: 460 ls-type:: annotation id:: 644ddddd-fd99-4cd6-b551-3e4cf0221948 hl-color:: yellow
- Disk Memory: Refer to ((6437a4da-bca4-4f13-b018-30f3400d169f)) hl-page:: 460 ls-type:: annotation id:: 644ddde0-69a4-4dc6-8a66-3e802d201809 hl-color:: yellow
-
The Basics of Caches
hl-page:: 462 ls-type:: annotation id:: 644dd9cc-74f0-4484-8a39-a3715e2b9194 hl-color:: yellow collapsed:: true
- Example Scenario: requests are each one word and the blocks also consist of a single word hl-page:: 463 ls-type:: annotation id:: 644deb2b-3128-4949-a7d8-8d812e57d1e5 hl-color:: yellow
- direct-mapped cache
hl-page:: 463
ls-type:: annotation
id:: 644decf0-302e-44bc-8ebe-0e195d66565d
hl-color:: yellow
- Each memory location is mapped to exactly one fixed location in cache.
- Mapping function `(Block address) mod (Number of cache blocks)`, where the number of cache blocks is usually a power of 2
- More specifically, the block index should be $\left\lfloor \frac{\text{Byte Address}}{\text{Bytes per Block}} \right\rfloor \bmod N$, where $N$ is the number of cache blocks
- tag
hl-page:: 463
ls-type:: annotation
id:: 644dee3d-839d-44c0-a772-b26b029987da
hl-color:: yellow
- Since many blocks may be mapped into the same cache block, tag field is added to the cache block, in order to ==identify whether a word in the cache corresponds to the requested word==. hl-page:: 463 ls-type:: annotation id:: 644deeaa-657f-43c4-a7c7-c39e34d21b8c hl-color:: yellow
- For example, for the direct-mapped cache above, use the upper bits of the RAM block address as tag.
- valid bit: whether an entry contains a valid address hl-page:: 465 ls-type:: annotation id:: 644def1f-26c4-4e81-b269-d1c341ec1ecc hl-color:: yellow
- The referenced address is divided into: Tag field (compare with cache entry) and Cache index (select the block) hl-page:: 467 ls-type:: annotation id:: 644df14a-7c1b-4f37-8c27-7bacb9d683d9 hl-color:: yellow
- Total cache size calculation
- Settings: 32-bit address, direct mapping, $2^n$ blocks in cache, $2^{m+2}$-byte blocks
- Tag field size is $32 - (n + m + 2)$
- $n$ bits for the index, $m$ bits for the word within the block, and 2 bits for the byte offset
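- A minimal sketch of the size calculation under the settings above, assuming each entry stores its data, its tag and one valid bit. The function names and the sample address are made up for illustration.
```python
def direct_mapped_layout(n: int, m: int, address_bits: int = 32) -> dict:
    """Field widths and total storage of a direct-mapped cache.

    n: log2(number of blocks); m: log2(words per block); words are 4 bytes.
    """
    tag_bits = address_bits - (n + m + 2)
    data_bits = (2 ** m) * 32                  # data bits per block
    bits_per_entry = data_bits + tag_bits + 1  # +1 for the valid bit
    return {
        "tag_bits": tag_bits,
        "index_bits": n,
        "offset_bits": m + 2,
        "total_bits": (2 ** n) * bits_per_entry,
    }


def split_address(addr: int, n: int, m: int) -> tuple:
    """Split a byte address into (tag, index, block offset)."""
    offset = addr & ((1 << (m + 2)) - 1)
    index = (addr >> (m + 2)) & ((1 << n) - 1)
    tag = addr >> (n + m + 2)
    return tag, index, offset


# 16 KiB of data in 4-word blocks: 1024 blocks, so n = 10 and m = 2.
print(direct_mapped_layout(10, 2))
print(split_address(0x12345678, 10, 2))
```
- For this configuration the tag is 18 bits and the cache needs 1024 × 147 bits in total, noticeably more than the 16 KiB of data it holds.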
- ==Larger blocks exploit spatial locality to lower miss rate==
hl-page:: 470
ls-type:: annotation
id:: 644df720-1c8b-4503-b63c-25eedb390c05
hl-color:: yellow
- Blocks that are too large (for a fixed total cache size) reduce the number of blocks and may increase the miss rate. In addition, the miss penalty goes up, and a large portion of each block may stay unused before the block is swapped out.
- With higher memory bandwidth, larger cache blocks are usually preferred, since their miss penalty is not much higher than that of smaller blocks.
- Handling Cache Misses
ls-type:: annotation
hl-page:: 471
hl-color:: yellow
id:: 644df84e-9877-44a2-8caf-09b4145b6616
- Settings: In-order pipeline, instruction cache miss (data cache almost the same)
-
- ==Send the original PC== value (PC – 4) to the memory.
- Instruct memory to perform a read and ==wait==.
- ==Write the cache entry== (write data field, put upper bits of the address into tag field, and set valid bit)
- Restart the instruction execution at the first step, which will ==re-fetch== the instruction
- Handling Writes
ls-type:: annotation
hl-page:: 472
hl-color:: yellow
id:: 644dfa55-2147-43e0-92c4-b6eff3c77fd4
- write-through
ls-type:: annotation
hl-page:: 472
hl-color:: yellow
id:: 644dfb90-ed84-4674-9b19-933cc0d6452f
collapsed:: true
- Write the data both to cache and to memory
- On write miss, first fetch the block from memory, then write the word to cache and memory. This is called write allocate policy.
hl-page:: 472
ls-type:: annotation
id:: 644dfbc5-1c5c-438b-a90d-ff3d1de00f5d
hl-color:: yellow
- no write allocate: only write to memory, without fetching the to-be-written block hl-page:: 473 ls-type:: annotation id:: 644dfe25-7d59-43b0-bf19-3a3a1c46ff0b hl-color:: yellow
- write buffer: The processor writes to the cache and the buffer, instead of to memory, and then goes on. Meanwhile, the write buffer writes the buffered data to memory. hl-page:: 473 ls-type:: annotation id:: 644dfcad-d2f6-4e50-8b65-f539c7a54f7a hl-color:: yellow
- This scheme would still ==fail when the processor generates writes faster== than the memory could accept them.
- write-back
ls-type:: annotation
hl-page:: 473
hl-color:: yellow
id:: 644dfc5f-ed37-45c4-930e-f6457dac22d8
collapsed:: true
- Write only to the cache. The block is written to memory only when it is replaced
- Implementing write-back strategy is more complex than write-through.
hl-page:: 473
ls-type:: annotation
id:: 644dfedb-3afe-4afb-9b56-6ac3c2067783
hl-color:: yellow
- A write to a write-through cache can be done in 1 cycle (with a write buffer). We can write the cache block and read the tag in the same cycle. If the tag doesn't match, we can safely abandon the block and issue a read to memory, since memory always has a copy of that block
- A write to a write-back cache needs 2 cycles. We must read the tag in the 1st cycle; in the 2nd cycle, we write if the tag matches and otherwise issue a write miss. We cannot write directly in the 1st cycle, because on a mismatch the original block may be dirty and would be destroyed without a backup in memory.
- Nonetheless, another buffer can be added to alleviate this problem for write-back caches
-
Measuring and Improving Cache Performance
hl-page:: 477 ls-type:: annotation id:: 644de81f-f1ed-42b6-8b51-d6e72b8162bd hl-color:: yellow collapsed:: true
- Performance metrics
collapsed:: true
- CPU time (again), can be divided into 2 parts
hl-page:: 478
ls-type:: annotation
id:: 644e0592-d4f8-4c4c-ab53-67473f40e1a9
hl-color:: yellow
$$\text{CPU time} = (\text{CPU-execution clock cycles} + \text{Memory-stall clock cycles}) \times \text{Clock cycle time}$$
- Memory-stall clock cycles
hl-page:: 478
ls-type:: annotation
id:: 644e0696-a447-4ddd-9bd3-dca459b91853
hl-color:: yellow
- Primarily caused by cache misses, so other factors are ignored. There are read misses and write misses:
$$\text{Memory-stall clock cycles} = \text{Read-stall cycles} + \text{Write-stall cycles}$$
- Writes (write-through) have two sources of stalls: write misses and write buffer stalls. Fortunately, in well-designed modern processors the write buffer capacity is usually sufficient, so such stalls can be neglected. hl-page:: 478 ls-type:: annotation id:: 644e084a-1d64-423e-9083-ebc373cbb373 hl-color:: yellow
$$\text{Memory-stall clock cycles} = \frac{\text{Memory accesses}}{\text{Program}} \times \text{Miss rate} \times \text{Miss penalty}$$
- The example problem: Calculating Cache Performance
hl-page:: 479
ls-type:: annotation
id:: 644e09e3-eb95-42e2-a1f9-f0cf60fecaa2
hl-color:: yellow
$$\text{CPI}_{\text{stall}} = \text{CPI}_{\text{perfect}} + (\text{ICache miss rate} + \text{DCache miss rate} \times \text{Mem-ref ratio}) \times \text{Miss penalty}$$
- Average memory access time (AMAT). Hit time should also be taken into account.
hl-page:: 481
ls-type:: annotation
id:: 644e0b77-e7b1-4e9c-ab3f-ce39337117e3
hl-color:: yellow
$$\text{AMAT} = \text{Time for a hit} + (\text{Miss rate} \times \text{Miss penalty})$$
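- The CPI and AMAT formulas above can be evaluated directly. The sketch below uses sample numbers chosen for illustration (2% instruction miss rate, 4% data miss rate, 36% memory references, 100-cycle penalty); the function names are hypothetical.
```python
def cpi_with_stalls(cpi_perfect, icache_miss_rate, dcache_miss_rate,
                    mem_ref_ratio, miss_penalty):
    """CPI including memory stalls, per the CPI_stall formula."""
    stalls_per_instruction = (icache_miss_rate
                              + dcache_miss_rate * mem_ref_ratio) * miss_penalty
    return cpi_perfect + stalls_per_instruction


def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same unit as hit_time."""
    return hit_time + miss_rate * miss_penalty


cpi = cpi_with_stalls(2.0, 0.02, 0.04, 0.36, 100)
print(round(cpi, 2))              # 5.44
print(round(cpi / 2.0, 2))        # 2.72, slowdown relative to a perfect cache
print(round(amat(1, 0.05, 20), 2))  # 2.0 cycles
```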
- More Flexible Placement of Blocks
ls-type:: annotation
hl-page:: 481
hl-color:: yellow
id:: 644e0c19-d858-42e5-bfdb-223ff5ebba0d
- Direct mapped: block can be placed in a fixed location, decided by its address in memory.
- Fully Associative: block can be placed in ==any location== in the cache (can be associated with any entry in cache)
hl-page:: 482
ls-type:: annotation
id:: 644e4d64-9421-4a8c-81da-fadceab42bb3
hl-color:: yellow
- To find a block, search is done ==in parallel with a comparator== associated with each cache entry, which is costly in hardware. Such scheme is suitable for small caches. hl-page:: 482 ls-type:: annotation id:: 644e4ddb-92f5-4356-9d47-5b85740a265f hl-color:: yellow
- Set Associative: each block can be placed in a fixed number of locations
hl-page:: 482
ls-type:: annotation
id:: 644e4e73-39fe-4103-bada-1a4e86531b6a
hl-color:: yellow
- An n-way set-associative cache has a number of sets of n blocks. Each memory block maps to a unique cache set given by the index field, and can be placed in any block of that set. hl-page:: 482 ls-type:: annotation id:: 644e5139-a913-4f1f-bbdd-2d577612cdea hl-color:: yellow
$$\text{Set index} = \text{Block number} \bmod \text{Number of sets}$$
- Similar to the fully associative case, we search ==all the tags of a set==. hl-page:: 483 ls-type:: annotation id:: 644e52a9-4e90-4437-8996-7853fc809075 hl-color:: yellow
- Direct mapping can be seen as 1-way set associative, with a set size of 1. Fully associative can be seen as N-way set associative, with the set size equal to the total number of blocks.
- Going from 1-way to 2-way associativity decreases the miss rate by about 15%, but there is little further improvement in going to higher associativity. hl-page:: 485 ls-type:: annotation id:: 644e5618-0945-4d89-a683-8cd52d59afcd hl-color:: yellow
- The three portions of an address: `Tag | Index | Block offset`. hl-page:: 486 ls-type:: annotation id:: 644e56e1-7ffd-40fe-beed-5235db719d5a hl-color:: yellow
- For a fully associative cache, the `Index` field is 0 bits wide
- The higher the associativity, the more bits in the `Tag` field, and thus the higher the extra hardware cost
- Generally, we use LRU scheme to choose which block to replace
hl-page:: 488
ls-type:: annotation
id:: 644e56c6-038e-47e7-a076-0f1841c1aa14
hl-color:: yellow
- As associativity increases, implementing LRU gets harder; ls-type:: annotation hl-page:: 488 hl-color:: yellow id:: 644e5979-bafa-4d26-90ba-9c98218d5f04
- In a 2-way set-associative cache, LRU can be implemented by keeping a single bit in each set that indicates which of the two elements was referenced most recently.
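- To see how associativity and LRU replacement interact, here is a small miss-counting simulation. It is only a sketch: it models block addresses rather than bytes, uses an `OrderedDict` per set to track recency, and the 4-block cache and access sequence are assumed example values.
```python
from collections import OrderedDict


def count_misses(block_addresses, num_blocks, ways):
    """Simulate a cache with LRU replacement and return the number of misses."""
    num_sets = num_blocks // ways
    # One OrderedDict per set: keys are tags, ordered from least to most recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_addresses:
        index, tag = block % num_sets, block // num_sets
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)         # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used tag
            lines[tag] = True
    return misses


accesses = [0, 8, 0, 6, 8]
for ways in (1, 2, 4):
    print(ways, count_misses(accesses, num_blocks=4, ways=ways))
```
- With `ways=1` this is the direct-mapped case and with `ways=4` the fully associative one; the same sequence gives 5, 4 and 3 misses respectively.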
- Reducing the Miss Penalty Using Multilevel Caches
ls-type:: annotation
hl-page:: 489
hl-color:: yellow
id:: 644e598a-08a7-49f3-966b-e994a472e499
- The design considerations for a primary and secondary cache are different.
hl-page:: 490
ls-type:: annotation
id:: 644e5aff-0474-4748-afcb-ae57e4dd7292
hl-color:: yellow
- Primary cache focuses on hit time (since penalty is not that terrible, with secondary cache), while secondary cache focuses on miss rate.
- Primary caches are often smaller and use smaller block size, while secondary caches are larger and use larger block size with higher associativity.
- global miss rate: the fraction of references that missed in all cache levels
hl-page:: 495
ls-type:: annotation
id:: 644e5faf-a110-4d73-bb7f-74376764029b
hl-color:: yellow
- local miss rate: miss rate for each level per se.
- Each level's local miss rate is no lower than the global miss rate; since the global miss rate is the product of the local miss rates, the frequency of RAM access is greatly reduced
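- A minimal sketch relating local and global miss rates to CPI in a two-level hierarchy. All numbers are assumed for illustration (base CPI 1, 2% primary miss rate per instruction, 20-cycle secondary access, 400-cycle main memory access, 25% secondary local miss rate), and the function names are hypothetical.
```python
def global_miss_rate(local_miss_rates):
    """Fraction of references that miss in every level (product of local rates)."""
    product = 1.0
    for rate in local_miss_rates:
        product *= rate
    return product


def cpi_two_level(base_cpi, l1_miss_rate, l2_access_cycles,
                  global_rate, memory_cycles):
    """CPI when L1 misses pay the L2 access and global misses also pay main memory."""
    return base_cpi + l1_miss_rate * l2_access_cycles + global_rate * memory_cycles


rate = global_miss_rate([0.02, 0.25])           # 0.5% of references reach RAM
with_l2 = cpi_two_level(1.0, 0.02, 20, rate, 400)
without_l2 = 1.0 + 0.02 * 400
print(round(with_l2, 2), round(without_l2, 2))  # 3.4 9.0
```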
- Software Optimization via Blocking ls-type:: annotation hl-page:: 492 hl-color:: yellow id:: 644e5adc-7d6c-4136-8fa7-7849417949da
-
Dependable Memory Hierarchy
ls-type:: annotation hl-page:: 497 hl-color:: yellow id:: 644dea78-9ad9-4322-a05d-1b30789b85a8 collapsed:: true
- Two states of delivered service: Service accomplishment and Service interruption, meaning the delivered service is the same as / different from the specified service
hl-page:: 497
ls-type:: annotation
id:: 644f4473-da2b-4f3c-8d32-89b90b80f0b2
hl-color:: yellow
- Transitions from state 1 to state 2 are caused by failures, and transitions from state 2 to state 1 are called restorations. hl-page:: 497 ls-type:: annotation id:: 644f4518-efff-41a8-8445-5b9e2434e9cd hl-color:: yellow
- Failures can be permanent or intermittent. ls-type:: annotation hl-page:: 497 hl-color:: yellow id:: 644f4535-12cc-4669-bdbe-4d72b5066fae
- Reliability is a measure of the continuous service accomplishment.
hl-page:: 497
ls-type:: annotation
id:: 644f456f-e4bb-44be-8352-ad557cd80060
hl-color:: yellow
- mean time to failure (MTTF) and annual failure rate (AFR)
hl-page:: 497
ls-type:: annotation
id:: 644f4583-1a89-4136-b6e8-ae34cd452815
hl-color:: yellow
- How long you can use it before it fails
- Service interruption is measured as mean time to repair (MTTR).
hl-page:: 498
ls-type:: annotation
id:: 644f45a7-aaf5-4489-bce0-cb66c22f42cf
hl-color:: yellow
- How long to repair it after it fails
- Mean time between failures (MTBF) is simply the sum of MTTF + MTTR ls-type:: annotation hl-page:: 498 hl-color:: yellow id:: 644f4623-bf70-495e-93f7-358fc369d438
- Availability is a measure of service accomplishment with respect to the alternation between accomplishment and interruption. $\text{Avail}=\frac{\text{MTTF}}{\text{MTTF+MTTR}}$ hl-page:: 498 ls-type:: annotation id:: 644f46aa-3ee6-4b4c-95c5-74c326fd758e hl-color:: yellow
- To improve MTTF: Fault avoidance, Fault tolerance, Fault forecasting hl-page:: 498 ls-type:: annotation id:: 644f4720-89b2-4fd5-813f-1a9f18579745 hl-color:: yellow
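- The reliability measures above can be computed in a few lines. This sketch assumes times are given in hours, approximates AFR as hours per year divided by MTTF (reasonable only when MTTF is much longer than a year), and uses made-up sample values.
```python
HOURS_PER_YEAR = 8760  # 365 days; an assumption of this sketch


def availability(mttf_hours, mttr_hours):
    """Fraction of time the service is accomplished: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def mtbf(mttf_hours, mttr_hours):
    """Mean time between failures: MTTF + MTTR."""
    return mttf_hours + mttr_hours


def annual_failure_rate(mttf_hours):
    """Approximate fraction of devices expected to fail in a year."""
    return HOURS_PER_YEAR / mttf_hours


print(availability(999, 1))                      # 0.999
print(mtbf(999, 1))                              # 1000
print(round(annual_failure_rate(1_000_000), 5))  # 0.00876
```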
- Hamming Code
- Hamming distance: the ==minimum number of different bits== between any two ==correct== bit patterns hl-page:: 499 ls-type:: annotation id:: 644f47cb-b392-4558-a654-ce01bcf35209 hl-color:: yellow
- Parity code: Count the number of 1s in a word, and append a parity bit when written into memory, say 1 for odd 1s and 0 for even 1s. Therefore, the parity of the N+1 bit word should always be even, otherwise, there must be an error.
hl-page:: 499
ls-type:: annotation
id:: 644f48cc-4cf6-4eb8-96e5-93812335f425
hl-color:: yellow
- Actually, a 1-bit parity scheme can detect any odd number of errors; however, the probability of having 3 or 5 errors is much lower than the probability of having 2 hl-page:: 499 ls-type:: annotation id:: 644f497d-9fb1-4ec1-b8a0-845652868073 hl-color:: yellow
- Hamming Error Correction Code (ECC)
ls-type:: annotation
hl-page:: 499
hl-color:: yellow
id:: 644f49ab-9ec8-46f0-9dcb-86eb1da64df2
- Add extra parity bits to identify the position of the single-bit error, and the word with parity bits has a Hamming distance of 3.
- For 8-bit data word, we need 4 parity bits, respectively placed at the 1,2,4,8th bit of the 12-bit code.
- For larger words, the number of parity bits $p$ needed for $d$ data bits must satisfy $2^p \ge p+d+1$, i.e. $p \ge \log_2(p+d+1)$
- See FIGURE 5.24 for detail hl-page:: 500 ls-type:: annotation id:: 644f4df8-30ce-44fd-bc9f-973ccbb7fd6f hl-color:: yellow
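- A sketch of single-error correction with a Hamming code, assuming even parity and 1-based code positions with parity bits at the powers of two (1, 2, 4, 8 for 8 data bits). The function names are illustrative, and the list-of-bits representation is a convenience, not how hardware stores the word.

```python
def parity_bits_needed(d):
    """Smallest p such that 2**p >= p + d + 1."""
    p = 0
    while 2 ** p < p + d + 1:
        p += 1
    return p

def encode(data_bits):
    """Spread data bits over non-power-of-two positions, then fill parity bits."""
    p = parity_bits_needed(len(data_bits))
    n = len(data_bits) + p
    code = [0] * (n + 1)                  # index 0 unused, positions are 1-based
    remaining = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):               # not a power of two: data position
            code[pos] = next(remaining)
    for i in range(p):
        mask = 1 << i                     # parity bit at position `mask`
        code[mask] = sum(code[pos] for pos in range(1, n + 1) if pos & mask) % 2
    return code[1:]

def syndrome(code):
    """XOR of the positions holding a 1; nonzero value = position of a single error."""
    s = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            s ^= pos
    return s

def correct(code):
    """Flip the bit named by the syndrome, assuming at most one bit is wrong."""
    s = syndrome(code)
    fixed = list(code)
    if 0 < s <= len(fixed):
        fixed[s - 1] ^= 1
    return fixed, s

codeword = encode([1, 0, 0, 1, 1, 0, 1, 0])
damaged = list(codeword)
damaged[10 - 1] ^= 1                      # flip position 10
print(correct(damaged))
```

Each parity bit covers the positions whose binary index has the corresponding bit set, so the failing parity groups spell out the position of the error.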
- Single error correction and Double error detection
hl-page:: 501
ls-type:: annotation
id:: 644f4e40-7352-4c4a-9a9c-5bdfb79ada0a
hl-color:: yellow
- Make the code's Hamming distance 4 by adding one more parity bit that covers the whole code word.
- Also, see the textbook for detail
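- A sketch of how SEC/DED decoding can classify a received word, assuming a 12-bit even-parity Hamming code word followed by one overall parity bit. The function names and result strings are illustrative.

```python
def syndrome(code):
    """XOR of the 1-based positions that hold a 1 in the Hamming code word."""
    s = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            s ^= pos
    return s

def classify(word):
    """`word` is the Hamming code word followed by the overall parity bit."""
    *code, overall = word
    s = syndrome(code)
    parity_ok = (sum(code) + overall) % 2 == 0
    if s == 0 and parity_ok:
        return "no error"
    if s != 0 and not parity_ok:
        return "single error (correctable)"
    if s == 0 and not parity_ok:
        return "error in the overall parity bit"
    return "double error (detected, not correctable)"

# Valid 12-bit code word for data 1001 1010, plus overall parity 0.
word = [0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0]
print(classify(word))

two_flips = list(word)
two_flips[4] ^= 1
two_flips[8] ^= 1
print(classify(two_flips))
```

A double error leaves the overall parity even while the syndrome is nonzero, which is what separates it from a correctable single error.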
- Two states of delivered service: Service accomplishment & Service interruption, which means delivered service is the same as/different from specified service
hl-page:: 497
ls-type:: annotation
id:: 644f4473-da2b-4f3c-8d32-89b90b80f0b2
hl-color:: yellow
-
Virtual Machines
ls-type:: annotation hl-page:: 503 hl-color:: yellow id:: 644e0550-6991-4b86-9765-612e71bed268 collapsed:: true
- System Virtual Machines: run the same ISA as the native hardware
hl-page:: 503
ls-type:: annotation
id:: 644f511d-3a0f-4204-acd0-40eca3d58610
hl-color:: yellow
- A single computer, running multiple VMs, supporting multiple OSs and they share the hardware
- virtual machine monitor (VMM) or hypervisor hl-page:: 503 ls-type:: annotation id:: 644f5195-6bdf-4db1-9e4a-5b4dbc959df6 hl-color:: yellow
- host and guest ls-type:: annotation hl-page:: 503 hl-color:: yellow id:: 644f51a9-f75b-4d73-91dd-89fbaa91ffe9
- protection, managing software, managing hardware
- The cost of processor virtualization depends on the workload.
hl-page:: 504
ls-type:: annotation
id:: 644f5241-37dc-4efa-acc9-9ae8739b7e19
hl-color:: yellow
- User-level processor-bound programs have 0 virtualization overhead, native speed hl-page:: 504 ls-type:: annotation id:: 644f5287-cac6-46c8-ad60-8cee1ca0c57c hl-color:: yellow
- I/O-intensive workloads, which are also OS-intensive, have high virtualization overhead because they execute many system calls and privileged instructions. hl-page:: 504 ls-type:: annotation id:: 644f52a8-98fb-466b-92cf-7c0f942382b2 hl-color:: yellow
- Requirements of a Virtual Machine Monitor
ls-type:: annotation
hl-page:: 505
hl-color:: yellow
id:: 644f52db-8deb-4e8b-8a11-e59b2bbfbe26
- Guest software should behave on a VM exactly as if it were running on the native hardware, ls-type:: annotation hl-page:: 505 hl-color:: yellow id:: 644f533e-a31c-4916-bb47-2c97eae78310
- Guest software should not be able to change allocation of real system resources directly. ls-type:: annotation hl-page:: 505 hl-color:: yellow id:: 644f5340-0e7e-4ea2-b337-2779d4092da5
- At least two processor modes, system and user. ls-type:: annotation hl-page:: 505 hl-color:: yellow id:: 644f5401-cfe8-439b-af2e-d91bc6dae442
- A privileged subset of instructions that is available only in system mode; all system resources must be controllable only via these instructions. ls-type:: annotation hl-page:: 505 hl-color:: yellow id:: 644f5403-7aab-4e4e-ae71-87aa2d22bbbd
-
Virtual Memory
ls-type:: annotation hl-page:: 506 hl-color:: yellow id:: 644f4178-a1ca-40e3-b488-3ce044954e46 collapsed:: true
- Making Address Translation Fast: the TLB
ls-type:: annotation
hl-page:: 517
hl-color:: yellow
id:: 644f5d06-3450-4d12-96e5-374712790d9e
collapsed:: true
- TLB entry
- Tag holds a portion of the VPN, and data entry holds a PPN. hl-page:: 517 ls-type:: annotation id:: 644f5d6e-0c4e-48ad-9284-7b6247cc3d15 hl-color:: yellow
- TLB will need to include other status bits, such as the dirty and the reference bits. hl-page:: 518 ls-type:: annotation id:: 644f5d87-be05-4020-a8e8-d23c4a0fcfa4 hl-color:: yellow
- Cache-like considerations
- Write policy: write-back, since the miss rate is low
- Associativity: depends on the number of TLB entries; small TLBs are often fully associative, while larger ones use low associativity
- Replacement: many systems provide some support for random replacement, because LRU is expensive to implement, especially in hardware hl-page:: 518 ls-type:: annotation id:: 644f5f4f-8fce-4a04-9ccc-eb988f0e3ed3 hl-color:: yellow
- The Intrinsity FastMATH TLB
ls-type:: annotation
hl-page:: 519
hl-color:: yellow
id:: 644f60f0-680e-4aa0-a0de-165342178045
- 4 KiB pages and a 32-bit address space give a 20-bit VPN. Thus each entry has a 20-bit tag, a 20-bit data field (the PPN), and some other bookkeeping bits
- Fully associative TLB with 16 entries; the VPN is compared against all tags
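- The field widths above follow from simple arithmetic. A sketch, assuming a power-of-two page size; the function names and the sample address are illustrative.

```python
def vpn_and_offset_bits(address_bits, page_bytes):
    """Split an address width into (VPN bits, page-offset bits)."""
    offset_bits = page_bytes.bit_length() - 1    # log2 for powers of two
    return address_bits - offset_bits, offset_bits

def split_address(va, page_bytes):
    """Return (VPN, page offset) for a virtual address."""
    offset_bits = page_bytes.bit_length() - 1
    return va >> offset_bits, va & (page_bytes - 1)

print(vpn_and_offset_bits(32, 4096))
vpn, offset = split_address(0x12345678, 4096)
print(hex(vpn), hex(offset))
```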
- On a TLB miss, the MIPS hardware saves the page number of the reference in a special register and raises an exception.
hl-page:: 519
ls-type:: annotation
id:: 644f61e2-eb58-44d5-8b89-b55c3708f25a
hl-color:: yellow
- The hardware maintains an index that indicates the recommended entry to replace, which is chosen randomly. hl-page:: 519 ls-type:: annotation id:: 644f62bd-6a29-40bc-a634-15fed6bc1ec9 hl-color:: yellow
- TLB miss routine in OS indexes the table with relevant registers.
- Using a special set of system instructions that can update the TLB, the OS handler replaces an entry with the new entry fetched from memory hl-page:: 519 ls-type:: annotation id:: 644f6250-82e1-4286-b25a-e4f4eb0eb622 hl-color:: yellow
- A true page fault occurs if the PTE does not have a valid PA hl-page:: 519 ls-type:: annotation id:: 644f62a6-7b6d-4434-bc3d-233892af468f hl-color:: yellow
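- A toy model of the flow above: a 16-entry fully associative TLB, refilled from a page table on a miss, with a randomly chosen victim. The class and its fields are hypothetical, and the page table is modeled as a plain dictionary. As a simplification, the sketch raises the page fault during refill instead of loading an invalid PTE and faulting on the retry.

```python
import random

PAGE_OFFSET_BITS = 12      # 4 KiB pages
TLB_ENTRIES = 16

class PageFault(Exception):
    """Raised when the page table has no valid mapping for a VPN."""

class TLB:
    def __init__(self, page_table, seed=0):
        self.page_table = page_table       # dict: VPN -> PPN (absent = not resident)
        self.entries = {}                  # tag (VPN) -> PPN, fully associative
        self.rng = random.Random(seed)     # stands in for the hardware Random index
        self.misses = 0

    def translate(self, va):
        vpn = va >> PAGE_OFFSET_BITS
        offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
        if vpn not in self.entries:        # TLB miss: run the refill "handler"
            self.misses += 1
            self._refill(vpn)
        return (self.entries[vpn] << PAGE_OFFSET_BITS) | offset

    def _refill(self, vpn):
        if vpn not in self.page_table:     # simplification: fault detected here
            raise PageFault(hex(vpn))
        if len(self.entries) >= TLB_ENTRIES:
            victim = self.rng.choice(list(self.entries))
            del self.entries[victim]       # random replacement
        self.entries[vpn] = self.page_table[vpn]

tlb = TLB({0x12345: 0x00ABC})
print(hex(tlb.translate(0x12345678)), tlb.misses)
print(hex(tlb.translate(0x12345004)), tlb.misses)   # same page: TLB hit
```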
- Integrating Virtual Memory, TLBs, and Caches
ls-type:: annotation
hl-page:: 519
hl-color:: yellow
id:: 644f6050-cd77-4d32-afdb-3bf8a90302b7
collapsed:: true
- FIGURE 5.31 Processing a read or a write-through in the Intrinsity FastMATH TLB and cache. ls-type:: annotation hl-page:: 521 hl-color:: yellow id:: 644f60d4-72c9-41e9-b976-d1ed1c07d830
- Cache and TLB
- In the simplest case, all memory addresses are translated to PAs before the cache access. This is slow. hl-page:: 522 ls-type:: annotation id:: 644f6370-8fd9-4a5e-92e3-09b74f6baa04 hl-color:: yellow
- Virtually addressed cache: uses tags from VAs, VIVT, to avoid using TLB before cache access (unless a TLB miss), which takes TLB out of critical path.
hl-page:: 522
ls-type:: annotation
id:: 644f6397-14d2-491c-9cef-bbfd6e5cca8f
hl-color:: yellow
- Aliasing occurs when the same object has two names; here, two VAs map to the same physical page. That page can then have two copies in a virtually addressed cache, which introduces a consistency problem. hl-page:: 523 ls-type:: annotation id:: 644f6434-eb87-405e-9190-4f67466af467 hl-color:: yellow
- Common compromise: virtually indexed but physically tagged, VIPT. Use the page offset portion of the address, which is identical in VA and PA, as the index of cache. hl-page:: 523 ls-type:: annotation id:: 644f6491-3e52-4db7-a824-5988e124ee1c hl-color:: yellow
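- For the index to come entirely from the page offset, the index and block-offset bits together must fit within the page offset, which means each way of the cache can be no larger than a page. A sketch of that check; the function name and the sample cache sizes are illustrative assumptions.

```python
def index_fits_in_page_offset(cache_bytes, associativity, page_bytes):
    """True when (index bits + block-offset bits) <= page-offset bits,
    i.e. when one way of the cache is no larger than one page."""
    return cache_bytes // associativity <= page_bytes

# Hypothetical configurations with 4 KiB pages.
print(index_fits_in_page_offset(32 * 1024, 8, 4096))   # 4 KiB per way
print(index_fits_in_page_offset(16 * 1024, 1, 4096))   # 16 KiB per way
```

When the check fails, some index bits come from the translated part of the address, and the aliasing problem can reappear.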
- Implementing Protection with Virtual Memory (Trivial)
hl-page:: 523
ls-type:: annotation
id:: 644f65cf-03a5-4132-b091-e08b9ce9002a
hl-color:: yellow
collapsed:: true
- Requirement for Hardware
- Support at least two modes that indicate whether the running process is a user process or an operating system (supervisor) process
- Provide a portion of the processor state that a user process can read but not write. ls-type:: annotation hl-page:: 523 hl-color:: yellow id:: 644f65f9-38a8-4ead-90f5-494c0b2e3191
- Provide mechanisms whereby the processor can go from user mode to supervisor mode and vice versa. In MIPS, `syscall` and `eret` hl-page:: 524 ls-type:: annotation id:: 644f660f-7534-4124-9b7e-32cd2ce2c77a hl-color:: yellow
- Handling TLB Misses and Page Faults
ls-type:: annotation
hl-page:: 525
hl-color:: yellow
id:: 644f65b8-c079-42ce-a6b6-e1c56dbf45e9
- TLB miss can indicate one of two possibilities: Page in memory and True page fault hl-page:: 525 ls-type:: annotation id:: 644f67c6-de09-4546-b3a6-cbb286cca5bb hl-color:: yellow
- MIPS handles a TLB miss in software. It brings in PTE from memory and then re-executes the instruction that caused the TLB miss. Upon re-executing, it will get a TLB hit.
hl-page:: 525
ls-type:: annotation
id:: 644f6897-2635-4cb9-ae6b-2515dc05694b
hl-color:: yellow
- We must prevent load/store from actually completing when there is an exception, when implementing the pipeline hl-page:: 526 ls-type:: annotation id:: 644f68df-1f9b-4430-a2ea-09ee1ca7361f hl-color:: yellow
- General process of TLB handling
- Upon a TLB miss, the MIPS hardware saves the page number of the reference in `BadVAddr` and generates exception. hl-page:: 527 ls-type:: annotation id:: 644f6974-151b-46fa-afe2-ceed4ec48722 hl-color:: yellow
- Control is transferred to address `0x8000_0000`, the location of the TLB miss handler. hl-page:: 528 ls-type:: annotation id:: 644f69a6-31f7-4f57-8c78-75a19f0d83fd hl-color:: yellow
- MIPS hardware places everything you need in the `Context` register: `base of page table | VPN` hl-page:: 528 ls-type:: annotation id:: 644f69dd-753d-4df6-b0fc-4f112a12f0ce hl-color:: yellow
- Since TLB misses are quite frequent, the handler is simple
```asm
TLBmiss:
    mfc0 $k1,Context   # copy address of PTE into temp $k1
    lw   $k1,0($k1)    # put PTE into temp $k1
    mtc0 $k1,EntryLo   # put PTE into special register EntryLo
    tlbwr              # put EntryLo into TLB entry at Random
    eret               # return from TLB miss exception
```
- TLB miss routine does not check validity of the page, but rather directly loads the PTE into TLB. If it is a true page fault, let the processor generate the Page Fault exception.
- TLB miss routine does not save process state, since it doesn't need to change anything
- TLB miss has a separate handler address, saving the effort to diagnose exception type from `Cause` register
- The process of handling a page fault is laborious. See FIGURE 5.34 for detail. hl-page:: 530 ls-type:: annotation id:: 644f6cf2-142f-4b34-a4c1-06de2a87879a hl-color:: yellow
- To avoid the problem of a page fault during this low-level exception code, MIPS sets aside a portion of its address space that cannot have page faults, called `unmapped`. hl-page:: 529 ls-type:: annotation id:: 644f6d39-ad59-4a15-81a9-8dede8bef344 hl-color:: yellow
- Aside: The VMM implements memory control by maintaining a shadow page table. See textbook for detail hl-page:: 529 ls-type:: annotation id:: 644f6d8d-7c2f-4cb7-b689-eb282d35f411 hl-color:: yellow
-
A Common Framework for Memory Hierarchy
ls-type:: annotation hl-page:: 533 hl-color:: yellow id:: 644e610b-15ef-47cd-8701-7bf7be87f16c collapsed:: true
- Question 1: Where Can a Block Be Placed?
ls-type:: annotation
hl-page:: 534
hl-color:: yellow
id:: 644f6e31-ed3f-4bbf-8f30-53fe28e9e518
- This entire range of schemes can be thought of as variations on a set-associative scheme where the number of sets and the number of blocks per set varies hl-page:: 534 ls-type:: annotation id:: 644f6e76-bf1e-4def-b9c0-75f1faf3b5c6 hl-color:: yellow
- As cache size grows, the benefit of increasing associativity is slight.
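- A sketch of how one placement rule covers direct-mapped, set-associative, and fully associative caches. The function name is illustrative, and the 8-block cache with block address 12 is just a small example.

```python
def set_index(block_address, num_blocks, associativity):
    """Set that a block maps to when the cache has num_blocks / associativity sets."""
    num_sets = num_blocks // associativity
    return block_address % num_sets

# An 8-block cache, block address 12:
print(set_index(12, 8, 1))   # direct mapped: 8 sets of 1 block
print(set_index(12, 8, 2))   # 2-way set associative: 4 sets of 2 blocks
print(set_index(12, 8, 8))   # fully associative: 1 set of 8 blocks
```

Only the number of sets changes between the schemes; within the chosen set, the block may go in any way.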
- Question 2: How Is a Block Found?
ls-type:: annotation
hl-page:: 535
hl-color:: yellow
id:: 644f6ec3-e710-4abf-94a2-9f0c362e40e8
- Virtual memory uses fully associative placement (blocks are found through the page table), while caches and TLBs typically use set-associative placement
- Question 3: Which Block Should Be Replaced on a Cache Miss?
ls-type:: annotation
hl-page:: 536
hl-color:: yellow
id:: 644f6f0b-c5e3-4c38-90ec-67e3118b5c3d
- For hardware, random or approximated LRU, since they are simple to build
- In virtual memory, some form of LRU is always approximated, because the miss penalty is so large that even a small reduction in miss rate matters hl-page:: 536 ls-type:: annotation id:: 644f6f9a-a898-4b20-88c0-588c940ec793 hl-color:: yellow
- Question 4: What Happens on a Write?
ls-type:: annotation
hl-page:: 536
hl-color:: yellow
id:: 644f6fd5-59c8-41e3-8ede-402475ac6a72
- A summary of pros and cons for Write-back and Write-through
- When the latency of writing every store through to the lower level is unacceptable (as in virtual memory), write-back is used
- The Three Cs: An Intuitive Model for Understanding the Behavior of Memory Hierarchies
ls-type:: annotation
hl-page:: 538
hl-color:: yellow
id:: 644f6ffe-256a-4252-af5b-5a2a8d03fc2e
- Compulsory, Capacity and Conflict misses
- Note that, compulsory misses can be reduced by increasing the block size
-
Using a Finite-State Machine to Control a Simple Cache
ls-type:: annotation hl-page:: 540 hl-color:: yellow id:: 644f418c-dea9-442d-9f38-a60aef123e15 collapsed:: true
- Settings: Direct-mapped, Write-back and write-allocate, 16-Byte block, 16-KB total
- 10-bit Index, 4-bit Offset, 18-bit Tag
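- The field widths follow from the cache geometry. A sketch for a direct-mapped cache, assuming 32-bit byte addresses and power-of-two sizes; the function name is illustrative.

```python
def cache_fields(cache_bytes, block_bytes, address_bits=32):
    """Return (index bits, offset bits, tag bits) for a direct-mapped cache."""
    num_blocks = cache_bytes // block_bytes
    index_bits = num_blocks.bit_length() - 1     # log2 for powers of two
    offset_bits = block_bytes.bit_length() - 1
    tag_bits = address_bits - index_bits - offset_bits
    return index_bits, offset_bits, tag_bits

print(cache_fields(16 * 1024, 16))
```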
- Signals: 2 sets of signals, respectively CPU-Cache and Cache-Memory
- Read/Write, Valid, Address, Read data, Write data, Ready
- FSM for a Simple Cache Controller
ls-type:: annotation
hl-page:: 543
hl-color:: yellow
id:: 644f731a-a08e-4b86-ace7-800cf640accb
- Idle: Waits for a valid CPU request
- Compare Tag
- Test hit or miss: select the block with the index field, compare the tag, and finally check the valid bit.
- If hit, set `CacheReady` and go back to Idle. If write, set the dirty bit; if read, set the `ReadData`
- If miss, set the tag of the victim block to that of the requested address. Then go to Allocate or Write-back, determined by the dirty bit
- Note that this state can be split into separate compare and access states to shorten the clock cycle time
- Write-back: Writes a block to Memory (and wait for Memory's write ready). On completion, go to Allocate (since we have to wait for new data).
- Allocate: Waits for the Ready signal from Memory. On completion, goes to Compare Tag.
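- A behavioral sketch of the four-state controller above for a direct-mapped, write-back, write-allocate cache. It assumes memory is always ready after one step and tracks only tags and dirty bits, not data; the class and field names are hypothetical.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    COMPARE_TAG = auto()
    WRITE_BACK = auto()
    ALLOCATE = auto()

INDEX_BITS, OFFSET_BITS = 10, 4    # 16 KB cache, 16-byte blocks

class CacheController:
    def __init__(self):
        self.lines = {}            # index -> {"tag", "dirty"}; absent = invalid

    def access(self, address, is_write):
        """Process one CPU request and return the names of the states visited."""
        index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
        tag = address >> (OFFSET_BITS + INDEX_BITS)
        trace = [State.IDLE.name]
        state = State.COMPARE_TAG              # a valid request leaves Idle
        while state is not State.IDLE:
            trace.append(state.name)
            line = self.lines.get(index)
            if state is State.COMPARE_TAG:
                if line is not None and line["tag"] == tag:
                    if is_write:
                        line["dirty"] = True   # write hit marks the block dirty
                    state = State.IDLE         # hit: request done
                elif line is not None and line["dirty"]:
                    state = State.WRITE_BACK   # miss on a dirty victim
                else:
                    state = State.ALLOCATE     # miss on a clean or invalid block
            elif state is State.WRITE_BACK:
                line["dirty"] = False          # old block written to memory
                state = State.ALLOCATE
            elif state is State.ALLOCATE:
                self.lines[index] = {"tag": tag, "dirty": False}
                state = State.COMPARE_TAG      # retry, now a hit
        return trace

cache = CacheController()
print(cache.access(0x0000, is_write=True))     # cold miss, then write hit
print(cache.access(0x4000, is_write=False))    # same index, dirty victim
print(cache.access(0x4004, is_write=False))    # hit in the same block
```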
-
Parallelism and Memory Hierarchy: Cache Coherence
ls-type:: annotation hl-page:: 545 hl-color:: yellow id:: 644f41a2-4c7b-4b27-b75e-ea06924fa51e
- Coherence defines what values can be returned by a read
hl-page:: 545
ls-type:: annotation
id:: 644f76e2-d5bc-47b0-90dd-c7ff33f07f13
hl-color:: yellow
- A memory system is coherent if:
hl-page:: 545
ls-type:: annotation
id:: 644f77a6-d507-4b1a-bc25-7f42648fd943
hl-color:: yellow
- A read by a processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P. ls-type:: annotation hl-page:: 545 hl-color:: yellow id:: 644f77b5-83a5-4fb1-b3a6-3a2db489da39
- A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses. ls-type:: annotation hl-page:: 545 hl-color:: yellow id:: 644f77c2-e346-4ffc-bb20-a1b3933fe9fc
- Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors. ls-type:: annotation hl-page:: 546 hl-color:: yellow id:: 644f77ca-4b61-4296-8a04-a7781102a0a2
- Consistency determines when a written value will be returned by a read. hl-page:: 545 ls-type:: annotation id:: 644f76e7-a61d-4f42-9c8a-adc7eb920a8c hl-color:: yellow
- Basic Schemes for Enforcing Coherence
ls-type:: annotation
hl-page:: 546
hl-color:: yellow
id:: 644f7883-18c3-43d5-a7ed-51fac0fb3d83
- Migration: A data item can be moved to a local cache and used there in a transparent fashion. ls-type:: annotation hl-page:: 546 hl-color:: yellow id:: 644f78a3-7065-435c-8ee6-21ae194c1961
- Replication: When shared data are being ==simultaneously read==, the caches make a copy of the data item in the local cache. ls-type:: annotation hl-page:: 547 hl-color:: yellow id:: 644f78b7-c9fa-4854-824d-a8d0952ad3be
- cache coherence protocols, tracking the state of any sharing of a data block hl-page:: 547 ls-type:: annotation id:: 644f7956-11ad-4140-b966-6d2061ae6fb2 hl-color:: yellow
- snooping
ls-type:: annotation
hl-page:: 547
hl-color:: yellow
id:: 644f7a3d-783f-49d2-b697-91200284b2c8
- Every cache that holds a copy of a block also holds a copy of that block's sharing status.
- caches are accessible via some broadcast medium hl-page:: 547 ls-type:: annotation id:: 644f7ab7-b97c-41a9-8780-e9281ab19565 hl-color:: yellow
- cache controllers monitor (snoop) on the medium, to determine whether they have a copy of a block that is requested on a bus or switch access. hl-page:: 547 ls-type:: annotation id:: 644f7ad5-2103-4b23-a6c0-0417d431eb02 hl-color:: yellow
- write invalidate protocol: invalidate copies in other caches on a write. When others want to read/write later, they miss and fetch from memory. When others want to write simultaneously, there is a race and there is always a winner. hl-page:: 547 ls-type:: annotation id:: 644f797d-e82e-49fc-97b0-df7034494d5f hl-color:: yellow
- ensure that a processor has exclusive access to a data item before it writes that item
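- A toy model of write-invalidate snooping. The `Bus` and `SnoopyCache` classes are hypothetical, caches track single addresses instead of blocks, and writes go straight through to memory to keep the sketch short (a real write-back design would supply the data from the owning cache).

```python
class Bus:
    """Shared broadcast medium: every attached cache sees every write."""
    def __init__(self):
        self.memory = {}
        self.caches = []

class SnoopyCache:
    def __init__(self, bus):
        self.bus = bus
        self.data = {}                 # address -> value, valid copies only
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.data:      # miss: fetch a fresh copy
            self.data[addr] = self.bus.memory.get(addr, 0)
        return self.data[addr]

    def write(self, addr, value):
        for other in self.bus.caches:  # broadcast the invalidation
            if other is not self:
                other.data.pop(addr, None)
        self.data[addr] = value        # writer now holds the only cached copy
        self.bus.memory[addr] = value  # simplification: write through to memory

bus = Bus()
a, b = SnoopyCache(bus), SnoopyCache(bus)
X = 0x100
print(a.read(X), b.read(X))   # both caches hold X
a.write(X, 1)                 # B's copy is invalidated
print(X in b.data)
print(b.read(X))              # B misses and sees the new value
```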
- Parallelism and the Memory Hierarchy: Redundant Arrays of Inexpensive Disks
hl-page:: 550
ls-type:: annotation
id:: 644f41d0-39d8-476b-8d62-08e2bf5979ef
hl-color:: yellow
collapsed:: true
- Refer to ((6437e8b0-b179-46c1-9173-e9b080273f7e))
-
Real Stuff: The ARM Cortex-A8 and Intel Core i7 Memory Hierarchies
ls-type:: annotation hl-page:: 557 hl-color:: yellow id:: 644e0503-5ee5-4da8-bf62-04b2fa91e5f9 collapsed:: true
- To support multiple issue (fetch more instructions per cycle), a popular technique is to break the cache into banks and allow multiple, independent, parallel accesses (if the addresses are in different banks). hl-page:: 558 ls-type:: annotation id:: 644fae8b-080a-4bed-8e02-08519a34b547 hl-color:: yellow
- nonblocking cache: Hit under miss allows additional cache hits during a miss, while miss under miss allows multiple outstanding cache misses. hl-page:: 558 ls-type:: annotation id:: 644faeff-a7cf-4a48-b257-50e090ca62be hl-color:: yellow
- Core i7 has prefetch mechanism for data accesses hl-page:: 559 ls-type:: annotation id:: 644faf31-abe7-4938-b43a-15ca77c5a20a hl-color:: yellow
- Going Faster: Cache Blocking and Matrix Multiply ls-type:: annotation hl-page:: 561 hl-color:: yellow id:: 644e053c-0254-41ad-ba8c-25527b64a217
-
Fallacies and Pitfalls
ls-type:: annotation hl-page:: 564 hl-color:: yellow id:: 644deab8-5929-424b-a0bb-9944268d757e
- Pitfall: Having less set associativity for a shared cache than the number of cores or threads sharing that cache. ls-type:: annotation hl-page:: 565 hl-color:: yellow id:: 644fb0df-1bf0-479c-bb16-de1b6729faa4
- Word List 5
- rug: a small carpet hl-page:: 462 ls-type:: annotation id:: 644dded8-ffe3-4d2e-abbe-d4374b5b30af hl-color:: green
- proximity: nearness in time or space ls-type:: annotation hl-page:: 478 hl-color:: green id:: 644e0631-3612-4778-86c2-e6f062b426ab
- incur: to bring about or bring upon oneself; to suffer hl-page:: 489 ls-type:: annotation id:: 644e59e3-12be-418f-bd07-bb228ec2a8ac hl-color:: green
- Voila: (French) there it is; an exclamation of success or satisfaction hl-page:: 501 ls-type:: annotation id:: 644f4c11-76c4-4c50-8278-f5a18415cad6 hl-color:: green
- culprit: offender; the one responsible for a problem hl-page:: 505 ls-type:: annotation id:: 644f5381-dd30-4674-b37e-ce080e71a4a2 hl-color:: green
- duality: the quality of being twofold hl-page:: 515 ls-type:: annotation id:: 644f5b71-6ca0-47f3-8a20-9cc15c1fa0e0 hl-color:: green
- simplistic: oversimplified hl-page:: 545 ls-type:: annotation id:: 644f7695-12e8-4ffe-a6c7-105b596300de hl-color:: green
- saga: a long tale of adventure or heroic deeds ls-type:: annotation hl-page:: 561 hl-color:: green id:: 644fafd9-52b6-4378-b808-35d07dd67108
- underscore: underline hl-page:: 568 ls-type:: annotation id:: 644fb0c3-4e0b-41ea-8edb-01e11b936c65 hl-color:: green
-
Parallel Processors from Client to Cloud
ls-type:: annotation hl-page:: 586 hl-color:: yellow id:: 644fb152-58ba-4049-b461-37264438cfde

