OSTEP Chapter 43

This commit is contained in:
ridethepig 2023-04-16 17:50:10 +08:00
parent 1b2c362cdd
commit 539ca5c3d1
9 changed files with 1385 additions and 7 deletions


@@ -343,6 +343,7 @@
;; ;use Percent-encoding for other invalid characters
:file/name-format :triple-lowbar
:ui/show-brackets? true
:feature/enable-timetracking? false
;; specify the format of the filename for journal files
;; :journal/file-name-format "yyyy_MM_dd"


@@ -0,0 +1,7 @@
:root {
--ls-font-family: Noto Sans CJK SC, Helvetica Neue, sans-serif;
}
* {
font-variant-ligatures: none !important;
}


@@ -908,7 +908,6 @@ file-path:: ../assets/ostep_1681115599584_0.pdf
id:: 6437a50f-5c6c-47ff-9179-ac48118342d7
hl-color:: yellow
- **IO time**
collapsed:: true
- **Rotational Delay**: wait for the desired sector to rotate under the disk head
hl-page:: 466
ls-type:: annotation
@@ -1049,6 +1048,7 @@ file-path:: ../assets/ostep_1681115599584_0.pdf
ls-type:: annotation
id:: 6437f261-2d97-4f0c-85aa-06dd6d230ce0
hl-color:: yellow
collapsed:: true
- spread the blocks of the array across the disks in a round-robin fashion
ls-type:: annotation
hl-page:: 483
@@ -1129,6 +1129,7 @@ file-path:: ../assets/ostep_1681115599584_0.pdf
ls-type:: annotation
id:: 6438241b-f487-4cf3-b717-60811340a5bd
hl-color:: yellow
collapsed:: true
- An improved version of *RAID4*: RAID5 rotates the parity block across the drives.
- ((64382b1e-7c59-4729-a70f-68005b0640b4))
- **Performance**
@@ -1504,7 +1505,7 @@ file-path:: ../assets/ostep_1681115599584_0.pdf
ls-type:: annotation
id:: 643ac9e7-1479-4be9-81fe-acb750f363b4
hl-color:: yellow
- Potential Performance: large sequential read from a large file. However, with selected chunk size (threshold of going to another group), ==cost of seek between groups can be amortized==. The larger size of a chunk, the higher average bandwidth you will reach.
- Potential Performance Problem: large sequential read from a large file. However, with selected chunk size (threshold of going to another group), ==cost of seek between groups can be amortized==. The larger size of a chunk, the higher average bandwidth you will reach.
- Measuring File Locality
hl-page:: 550
ls-type:: annotation
@@ -1549,3 +1550,291 @@ file-path:: ../assets/ostep_1681115599584_0.pdf
ls-type:: annotation
id:: 643acfc2-eef7-4c7f-a8a5-1740c8788159
hl-color:: yellow
collapsed:: true
- Crash Scenarios
ls-type:: annotation
hl-page:: 560
hl-color:: yellow
id:: 643b7d72-4da5-42cf-abd5-dbbc3d9332d7
collapsed:: true
- Consider a ==write operation with new data block allocation== in the `vsfs` introduced above, which involves 3 independent writes to the disk
- Only one write reaches the disk
- data block: not a problem for the FS, as if the write never happened, though the user data is lost
- inode: FS inconsistency, the data bitmap says the block is unallocated while the inode says it is allocated; reads return garbage from the block
- data bitmap: FS inconsistency, a space leak; the block will never be used again
- Two writes reach the disk
- inode and bitmap: reads return garbage, though the FS is consistent
- data block and inode (or data block and bitmap): inconsistent, since inode and bitmap disagree
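- The enumeration above can be sketched as a toy classifier (all names below are illustrative, not part of vsfs):
```python
# Toy sketch: classify the crash outcome by which of the three vsfs
# writes (data block, inode, data bitmap) reached the disk first.
def classify(survived):
    d = "data" in survived
    i = "inode" in survived
    b = "bitmap" in survived
    if i != b:
        return "inconsistent"              # inode and bitmap disagree
    if i and b and not d:
        return "consistent, garbage data"  # inode points at junk
    return "consistent"                    # looks fine to the FS
```
For example, `classify({"data"})` reports the benign case (as if the write never happened), while `classify({"inode"})` reports an inconsistency.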
- The File System Checker
ls-type:: annotation
hl-page:: 562
hl-color:: yellow
id:: 643b8161-a7b8-496d-9121-4ce20ee8deb6
collapsed:: true
- Let inconsistencies happen and then fix them later, at reboot. This approach cannot solve all problems (such as data loss); the only goal is to make the FS metadata internally consistent. `fsck` runs before the FS is mounted
hl-page:: 562
ls-type:: annotation
id:: 643b81ae-edfd-4a9e-9df2-fdba2175dde2
hl-color:: yellow
- Basic summary of what `fsck` does
- Superblock: if corrupt, use an ==alternative copy==
- Free blocks: scan the inodes and the (double/triple/...) indirect blocks to collect ==information about allocated blocks==, and use this information to ==correct the bitmap==
- Inode links: traverse the whole directory tree and compute a ==reference count for each inode==, then verify it against the count stored in each inode. If an inode is allocated but no directory refers to it, move it to `lost+found`
- Duplicates: multiple inode pointers point to the same block; copy the block or clear one of the inodes
- Bad blocks, inode state, directory checks, etc.
- Problem: too slow
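- A minimal sketch of the inode-link pass, with dicts standing in for the directory tree and the per-inode link counts (every name here is hypothetical, not real `fsck` code):
```python
# Hypothetical sketch of fsck's link-count pass: walk the directory
# tree, count references to each inode, and trust the computed counts.
def check_links(tree, inode_counts, lost_found):
    refs = {}
    def walk(d):
        for entry in d.values():
            if isinstance(entry, dict):          # a subdirectory
                walk(entry)
            else:                                # entry is an inode number
                refs[entry] = refs.get(entry, 0) + 1
    walk(tree)
    fixed = {}
    for ino in inode_counts:
        actual = refs.get(ino, 0)
        if actual == 0:
            lost_found.append(ino)               # allocated but unreferenced
        fixed[ino] = actual                      # computed count wins
    return fixed
```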
- Journaling (or Write-Ahead Logging)
ls-type:: annotation
hl-page:: 564
hl-color:: yellow
id:: 643b86d9-afcc-4494-87d1-275502df79a7
- Basic Idea: Before writing the structures in place, first write a log elsewhere on the disk. If a crash takes place during the actual update, the FS can repair the inconsistency from the log.
- **Data Journaling**
ls-type:: annotation
hl-page:: 565
hl-color:: yellow
id:: 643b8dc6-e380-49e3-ab08-ce27aa8767e2
- **physical logging**: put the exact physical contents of the update in the journal
hl-page:: 565
ls-type:: annotation
id:: 643b8ebd-b694-4862-90f0-9fb0f1a847f5
hl-color:: yellow
- **checkpointing**: overwrite the old structures in the FS
hl-page:: 565
ls-type:: annotation
id:: 643b8eed-3b37-4ac0-b924-9a2d035f2517
hl-color:: yellow
- **transaction identifier**: a transaction-begin block containing information about the pending update, plus a transaction-end marker
hl-page:: 565
ls-type:: annotation
id:: 643b8f7d-76a2-477e-89ea-1321475b3dbe
hl-color:: yellow
- Journal write: Write the transaction (*Tx Begin* mark, data to update, *Tx End* mark) to log
- To make things faster, instead of issuing serial write requests, we may ==merge these requests.==
id:: 643b9070-8e4c-422a-8173-388fb801930d
- To avoid possible data loss within a single issued request (due to internal disk scheduling), the *Tx End* mark must be written as ==a separate request==, while the other parts of the log can be issued as one batch.
- Alternatively, adding a checksum also works. With a checksum, all of these blocks can be written in a single request; if the disk fails to propagate all of the bits, the mismatch will be noticed during the reboot scan and the log entry will be skipped.
hl-page:: 567
ls-type:: annotation
id:: 643b9816-e478-4f45-a5bd-fbe168fdc406
hl-color:: yellow
- Thus, this step splits into 2 stages: ==Journal Write and Journal Commit==, which write the Tx Begin mark plus the pending update, and the Tx End mark, respectively.
- To re-use the log region, add a journal superblock on the disk that records which transactions have been checkpointed (so they can be freed); the log is typically treated as circular.
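- The checksum variant can be sketched with a toy in-memory log, `zlib.crc32` standing in for whatever checksum the FS actually uses:
```python
import zlib

# Toy sketch: store a checksum of the transaction body in both the
# begin and end blocks, so the whole journal write can go out as one
# request and a torn write is detected at the reboot scan.
def write_tx(log, blocks):
    csum = zlib.crc32(b"".join(blocks))
    log.append(("TxBegin", csum))
    log.extend(blocks)
    log.append(("TxEnd", csum))

def tx_valid(log):
    (_, c1), (_, c2) = log[0], log[-1]
    return c1 == c2 == zlib.crc32(b"".join(log[1:-1]))
```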
- Protocol
- hl-page:: 570
ls-type:: annotation
id:: 643b9e00-4597-4a1a-890a-be95041f6b3b
hl-color:: yellow
1. **Journal write**: Write the contents of the transaction (Tx Begin, contents of the update) to the log; wait for these writes to complete.
2. **Journal commit**: Write the transaction commit block (Tx End) to the log; wait for the write to complete; the transaction is now committed.
3. **Checkpoint**: Write the contents of the update to their final locations within the file system.
4. **Free**: Some time later, mark the transaction free in the journal by updating the journal superblock.
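- The four steps map onto a toy in-memory version roughly like this (a list as the journal, a dict as the main FS area; all names illustrative):
```python
# Illustrative sketch of the data-journaling protocol. On a real disk,
# each step would wait for the previous writes to complete.
def journaled_update(journal, disk, updates):
    journal.append("TxBegin")            # 1. journal write: begin mark...
    journal.extend(updates.items())      #    ...plus the pending contents
    journal.append("TxEnd")              # 2. journal commit (separate write)
    for addr, data in updates.items():   # 3. checkpoint to final locations
        disk[addr] = data
    journal.clear()                      # 4. free the transaction
```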
- Recovery
ls-type:: annotation
hl-page:: 568
hl-color:: yellow
id:: 643b9301-4a07-459a-a413-5c2738560e10
- Crash before transaction commit, skip.
- Crash after transaction commit (but before checkpointing complete), replay.
- Redo Logging: On reboot, scan the log for committed transactions and try to write them again.
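- Redo logging can be sketched over a toy in-memory log (illustrative names): replay only transactions that have both marks, skip an uncommitted tail.
```python
# Illustrative recovery scan: committed transactions (TxBegin..TxEnd)
# are replayed; a transaction missing its end mark is skipped.
def recover(journal, disk):
    i = 0
    while i < len(journal):
        if journal[i] != "TxBegin":
            i += 1
            continue
        try:
            end = journal.index("TxEnd", i)
        except ValueError:
            break                        # crash before commit: skip
        for addr, data in journal[i + 1:end]:
            disk[addr] = data            # redo the checkpoint writes
        i = end + 1
```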
- **Metadata Journaling**
ls-type:: annotation
hl-page:: 570
hl-color:: yellow
id:: 643b96ec-18c9-4053-be7e-3b7d3b7dbbbd
- Data journaling doubles the write traffic to the disk, and seeking between the log area and the main data area is costly.
- Metadata journaling writes only metadata to the log, no data blocks. Each data block is written directly to the main data area before its metadata is logged.
- Protocol
- hl-page:: 571
ls-type:: annotation
id:: 643b9b7c-a43e-4c50-9dec-1d8f30bae712
hl-color:: yellow
1. **Data write**: Write data to final location; wait for completion (optional).
2. **Journal metadata write**: Write the begin block and metadata to log; wait for writes to complete.
3. **Journal commit**: Write the transaction commit block (Tx End) to log; wait for the write to complete; the transaction (including data) is now committed.
4. **Checkpoint metadata**: Write the contents of the metadata update to their final locations in FS.
5. **Free**: Later, mark the transaction free in journal superblock.
- Actually, Steps 1 and 2 can be issued concurrently, but Step 3 must wait for both Steps 1 and 2 to complete.
- Tricky Case: Block Reuse
ls-type:: annotation
hl-page:: 572
hl-color:: yellow
id:: 643b9c5a-c06b-4a4b-af75-9a2242069fc8
- Replay can overwrite a data block when the block is re-used after a deletion and the corresponding log entry has not yet been freed.
- The key point is that directory contents count as metadata. If the original block held a directory, this sequence causes the problem: modify the directory's entries, delete the directory, re-use the directory's block for a file. Replay during recovery then overwrites the file's data block with the old, deleted directory contents.
- Other Approaches
ls-type:: annotation
hl-page:: 574
hl-color:: yellow
id:: 643ba14d-00f4-4f92-921c-740f3b6def61
- Soft updates: carefully order the writes so that the on-disk structures are consistent at all times
- COW: never overwrite in place
- back-pointer: add a back pointer to each block so consistency can be checked against its inode
- optimistic crash consistency: a kind of transaction checksum
- premise
ls-type:: annotation
hl-page:: 563
hl-color:: green
id:: 643b824d-2732-46ae-961d-74a06db18138
- tad
ls-type:: annotation
hl-page:: 563
hl-color:: green
id:: 643b824f-c588-4b82-93e6-393016d3b5b1
- hideous
ls-type:: annotation
hl-page:: 572
hl-color:: green
id:: 643b9c40-6329-4720-9d26-75a78701392c
- hairy
ls-type:: annotation
hl-page:: 572
hl-color:: green
id:: 643b9c49-9e7c-4379-a640-8aff152ea511
- ## Log-structured File Systems
hl-page:: 579
ls-type:: annotation
id:: 643b8dad-3813-4048-8d04-5eb93a6bd182
hl-color:: yellow
- **Writing To Disk Sequentially**
hl-page:: 580
ls-type:: annotation
id:: 643bab9f-04f5-4e2f-b232-d6cfead45619
hl-color:: yellow
- write all updates (including metadata) to the disk sequentially, e.g. write a new data block and then write its newly updated inode right after it (rather than seeking to the far-away inode region)
- **Write Buffering**
hl-page:: 581
ls-type:: annotation
id:: 643bac5e-ce70-4d9b-92c5-4fb6dda099d6
hl-color:: yellow
- Writing sequentially alone doesn't mean good performance. A ==large number of contiguous writes or one large write== is the key to good write performance.
- Before writing to the disk, LFS ==keeps track of updates in memory==; when it has received a sufficient number of updates (a *segment*), it writes them to disk all at once.
hl-page:: 581
ls-type:: annotation
id:: 643bac81-cd5a-49aa-a81b-aff6c2405a40
hl-color:: yellow
- Segment size: similar to the evaluation here ((6437feab-eceb-4f11-9ced-ae43e2798c0c)). The larger the chunk size, the better the performance.
hl-page:: 582
ls-type:: annotation
id:: 643bb0bc-9781-484e-b249-224d89414165
hl-color:: yellow
- The effective write rate $R_{\text{effective}}$ for chunk size $D$, and the chunk size needed to reach a fraction $F$ of peak bandwidth:
$$R_{\text{effective}} = \frac{D}{T_{\text{write}}} = \frac{D}{T_{\text{position}}+\frac{D}{R_{\text{peak}}}} \qquad D = \frac{F}{1-F}\times R_{\text{peak}} \times T_{\text{position}}$$
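- Plugging in some illustrative numbers ($T_{\text{position}} = 10\,\text{ms}$, $R_{\text{peak}} = 100\,\text{MB/s}$, target $F = 0.9$):
```python
# Worked example of the two formulas above with illustrative numbers.
T_position = 0.010   # seconds of positioning (seek + rotation)
R_peak = 100.0       # peak transfer rate, MB/s
F = 0.9              # target fraction of peak bandwidth

D = F / (1 - F) * R_peak * T_position        # required chunk size: 9 MB
R_effective = D / (T_position + D / R_peak)  # achieved rate: 90 MB/s
```
So reaching 90% of peak bandwidth on this hypothetical disk requires writing about 9 MB per positioning.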
- **The Inode Map**, Finding inodes
hl-page:: 583
ls-type:: annotation
id:: 643bb0f6-b84c-469e-8188-0db6e86f36e8
hl-color:: yellow
- The i-map is a structure that maps an inode number to the disk address of the most recent version of that inode
hl-page:: 583
ls-type:: annotation
id:: 643bb162-9f99-4740-8fd6-859f236c1855
hl-color:: yellow
- LFS places chunks of the ==inode map right next to the other new information==. For example, when appending a data block to a file, LFS actually writes the new data block, its inode, and a piece of the inode map all together.
- **The Checkpoint Region**
hl-page:: 585
ls-type:: annotation
id:: 643bb250-205a-4ae9-8cff-0d715cfa6b7d
hl-color:: yellow
- Contains pointers to the latest pieces of the inode map. Note that the checkpoint region is only updated periodically, so it does not hurt performance too much.
- The lookup process
- First consult the CR for the i-map (often cached in memory), then consult the i-map for the address of the directory's inode, read the directory to get the file's inode number, and finally consult the i-map again for the file's inode
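- As a toy sketch (dicts standing in for the CR, the i-map, and on-disk blocks; all field names are made up), looking up a file in a directory goes:
```python
# Illustrative LFS lookup path: CR -> i-map -> directory inode ->
# file inode number -> i-map again -> file inode.
def lookup(cr, disk, dir_ino, name):
    imap = cr["imap"]                      # CR points at the latest i-map
    dir_inode = disk[imap[dir_ino]]        # i-map: inode number -> address
    file_ino = dir_inode["entries"][name]  # directories hold inode NUMBERS
    return disk[imap[file_ino]]            # consult the i-map again
```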
- recursive update problem: Whenever an inode is updated, its location on disk changes. This would have also entailed an update to the directory that points to this file (change the pointer field, thus the directory needs to be written to a new location), which then would have mandated a change to the parent of that directory, and so on, all the way up the file system tree.
hl-page:: 586
ls-type:: annotation
id:: 643bb4de-bc1f-4f61-a5dd-036867e85fe7
hl-color:: yellow
- This is not a problem for LFS: the i-map sits between inode numbers and disk addresses, and directories store inode numbers rather than addresses, so even if an inode moves to a new location the directory need not change.
- Garbage Collection
ls-type:: annotation
hl-page:: 587
hl-color:: yellow
id:: 643bb6cb-61ae-4231-aaf4-d78f1b1a7851
- LFS leaves old versions of file structures scattered throughout the disk, though only the latest version is needed. Therefore, LFS has to periodically ==clean these old versions== of data and metadata.
- The LFS cleaner works on a ==segment-by-segment basis==: read in a number of old segments, collect the live blocks, write them out to a new set of segments, and finally free the old ones.
hl-page:: 588
ls-type:: annotation
id:: 643bb91c-62a0-4154-8d12-c9ae356a4fc7
hl-color:: yellow
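- A minimal sketch of that loop, with lists of blocks standing in for segments and a caller-supplied liveness test (names are illustrative):
```python
# Illustrative segment cleaner: keep only live blocks from the old
# segments, compact them into new segments of K blocks, free the old.
def clean(segments, is_live, K):
    live = [b for seg in segments for b in seg if is_live(b)]
    segments.clear()                     # old segments are now free
    return [live[i:i + K] for i in range(0, len(live), K)]
```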
- Determining Block Liveness
ls-type:: annotation
hl-page:: 588
hl-color:: yellow
id:: 643bb7e4-6d90-4d59-8930-29a243862288
- segment summary block: inode number and in-file offset of each data block
hl-page:: 588
ls-type:: annotation
id:: 643bba1e-4f7b-4291-bdd8-966dd366748c
hl-color:: yellow
- Pseudocode depiction
- ```python
# A: block address; N: inode number; T: offset within the file
(N, T) = SegmentSummary[A]
inode = Read(imap[N])      # latest version of inode N
if inode[T] == A:          # does the inode still point at A?
    return live
else:
    return dead
```
- **version number**: in some cases (e.g., a deleted file), LFS records the file's version number in the imap and in the summary block, and compares the two during GC to short-circuit the liveness check
hl-page:: 589
ls-type:: annotation
id:: 643bbc21-a025-4dd4-bdc2-4a1eb68abf5e
hl-color:: yellow
- Crash Recovery
ls-type:: annotation
hl-page:: 590
hl-color:: yellow
id:: 643bbd43-a342-4445-a808-b9800790a83c
- General write scheme
- LFS organizes its writes in a log: the CR points to the head and tail segments, and each segment points to the next segment to write. The CR is propagated to disk periodically.
- To make it clear, there is no separate "log" space on the disk similar to what journaling FSs do. The segments written to the disk are logs by themselves. See [Page 30, Figure 4-1, R92](https://www2.eecs.berkeley.edu/Pubs/TechRpts/1992/CSD-92-696.pdf)
- Checkpoint Region
- LFS keeps 2 CRs (one at each end of the disk) and writes to them alternately. When writing a CR, LFS first writes a header (with a timestamp), then the body, and finally one last block (with a timestamp). A crash mid-update thus shows up as inconsistent timestamps, and LFS uses the latest CR whose timestamps agree.
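- Picking the CR at reboot can be sketched as (field names are illustrative):
```python
# Illustrative sketch: a CR whose header and trailer timestamps differ
# was torn by a crash; use the newest CR whose two timestamps agree.
def choose_cr(cr_a, cr_b):
    intact = [cr for cr in (cr_a, cr_b) if cr["head_ts"] == cr["tail_ts"]]
    return max(intact, key=lambda cr: cr["head_ts"])
```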
- Roll Forward
hl-page:: 590
ls-type:: annotation
id:: 643bc080-2693-4a74-b261-56f92e3c75e4
hl-color:: yellow
- The basic idea is to start with the last checkpoint region, find the end of the log (included in the CR), and then use that to read through the next segments and see if there are any valid updates.
hl-page:: 590
ls-type:: annotation
id:: 643bc10e-be20-47b3-bab4-713493dd5153
hl-color:: yellow
- ## Flash-based SSDs
ls-type:: annotation
hl-page:: 595
hl-color:: yellow
id:: 643ba369-83df-42f9-9ee9-b45d4652e8fb
- ## Data Integrity and Protection
ls-type:: annotation
hl-page:: 619
hl-color:: yellow
id:: 643ba392-acd9-4255-930e-a97f94fb28ef
- spouse
ls-type:: annotation
hl-page:: 633
hl-color:: green
id:: 643ba3b2-5a2a-4589-a871-62ad213de195
- mandate
ls-type:: annotation
hl-page:: 586
hl-color:: green
id:: 643bb439-b7d0-4170-9417-cd900062bfbd
- entail
ls-type:: annotation
hl-page:: 586
hl-color:: green
id:: 643bb533-1fee-4b9a-9965-7a63016d5591
- ceremonious
ls-type:: annotation
hl-page:: 587
hl-color:: green
id:: 643bb6e8-6466-4574-aa04-4ea25b3e9034
- cease
ls-type:: annotation
hl-page:: 595
hl-color:: green
id:: 643bc4b4-dc22-471c-9229-558a42904cc8