=========================================================
This text is derived from the Japanese translation of
Linux-3.3/Documentation/filesystems/logfs.txt.
Translation project: JF Project < http://www.linux.or.jp/JF/ >
Date: 2012/5/31
Translator: Seiji Kaneko < skaneko at mbn dot or dot jp >
=========================================================

The LogFS Flash Filesystem
==========================

Specification
=============

Superblocks
-----------

Two superblocks exist, one at the beginning and one at the end of the
filesystem.  Each superblock is 256 bytes long, with another 3840
bytes reserved for future use, making a total of 4096 bytes.

Superblock locations may differ between MTD and block devices.  On
MTD, the first non-bad block contains a superblock in its first 4096
bytes and the last non-bad block contains a superblock in its last
4096 bytes.  On block devices, the first 4096 bytes of the device
contain the first superblock and the last aligned 4096-byte block
contains the second superblock.
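
The block-device placement can be sketched in a few lines.  This is
only an illustration under the sizes stated above; the function name
and the error handling are assumptions and are not taken from the
LogFS sources.

```python
SB_SIZE = 4096  # 256 bytes of superblock plus 3840 reserved bytes


def superblock_offsets(device_size):
    """Return the byte offsets of both superblocks on a block device.

    Hypothetical helper: the first superblock sits at offset 0, the
    second in the last 4096-byte aligned block that fits in the device.
    """
    if device_size < 2 * SB_SIZE:
        raise ValueError("device too small for two superblocks")
    first = 0
    # Round the device size down to a whole number of 4096-byte blocks
    # and take the last of those blocks.
    second = (device_size // SB_SIZE - 1) * SB_SIZE
    return first, second
```

On MTD the same idea applies per erase block, but the first and last
non-bad blocks have to be found first, which this sketch does not
attempt.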

For the most part, the superblocks can be considered read-only.  They
are written only to correct errors detected within the superblocks,
to move the journal, and to change filesystem parameters through
tunefs.  As a result, the superblock does not contain any fields that
require constant updates, such as the amount of free space.

Segments
--------

The space in the device is split up into equal-sized segments.
Segments are the primary write unit of LogFS.  Within each segment,
writes happen from front (low addresses) to back (high addresses).  If
only a partial segment has been written, the segment number, the
current position within it and optionally a write buffer are stored
in the journal.

Segments are erased as a whole.  Therefore, garbage collection may be
required to completely free a segment before erasing it.

Journal
-------

The journal contains all global information about the filesystem that
is subject to frequent change.  At mount time, it has to be scanned
for the most recent commit entry, which contains a list of pointers to
all currently valid entries.

Object Store
------------

All space except for the superblocks and the journal is part of the
object store.  Each segment contains a segment header and a number of
objects, each consisting of an object header and the payload.
Objects are either inodes, directory entries (dentries), file data
blocks or indirect blocks.

Levels
------

Garbage collection (GC) may fail if all data is written
indiscriminately.  One requirement of GC is that data is separated
roughly according to the distance between the tree root and the data.
Effectively, that means all file data is on level 0, and indirect
blocks are on levels 1, 2, 3, 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect
blocks, respectively.  Inode file data is on level 6 for the inodes
and on levels 7-11 for indirect blocks.
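
The level numbering can be written down as a small lookup.  This is a
sketch, not LogFS code: the function name is made up, and the mapping
of levels 7-11 to 1x-5x indirect blocks of the inode file is an
assumption made by analogy with levels 1-5.

```python
def level_description(level):
    """Describe what kind of object lives on a given GC level."""
    if level == 0:
        return "file data"
    if 1 <= level <= 5:
        return f"{level}x indirect block"
    if level == 6:
        return "inode"
    if 7 <= level <= 11:
        # Assumed: level 7 is the 1x indirect block of the inode file,
        # level 8 the 2x indirect block, and so on.
        return f"{level - 6}x indirect block of the inode file"
    raise ValueError(f"no such level: {level}")
```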

Each segment contains objects of a single level only.  As a result,
each level requires its own separate segment to be open for writing.

Inode File
----------

All inodes are stored in a special file, the inode file.  The single
exception is the inode file's own inode (the master inode), which for
obvious reasons is stored in the journal instead.  Instead of data
blocks, the leaf nodes of the inode file are inodes.

Aliases
-------

Writes in LogFS are done by means of a wandering tree.  A naive
implementation would require that for each write of a block, all
parent blocks are written as well, since the block pointers have
changed.  Such an implementation would not be very efficient.

In LogFS, the block pointer changes are cached in the journal by means
of alias entries.  Each alias consists of its logical address (inode
number, block index, level and child number, the latter being an
index into the block) and the changed data.  Any 8-byte word can be
changed in this manner.
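
As an illustration, an alias entry can be modeled as a small record
that patches one 8-byte word of a block.  The field and function
names below are hypothetical and do not mirror the on-medium format;
only the four address components and the 8-byte word size come from
the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alias:
    ino: int       # inode number
    bix: int       # block index
    level: int     # level of the block being patched
    child_no: int  # index of the 8-byte word within the block
    value: bytes   # the changed data, exactly 8 bytes


def apply_alias(block, alias):
    """Overwrite one 8-byte word of a block (a bytearray) in place."""
    if len(alias.value) != 8:
        raise ValueError("an alias carries exactly one 8-byte word")
    offset = alias.child_no * 8
    if offset + 8 > len(block):
        raise IndexError("child number outside of block")
    block[offset:offset + 8] = alias.value
```

Caching such records in the journal avoids rewriting every parent
block for each change, which is the inefficiency of the naive
wandering tree.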

Currently, aliases are used for block pointers, file size, file used
bytes and the height of an inode's indirect tree.

Segment Aliases
---------------

Related to regular aliases, these are used to handle bad blocks.
Initially, bad blocks are handled by moving the affected segment
content to a spare segment and noting this move in the journal with a
segment alias, a simple (to, from) tuple.  GC will later empty this
segment and the alias can be removed again.  This is used on MTD only.

Vim
---

By cleverly predicting the lifetime of data, it is possible to
separate long-living data from short-living data and thereby reduce
the GC overhead later.  Each type of distinct life expectancy (vim)
can have a separate segment open for writing.  Each (level, vim) tuple
can be open just once.  If an open segment with an unknown vim is
encountered at mount time, it is closed and ignored henceforth.

Indirect Tree
-------------

Inodes in LogFS are similar to those of FFS-style filesystems, with
direct and indirect block pointers.  One difference is that LogFS uses
a single indirect pointer that can be either a 1x, 2x, etc. indirect
pointer.  A height field in the inode defines the height of the
indirect tree and thereby the indirection of the pointer.

Another difference is the addressing of indirect blocks.  In LogFS,
the first 16 pointers in the first indirect block are left empty,
corresponding to the 16 direct pointers in the inode.  In ext2 (and
maybe others as well) the first pointer in the first indirect block
corresponds to logical block 12, skipping the 12 direct pointers.
So where ext2 uses arithmetic to better utilize space, LogFS keeps
the arithmetic simple and uses compression to save space.
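
The difference in arithmetic is easiest to see side by side.  The two
helpers below are hypothetical and only cover logical blocks that are
reached through the first indirect block; they are a sketch of the
addressing rule, not code from either filesystem.

```python
LOGFS_DIRECT = 16  # direct pointers in a LogFS inode
EXT2_DIRECT = 12   # direct pointers in an ext2 inode


def logfs_slot(bix):
    """Slot in the first indirect block for logical block bix (LogFS)."""
    if bix < LOGFS_DIRECT:
        raise ValueError("block is reached through a direct pointer")
    # No subtraction: slots 0-15 simply stay empty.
    return bix


def ext2_slot(bix):
    """Slot in the first indirect block for logical block bix (ext2)."""
    if bix < EXT2_DIRECT:
        raise ValueError("block is reached through a direct pointer")
    # The direct pointers are skipped, so slot 0 is logical block 12.
    return bix - EXT2_DIRECT
```

The 16 unused slots in LogFS cost little in practice because indirect
blocks are compressed.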

Compression
-----------

Both file data and metadata can be compressed.  Compression for file
data can be enabled with chattr +c and disabled with chattr -c.  Doing
so has no effect on existing data, but new data will be stored
accordingly.  New inodes will inherit the compression flag of the
parent directory.

Metadata is always compressed.  However, the space accounting ignores
this and charges for the uncompressed size.  Failing to do so could
result in GC failures when, after moving some data, indirect blocks
compress worse than previously.  Even on a 100% full medium, GC must
not consume any extra space, so the compression gains are lost space
to the user.

However, they are not lost space to the filesystem internals.  By
cheating the user out of those bytes, the filesystem gains some slack
space, and GC will run less often and faster.

Garbage Collection and Wear Leveling
------------------------------------

Garbage collection is invoked whenever the number of free segments
falls below a threshold.  The best (known) candidate is picked based
on the least amount of valid data contained in the segment.  All
remaining valid data is copied elsewhere, thereby invalidating the
old copy in the segment.
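
A minimal sketch of this selection policy follows.  The data
structure (a dict from segment number to bytes of valid data), the
function names and the tie-breaking rule are assumptions made for the
example; the actual threshold is not given in this document.

```python
def needs_gc(free_segments, threshold):
    """GC is invoked when the free segment count drops below a threshold."""
    return free_segments < threshold


def pick_gc_victim(valid_bytes):
    """Pick the segment holding the least valid data.

    valid_bytes maps segment number -> bytes of still-valid data.
    Ties are broken by the lower segment number (an arbitrary choice).
    """
    if not valid_bytes:
        return None
    return min(valid_bytes, key=lambda seg: (valid_bytes[seg], seg))
```

Picking the segment with the least valid data minimizes the amount
that has to be copied before the segment can be erased.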

The GC code also checks for aliases and writes them back if their
number gets too large.

Wear leveling is done by occasionally picking a suboptimal segment for
garbage collection.  If a stale segment's erase count is significantly
lower than the active segments' erase counts, it will be picked.  Wear
leveling is rate limited, so it will never monopolize the device for
more than one segment's worth of work at a time.

The values for "occasionally" and "significantly lower" are
compile-time constants.

Hashed directories
------------------

To support an efficient lookup(), directory entries are hashed and
located based on the hash.  In order to both support large directories
and not be overly inefficient for small directories, several hash
tables of increasing size are used.  For each table, the hash value
modulo the table size gives the table index.

Table sizes are chosen to limit the number of indirect blocks with a
fully populated table to 0, 1, 2 or 3 respectively.  So the first
table contains 16 entries, the second 512-16, etc.
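
The indexing rule can be sketched as follows.  Only the first two
table sizes are stated in the text, so the list below stops there;
the function name and the return format are assumptions made for the
example.

```python
# Only the two sizes named in the text; larger tables are omitted
# because their sizes are not spelled out here.
TABLE_SIZES = [16, 512 - 16]


def table_slots(hash_value, sizes=TABLE_SIZES):
    """Return (table number, index) candidates for a hash value.

    Each table is indexed by the hash value modulo that table's size.
    """
    return [(table, hash_value % size) for table, size in enumerate(sizes)]
```

A lookup would probe the candidate slot in each table in turn, so
small directories are resolved within the first, small table.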

The last table is special in several ways.  First, its size depends on
the effective 32bit limit on telldir/seekdir cookies.  Since logfs
uses the upper half of the address space for indirect blocks, the size
is limited to 2^31.  Secondly, the table contains hash buckets with 16
entries each.

Using single-entry buckets would result in birthday "attacks".  At
just 2^16 used entries, hash collisions would be likely (P >= 0.5).
My math skills are insufficient to do the combinatorics for the 17x
collisions necessary to overflow a bucket, but testing showed that in
10,000 runs the lowest directory fill before a bucket overflow was
188,057,130 entries, with an average of 315,149,915 entries.  So for
directory sizes of up to a million, bucket overflows should be
virtually impossible under normal circumstances.
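
The P >= 0.5 figure for single-entry buckets can be checked with the
usual birthday approximation P ~ 1 - exp(-n(n-1)/(2N)).  The sketch
below assumes N = 2^31 slots (the table size limit mentioned earlier)
and uniformly distributed hashes; it says nothing about the 17-way
collisions needed to overflow a 16-entry bucket.

```python
import math


def collision_probability(entries, slots):
    """Birthday approximation for at least one collision."""
    return 1.0 - math.exp(-entries * (entries - 1) / (2.0 * slots))


# 2^16 entries hashed into 2^31 single-entry buckets
p = collision_probability(2 ** 16, 2 ** 31)
```

With these inputs the exponent is very close to -1, so p comes out at
roughly 0.63, consistent with the claim above.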

With carefully chosen filenames, it is obviously possible to cause an
overflow with just 21 entries (4 higher tables + 16 entries + 1).  So
there may be a security concern if a malicious user has write access
to a directory.

Open For Discussion
===================

Device Address Space
--------------------

A device address space is used for caching.  Both block devices and
MTD provide functions to either read a single page or write a segment.
Partial segments may be written for data integrity, but where possible
complete segments are written for performance on simple block device
flash media.

Meta Inodes
-----------

Inodes are stored in the inode file, which is just a regular file for
most purposes.  At umount time, however, the inode file needs to
remain open until all dirty inodes are written.  So
generic_shutdown_super() may not close this inode, but shouldn't
complain about remaining inodes due to the inode file either.  The
same goes for the mapping inode of the device address space.

Currently, logfs uses a hack that essentially copies over part of the
fs/inode.c code.  A general solution would be preferred.

Indirect block mapping
----------------------

With compression, the block device (or mapping inode) cannot be used
to cache indirect blocks.  Some other place is required.  Currently
logfs uses the top half of each inode's address space.  The low 8TB
(on 32bit) are filled with file data, the high 8TB are used for
indirect blocks.
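
The 8TB figures follow from simple arithmetic if one assumes a 32-bit
page index and 4096-byte pages.  Both values are assumptions made for
this sketch (the text only says "on 32bit"), as are the names used
below.

```python
PAGE_SIZE = 4096   # assumed page size in bytes
INDEX_BITS = 32    # assumed width of the page index on 32bit

ADDRESS_SPACE = (1 << INDEX_BITS) * PAGE_SIZE  # bytes addressable per inode
HALF = ADDRESS_SPACE // 2                      # boundary between the halves


def is_indirect_offset(offset):
    """True if a byte offset falls in the half used for indirect blocks."""
    if not 0 <= offset < ADDRESS_SPACE:
        raise ValueError("offset outside of the inode's address space")
    return offset >= HALF
```

Under these assumptions the address space is 16TB, split into a low
8TB for file data and a high 8TB for cached indirect blocks.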

One problem is that 16TB files created on 64bit systems actually have
data in the top 8TB.  But files >16TB would cause problems anyway, so
only the limit has changed.

