| Introduction | 
 | ============ | 
 |  | 
 | dm-cache is a device mapper target written by Joe Thornber, Heinz | 
 | Mauelshagen, and Mike Snitzer. | 
 |  | 
 | It aims to improve performance of a block device (eg, a spindle) by | 
 | dynamically migrating some of its data to a faster, smaller device | 
 | (eg, an SSD). | 
 |  | 
 | This device-mapper solution allows us to insert this caching at | 
 | different levels of the dm stack, for instance above the data device for | 
 | a thin-provisioning pool.  Caching solutions that are integrated more | 
 | closely with the virtual memory system should give better performance. | 
 |  | 
 | The target reuses the metadata library used in the thin-provisioning | 
 | library. | 
 |  | 
 | The decision as to what data to migrate and when is left to a plug-in | 
 | policy module.  Several of these have been written as we experiment, | 
 | and we hope other people will contribute others for specific io | 
 | scenarios (eg. a vm image server). | 
 |  | 
 | Glossary | 
 | ======== | 
 |  | 
 |   Migration -  Movement of the primary copy of a logical block from one | 
 | 	       device to the other. | 
 |   Promotion -  Migration from slow device to fast device. | 
 |   Demotion  -  Migration from fast device to slow device. | 
 |  | 
 | The origin device always contains a copy of the logical block, which | 
 | may be out of date or kept in sync with the copy on the cache device | 
 | (depending on policy). | 
 |  | 
 | Design | 
 | ====== | 
 |  | 
 | Sub-devices | 
 | ----------- | 
 |  | 
 | The target is constructed by passing three devices to it (along with | 
 | other parameters detailed later): | 
 |  | 
 | 1. An origin device - the big, slow one. | 
 |  | 
 | 2. A cache device - the small, fast one. | 
 |  | 
 | 3. A small metadata device - records which blocks are in the cache, | 
 |    which are dirty, and extra hints for use by the policy object. | 
 |    This information could be put on the cache device, but having it | 
 |    separate allows the volume manager to configure it differently, | 
 |    e.g. as a mirror for extra robustness.  This metadata device may only | 
 |    be used by a single cache device. | 
 |  | 
 | Fixed block size | 
 | ---------------- | 
 |  | 
 | The origin is divided up into blocks of a fixed size.  This block size | 
 | is configurable when you first create the cache.  Typically we've been | 
 | using block sizes of 256KB - 1024KB.  The block size must be between 64 | 
 | (32KB) and 2097152 (1GB) and a multiple of 64 (32KB). | 
 |  | 
 | Having a fixed block size simplifies the target a lot.  But it is | 
 | something of a compromise.  For instance, a small part of a block may be | 
 | getting hit a lot, yet the whole block will be promoted to the cache. | 
 | So large block sizes are bad because they waste cache space.  And small | 
 | block sizes are bad because they increase the amount of metadata (both | 
 | in core and on disk). | 
 |  | 
 | Writeback/writethrough | 
 | ---------------------- | 
 |  | 
 | The cache has two modes, writeback and writethrough. | 
 |  | 
 | If writeback, the default, is selected then a write to a block that is | 
 | cached will go only to the cache and the block will be marked dirty in | 
 | the metadata. | 
 |  | 
 | If writethrough is selected then a write to a cached block will not | 
 | complete until it has hit both the origin and cache devices.  Clean | 
 | blocks should remain clean. | 
 |  | 
 | A simple cleaner policy is provided, which will clean (write back) all | 
 | dirty blocks in a cache.  Useful for decommissioning a cache. | 
 |  | 
 | Migration throttling | 
 | -------------------- | 
 |  | 
 | Migrating data between the origin and cache device uses bandwidth. | 
 | The user can set a throttle to prevent more than a certain amount of | 
 | migration occurring at any one time.  Currently we're not taking any | 
 | account of normal io traffic going to the devices.  More work needs | 
 | doing here to avoid migrating during those peak io moments. | 
 |  | 
 | For the time being, a message "migration_threshold <#sectors>" | 
 | can be used to set the maximum number of sectors being migrated, | 
 | the default being 204800 sectors (or 100MB). | 
 |  | 
 | Updating on-disk metadata | 
 | ------------------------- | 
 |  | 
 | On-disk metadata is committed every time a REQ_SYNC or REQ_FUA bio is | 
 | written.  If no such requests are made then commits will occur every | 
 | second.  This means the cache behaves like a physical disk that has a | 
 | write cache (the same is true of the thin-provisioning target).  If | 
 | power is lost you may lose some recent writes.  The metadata should | 
 | always be consistent in spite of any crash. | 
 |  | 
 | The 'dirty' state for a cache block changes far too frequently for us | 
 | to keep updating it on the fly.  So we treat it as a hint.  In normal | 
 | operation it will be written when the dm device is suspended.  If the | 
 | system crashes all cache blocks will be assumed dirty when restarted. | 
 |  | 
 | Per-block policy hints | 
 | ---------------------- | 
 |  | 
 | Policy plug-ins can store a chunk of data per cache block.  It's up to | 
 | the policy how big this chunk is, but it should be kept small.  Like the | 
 | dirty flags this data is lost if there's a crash so a safe fallback | 
 | value should always be possible. | 
 |  | 
 | For instance, the 'mq' policy, which is currently the default policy, | 
 | uses this facility to store the hit count of the cache blocks.  If | 
 | there's a crash this information will be lost, which means the cache | 
 | may be less efficient until those hit counts are regenerated. | 
 |  | 
 | Policy hints affect performance, not correctness. | 
 |  | 
 | Policy messaging | 
 | ---------------- | 
 |  | 
 | Policies will have different tunables, specific to each one, so we | 
 | need a generic way of getting and setting these.  Device-mapper | 
 | messages are used.  Refer to cache-policies.txt. | 
 |  | 
 | Discard bitset resolution | 
 | ------------------------- | 
 |  | 
 | We can avoid copying data during migration if we know the block has | 
 | been discarded.  A prime example of this is when mkfs discards the | 
 | whole block device.  We store a bitset tracking the discard state of | 
 | blocks.  However, we allow this bitset to have a different block size | 
 | from the cache blocks.  This is because we need to track the discard | 
 | state for all of the origin device (compare with the dirty bitset | 
 | which is just for the smaller cache device). | 
 |  | 
 | Target interface | 
 | ================ | 
 |  | 
 | Constructor | 
 | ----------- | 
 |  | 
 |  cache <metadata dev> <cache dev> <origin dev> <block size> | 
 |        <#feature args> [<feature arg>]* | 
 |        <policy> <#policy args> [policy args]* | 
 |  | 
 |  metadata dev    : fast device holding the persistent metadata | 
 |  cache dev	 : fast device holding cached data blocks | 
 |  origin dev	 : slow device holding original data blocks | 
 |  block size      : cache unit size in sectors | 
 |  | 
 |  #feature args   : number of feature arguments passed | 
 |  feature args    : writethrough.  (The default is writeback.) | 
 |  | 
 |  policy          : the replacement policy to use | 
 |  #policy args    : an even number of arguments corresponding to | 
 |                    key/value pairs passed to the policy | 
 |  policy args     : key/value pairs passed to the policy | 
 | 		   E.g. 'sequential_threshold 1024' | 
 | 		   See cache-policies.txt for details. | 
 |  | 
 | Optional feature arguments are: | 
 |    writethrough  : write through caching that prohibits cache block | 
 | 		   content from being different from origin block content. | 
 | 		   Without this argument, the default behaviour is to write | 
 | 		   back cache block contents later for performance reasons, | 
 | 		   so they may differ from the corresponding origin blocks. | 
 |  | 
 | A policy called 'default' is always registered.  This is an alias for | 
 | the policy we currently think is giving best all round performance. | 
 |  | 
 | As the default policy could vary between kernels, if you are relying on | 
 | the characteristics of a specific policy, always request it by name. | 
 |  | 
 | Status | 
 | ------ | 
 |  | 
 | <#used metadata blocks>/<#total metadata blocks> <#read hits> <#read misses> | 
 | <#write hits> <#write misses> <#demotions> <#promotions> <#blocks in cache> | 
 | <#dirty> <#features> <features>* <#core args> <core args>* <#policy args> | 
 | <policy args>* | 
 |  | 
 | #used metadata blocks    : Number of metadata blocks used | 
 | #total metadata blocks   : Total number of metadata blocks | 
 | #read hits               : Number of times a READ bio has been mapped | 
 | 			     to the cache | 
 | #read misses             : Number of times a READ bio has been mapped | 
 | 			     to the origin | 
 | #write hits              : Number of times a WRITE bio has been mapped | 
 | 			     to the cache | 
 | #write misses            : Number of times a WRITE bio has been | 
 | 			     mapped to the origin | 
 | #demotions               : Number of times a block has been removed | 
 | 			     from the cache | 
 | #promotions              : Number of times a block has been moved to | 
 | 			     the cache | 
 | #blocks in cache         : Number of blocks resident in the cache | 
 | #dirty                   : Number of blocks in the cache that differ | 
 | 			     from the origin | 
 | #feature args            : Number of feature args to follow | 
 | feature args             : 'writethrough' (optional) | 
 | #core args               : Number of core arguments (must be even) | 
 | core args                : Key/value pairs for tuning the core | 
 | 			     e.g. migration_threshold | 
 | #policy args             : Number of policy arguments to follow (must be even) | 
 | policy args              : Key/value pairs | 
 | 			     e.g. 'sequential_threshold 1024 | 
 |  | 
 | Messages | 
 | -------- | 
 |  | 
 | Policies will have different tunables, specific to each one, so we | 
 | need a generic way of getting and setting these.  Device-mapper | 
 | messages are used.  (A sysfs interface would also be possible.) | 
 |  | 
 | The message format is: | 
 |  | 
 |    <key> <value> | 
 |  | 
 | E.g. | 
 |    dmsetup message my_cache 0 sequential_threshold 1024 | 
 |  | 
 | Examples | 
 | ======== | 
 |  | 
 | The test suite can be found here: | 
 |  | 
 | https://github.com/jthornber/thinp-test-suite | 
 |  | 
 | dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \ | 
 | 	/dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0' | 
 | dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \ | 
 | 	/dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \ | 
 | 	mq 4 sequential_threshold 1024 random_threshold 8' |