|  | .. SPDX-License-Identifier: GPL-2.0-only | 
|  |  | 
|  | ======== | 
|  | dm-clone | 
|  | ======== | 
|  |  | 
|  | Introduction | 
|  | ============ | 
|  |  | 
|  | dm-clone is a device mapper target which produces a one-to-one copy of an | 
|  | existing, read-only source device into a writable destination device: It | 
|  | presents a virtual block device which makes all data appear immediately, and | 
|  | redirects reads and writes accordingly. | 
|  |  | 
|  | The main use case of dm-clone is to clone a potentially remote, high-latency, | 
|  | read-only, archival-type block device into a writable, fast, primary-type device | 
|  | for fast, low-latency I/O. The cloned device is visible/mountable immediately | 
|  | and the copy of the source device to the destination device happens in the | 
|  | background, in parallel with user I/O. | 
|  |  | 
|  | For example, one could restore an application backup from a read-only copy, | 
|  | accessible through a network storage protocol (NBD, Fibre Channel, iSCSI, AoE, | 
|  | etc.), into a local SSD or NVMe device, and start using the device immediately, | 
|  | without waiting for the restore to complete. | 
|  |  | 
|  | When the cloning completes, the dm-clone table can be removed altogether and be | 
|  | replaced, e.g., by a linear table, mapping directly to the destination device. | 
|  |  | 
|  | The dm-clone target reuses the metadata library used by the thin-provisioning | 
|  | target. | 
|  |  | 
|  | Glossary | 
|  | ======== | 
|  |  | 
|  | Hydration | 
|  |     The process of filling a region of the destination device with data from | 
|  |     the same region of the source device, i.e., copying the region from the | 
|  |     source to the destination device. | 
|  |  | 
|  | Once a region gets hydrated we redirect all I/O regarding it to the destination | 
|  | device. | 
|  |  | 
|  | Design | 
|  | ====== | 
|  |  | 
|  | Sub-devices | 
|  | ----------- | 
|  |  | 
|  | The target is constructed by passing three devices to it (along with other | 
|  | parameters detailed later): | 
|  |  | 
|  | 1. A source device - the read-only device that gets cloned and source of the | 
|  |    hydration. | 
|  |  | 
|  | 2. A destination device - the destination of the hydration, which will become a | 
|  |    clone of the source device. | 
|  |  | 
|  | 3. A small metadata device - it records which regions are already valid in the | 
|  |    destination device, i.e., which regions have already been hydrated, or have | 
|  |    been written to directly, via user I/O. | 
|  |  | 
|  | The size of the destination device must be at least equal to the size of the | 
|  | source device. | 
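
The size requirement can be checked before the table is built. The following is a minimal sketch, assuming the util-linux `blockdev` tool is available. The helper name `dest_fits_source` is hypothetical and not part of dm-clone.

```shell
# Hypothetical helper: succeed only if the destination device is at least as
# large as the source device. `blockdev --getsz` reports the device size in
# 512-byte sectors.
# Arguments: <source dev> <destination dev>
dest_fits_source() {
    src_sectors=$(blockdev --getsz "$1") || return 1
    dst_sectors=$(blockdev --getsz "$2") || return 1
    [ "$dst_sectors" -ge "$src_sectors" ]
}
```

For example, `dest_fits_source "$source_dev" "$dest_dev" || echo "destination too small"`.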
|  |  | 
|  | Regions | 
|  | ------- | 
|  |  | 
|  | dm-clone divides the source and destination devices into fixed-size regions. | 
|  | Regions are the unit of hydration, i.e., the minimum amount of data copied from | 
|  | the source to the destination device. | 
|  |  | 
|  | The region size is configurable when you first create the dm-clone device. The | 
|  | recommended region size is the same as the file system block size, which is | 
|  | usually 4KB. The region size must be a power of two between 8 sectors (4KB) and | 
|  | 2097152 sectors (1GB), inclusive. | 
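
The region size constraints can be checked up front. This is a small sketch with a hypothetical helper name, treating both limits as inclusive (an assumption about the bounds stated above):

```shell
# Hypothetical helper: validate a candidate region size given in 512-byte
# sectors. Succeeds only for a power of two in the range [8, 2097152].
is_valid_region_size() {
    size=$1
    # Accept plain decimal integers only (no sign, no leading zero).
    case $size in
        ''|*[!0-9]*|0*) return 1 ;;
    esac
    [ "$size" -ge 8 ] && [ "$size" -le 2097152 ] || return 1
    # A power of two has exactly one bit set, so size & (size - 1) is 0.
    [ $(( size & (size - 1) )) -eq 0 ]
}
```

With this helper, `is_valid_region_size 8` succeeds, while `is_valid_region_size 12` fails because 12 is not a power of two.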
|  |  | 
|  | Reads and writes from/to hydrated regions are serviced from the destination | 
|  | device. | 
|  |  | 
|  | A read to a not yet hydrated region is serviced directly from the source device. | 
|  |  | 
|  | A write to a not yet hydrated region triggers the hydration of that region | 
|  | immediately, and the write is delayed until the hydration completes. | 
|  |  | 
|  | Note that a write request with size equal to region size will skip copying of | 
|  | the corresponding region from the source device and overwrite the region of the | 
|  | destination device directly. | 
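
The routing rules above can be summarised as a small decision function. This is only a toy model for illustration, not the kernel implementation. The function name and its arguments are hypothetical.

```shell
# Toy model of the routing rules described above.
# Arguments: <op: read|write> <hydrated: 0|1> <request sectors> <region sectors>
route_io() {
    op=$1 hydrated=$2 request_sectors=$3 region_sectors=$4
    if [ "$hydrated" -eq 1 ]; then
        # Hydrated region: reads and writes go to the destination device.
        echo "destination"
    elif [ "$op" = read ]; then
        # Not yet hydrated: reads are serviced from the source device.
        echo "source"
    elif [ "$request_sectors" -eq "$region_sectors" ]; then
        # A write as large as the region skips the copy from the source.
        echo "overwrite destination"
    else
        # Smaller write: hydrate the region first, then apply the write.
        echo "hydrate, then destination"
    fi
}
```

For example, `route_io write 0 8 8` prints `overwrite destination`, whereas `route_io write 0 2 8` prints `hydrate, then destination`.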
|  |  | 
|  | Discards | 
|  | -------- | 
|  |  | 
|  | dm-clone interprets a discard request to a range that hasn't been hydrated yet | 
|  | as a hint to skip hydration of the regions covered by the request, i.e., it | 
|  | skips copying the region's data from the source to the destination device, and | 
|  | only updates its metadata. | 
|  |  | 
|  | If the destination device supports discards, then by default dm-clone will pass | 
|  | down discard requests to it. | 
|  |  | 
|  | Background Hydration | 
|  | -------------------- | 
|  |  | 
|  | dm-clone copies continuously from the source to the destination device, until | 
|  | all of the device has been copied. | 
|  |  | 
|  | Copying data from the source to the destination device uses bandwidth. The user | 
|  | can set a throttle to prevent more than a certain amount of copying occurring at | 
|  | any one time. Moreover, dm-clone takes into account user I/O traffic going to | 
|  | the devices and pauses the background hydration when there is I/O in-flight. | 
|  |  | 
|  | A message `hydration_threshold <#regions>` can be used to set the maximum number | 
|  | of regions being copied at any one time; the default is 1 region. | 
|  |  | 
|  | dm-clone employs dm-kcopyd for copying portions of the source device to the | 
|  | destination device. By default, we issue copy requests of size equal to the | 
|  | region size. A message `hydration_batch_size <#regions>` can be used to tune the | 
|  | size of these copy requests. Increasing the hydration batch size results in | 
|  | dm-clone trying to batch together contiguous regions, so we copy the data in | 
|  | batches of this many regions. | 
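
To get a feel for how the batch size scales the copy requests, the upper bound on a single request is the batch size multiplied by the region size. This is a rough sketch, assuming 512-byte sectors and a hypothetical helper name. Actual requests may be smaller when contiguous regions are not available.

```shell
# Hypothetical helper: upper bound, in bytes, of one copy request.
# Arguments: <region size in sectors> <hydration batch size in regions>
max_copy_bytes() {
    region_sectors=$1 batch_regions=$2
    echo $(( region_sectors * batch_regions * 512 ))
}
```

For example, `max_copy_bytes 8 256` prints `1048576`: with 4KB regions, a batch size of 256 allows copy requests of up to 1MB.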
|  |  | 
|  | When the hydration of the destination device finishes, a dm event will be sent | 
|  | to user space. | 
|  |  | 
|  | Updating on-disk metadata | 
|  | ------------------------- | 
|  |  | 
|  | On-disk metadata is committed every time a FLUSH or FUA bio is written. If no | 
|  | such requests are made then commits will occur every second. This means the | 
|  | dm-clone device behaves like a physical disk that has a volatile write cache. If | 
|  | power is lost you may lose some recent writes. The metadata should always be | 
|  | consistent in spite of any crash. | 
|  |  | 
|  | Target Interface | 
|  | ================ | 
|  |  | 
|  | Constructor | 
|  | ----------- | 
|  |  | 
|  | :: | 
|  |  | 
|  |    clone <metadata dev> <destination dev> <source dev> <region size> | 
|  |        [<#feature args> [<feature arg>]* [<#core args> [<core arg>]*]] | 
|  |  | 
|  | ================ ============================================================== | 
|  | metadata dev     Fast device holding the persistent metadata | 
|  | destination dev  The destination device, where the source will be cloned | 
|  | source dev       Read-only device containing the data that gets cloned | 
|  | region size      The size of a region in sectors | 
|  |  | 
|  | #feature args    Number of feature arguments passed | 
|  | feature args     no_hydration or no_discard_passdown | 
|  |  | 
|  | #core args       An even number of arguments corresponding to key/value pairs | 
|  |                  passed to dm-clone | 
|  | core args        Key/value pairs passed to dm-clone, e.g. `hydration_threshold | 
|  |                  256` | 
|  | ================ ============================================================== | 
|  |  | 
|  | Optional feature arguments are: | 
|  |  | 
|  | ==================== ========================================================= | 
|  | no_hydration         Create a dm-clone instance with background hydration | 
|  |                      disabled | 
|  | no_discard_passdown  Disable passing down discards to the destination device | 
|  | ==================== ========================================================= | 
|  |  | 
|  | Optional core arguments are: | 
|  |  | 
|  | ================================ ============================================== | 
|  | hydration_threshold <#regions>   Maximum number of regions being copied from | 
|  |                                  the source to the destination device at any | 
|  |                                  one time, during background hydration. | 
|  | hydration_batch_size <#regions>  During background hydration, try to batch | 
|  |                                  together contiguous regions, so we copy data | 
|  |                                  from the source to the destination device in | 
|  |                                  batches of this many regions. | 
|  | ================================ ============================================== | 
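
The argument counts in the constructor line are easy to get wrong by hand, so they can be computed. The following is a minimal sketch of a table builder, assuming a single target that starts at sector 0. The helper name `clone_table` is hypothetical.

```shell
# Hypothetical helper: print a dm-clone table line.
# Arguments: <length in sectors> <metadata dev> <destination dev> <source dev>
#            <region size> [<feature args>] [<core args>]
# The last two are space-separated strings and may be empty or omitted.
clone_table() {
    length=$1 metadata_dev=$2 dest_dev=$3 source_dev=$4 region_size=$5
    features=${6:-} core=${7:-}

    # Count the words in each optional group.
    set -- $features; nr_features=$#
    set -- $core; nr_core=$#

    # Core arguments are key/value pairs, so their count must be even.
    [ $(( nr_core % 2 )) -eq 0 ] || return 1

    table="0 $length clone $metadata_dev $dest_dev $source_dev $region_size"
    if [ "$nr_features" -gt 0 ] || [ "$nr_core" -gt 0 ]; then
        # The feature count must be present (even if 0) when core args follow.
        table="$table $nr_features${features:+ $features}"
    fi
    if [ "$nr_core" -gt 0 ]; then
        table="$table $nr_core $core"
    fi
    printf '%s\n' "$table"
}
```

A possible use is `dmsetup create clone --table "$(clone_table 1048576000 "$metadata_dev" "$dest_dev" "$source_dev" 8 no_hydration)"`.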
|  |  | 
|  | Status | 
|  | ------ | 
|  |  | 
|  | :: | 
|  |  | 
|  |    <metadata block size> <#used metadata blocks>/<#total metadata blocks> | 
|  |    <region size> <#hydrated regions>/<#total regions> <#hydrating regions> | 
|  |    <#feature args> <feature args>* <#core args> <core args>* | 
|  |    <clone metadata mode> | 
|  |  | 
|  | ======================= ======================================================= | 
|  | metadata block size     Fixed block size for each metadata block in sectors | 
|  | #used metadata blocks   Number of metadata blocks used | 
|  | #total metadata blocks  Total number of metadata blocks | 
|  | region size             Configurable region size for the device in sectors | 
|  | #hydrated regions       Number of regions that have finished hydrating | 
|  | #total regions          Total number of regions to hydrate | 
|  | #hydrating regions      Number of regions currently hydrating | 
|  | #feature args           Number of feature arguments to follow | 
|  | feature args            Feature arguments, e.g. `no_hydration` | 
|  | #core args              Even number of core arguments to follow | 
|  | core args               Key/value pairs for tuning the core, e.g. | 
|  |                         `hydration_threshold 256` | 
|  | clone metadata mode     ro if read-only, rw if read-write | 
|  |  | 
|  |                         In serious cases where even a read-only mode is deemed | 
|  |                         unsafe no further I/O will be permitted and the status | 
|  |                         will just contain the string 'Fail'. If the metadata | 
|  |                         mode changes, a dm event will be sent to user space. | 
|  | ======================= ======================================================= | 
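
The hydration progress can be extracted from the status line with standard tools. This is a small sketch, assuming the line comes from `dmsetup status`, which prefixes the target status with `<start> <length> <target type>`. The helper name is hypothetical.

```shell
# Hypothetical helper: read one `dmsetup status` line of a clone target on
# stdin and print "<#hydrated regions> <#total regions> <#hydrating regions>".
# With the three-field dmsetup prefix, <#hydrated regions>/<#total regions>
# is field 7 and <#hydrating regions> is field 8. Lines that are not a full
# clone status (e.g. a status of 'Fail') produce no output.
hydration_counts() {
    awk '$3 == "clone" && NF >= 8 {
        split($7, regions, "/")
        printf "%d %d %d\n", regions[1], regions[2], $8
    }'
}
```

For example, `dmsetup status clone | hydration_counts`.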
|  |  | 
|  | Messages | 
|  | -------- | 
|  |  | 
|  | `disable_hydration` | 
|  |     Disable the background hydration of the destination device. | 
|  |  | 
|  | `enable_hydration` | 
|  |     Enable the background hydration of the destination device. | 
|  |  | 
|  | `hydration_threshold <#regions>` | 
|  |     Set background hydration threshold. | 
|  |  | 
|  | `hydration_batch_size <#regions>` | 
|  |     Set background hydration batch size. | 
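
Messages are sent with `dmsetup message <name> <sector> <message>`. The wrapper below is a minimal sketch that sets both tuning parameters on a running device. Its name is hypothetical, and the values in the usage line are examples rather than recommendations.

```shell
# Hypothetical wrapper around the two tuning messages.
# Arguments: <dm device name> <hydration threshold> <hydration batch size>
tune_hydration() {
    dmsetup message "$1" 0 hydration_threshold "$2" &&
    dmsetup message "$1" 0 hydration_batch_size "$3"
}
```

For example, `tune_hydration clone 256 16`.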
|  |  | 
|  | Examples | 
|  | ======== | 
|  |  | 
|  | Clone a device containing a file system | 
|  | --------------------------------------- | 
|  |  | 
|  | 1. Create the dm-clone device. | 
|  |  | 
|  |    :: | 
|  |  | 
|  |     dmsetup create clone --table "0 1048576000 clone $metadata_dev $dest_dev \ | 
|  |       $source_dev 8 1 no_hydration" | 
|  |  | 
|  | 2. Mount the device and trim the file system. dm-clone interprets the discards | 
|  |    sent by the file system and it will not hydrate the unused space. | 
|  |  | 
|  |    :: | 
|  |  | 
|  |     mount /dev/mapper/clone /mnt/cloned-fs | 
|  |     fstrim /mnt/cloned-fs | 
|  |  | 
|  | 3. Enable background hydration of the destination device. | 
|  |  | 
|  |    :: | 
|  |  | 
|  |     dmsetup message clone 0 enable_hydration | 
|  |  | 
|  | 4. When the hydration finishes, we can replace the dm-clone table with a linear | 
|  |    table. | 
|  |  | 
|  |    :: | 
|  |  | 
|  |     dmsetup suspend clone | 
|  |     dmsetup load clone --table "0 1048576000 linear $dest_dev 0" | 
|  |     dmsetup resume clone | 
|  |  | 
|  |    The metadata device is no longer needed and can be safely discarded or reused | 
|  |    for other purposes. | 
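
The final step should only run once the hydration has finished. The following is a minimal sketch of a guard that checks the status line before swapping the table, assuming a single clone target starting at sector 0. The helper name `finalize_clone` is hypothetical.

```shell
# Hypothetical helper: replace the clone table with a linear table, but only
# if the status reports every region hydrated and none still hydrating.
# Arguments: <dm device name> <destination dev>
finalize_clone() {
    name=$1
    dest_dev=$2
    status=$(dmsetup status "$name") || return 1

    # Split the status line into fields: "<start> <length> clone ..." with
    # <#hydrated regions>/<#total regions> in field 7 and
    # <#hydrating regions> in field 8.
    set -- $status
    [ "${3:-}" = clone ] || return 1
    length=$2
    regions=${7:-}
    hydrated=${regions%/*}
    total=${regions#*/}
    hydrating=${8:-}

    if [ "$hydrated" != "$total" ] || [ "$hydrating" != 0 ]; then
        echo "hydration not finished: $regions" >&2
        return 1
    fi

    dmsetup suspend "$name" &&
    dmsetup load "$name" --table "0 $length linear $dest_dev 0" &&
    dmsetup resume "$name"
}
```

For example, `finalize_clone clone "$dest_dev"`. The table length is taken from the status line, so it matches the length used when the device was created.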
|  |  | 
|  | Known issues | 
|  | ============ | 
|  |  | 
|  | 1. We redirect reads to not-yet-hydrated regions to the source device. If | 
|  |    reading the source device has high latency and the user repeatedly reads from | 
|  |    the same regions, this behaviour could degrade performance. We should use | 
|  |    these reads as hints to hydrate the relevant regions sooner. Currently, we | 
|  |    rely on the page cache to cache these regions, so we hopefully don't end up | 
|  |    reading them multiple times from the source device. | 
|  |  | 
|  | 2. In-core resources, i.e., the bitmaps tracking which regions are hydrated, | 
|  |    are not released after the hydration has finished. We should release them. | 
|  |  | 
|  | 3. During background hydration, if we fail to read the source or write to the | 
|  |    destination device, we print an error message, but the hydration process | 
|  |    continues indefinitely, until it succeeds. We should stop the background | 
|  |    hydration after a number of failures and emit a dm event for user space to | 
|  |    notice. | 
|  |  | 
|  | Why not...? | 
|  | =========== | 
|  |  | 
|  | We explored the following alternatives before implementing dm-clone: | 
|  |  | 
|  | 1. Use dm-cache with cache size equal to the source device and implement a new | 
|  |    cloning policy: | 
|  |  | 
|  |    * The resulting cache device is not a one-to-one mirror of the source device | 
|  |      and thus we cannot remove the cache device once cloning completes. | 
|  |  | 
|  |    * dm-cache writes to the source device, which violates our requirement that | 
|  |      the source device must be treated as read-only. | 
|  |  | 
|  |    * Caching is semantically different from cloning. | 
|  |  | 
|  | 2. Use dm-snapshot with a COW device equal to the source device: | 
|  |  | 
|  |    * dm-snapshot stores its metadata in the COW device, so the resulting device | 
|  |      is not a one-to-one mirror of the source device. | 
|  |  | 
|  |    * No background copying mechanism. | 
|  |  | 
|  |    * dm-snapshot needs to commit its metadata whenever a pending exception | 
|  |      completes, to ensure snapshot consistency. In the case of cloning, we don't | 
|  |      need to be so strict and can rely on committing metadata every time a FLUSH | 
|  |      or FUA bio is written, or periodically, like dm-thin and dm-cache do. This | 
|  |      improves the performance significantly. | 
|  |  | 
|  | 3. Use dm-mirror: The mirror target has a background copying/mirroring | 
|  |    mechanism, but it writes to all mirrors, thus violating our requirement that | 
|  |    the source device must be treated as read-only. | 
|  |  | 
|  | 4. Use dm-thin's external snapshot functionality. This approach is the most | 
|  |    promising among all alternatives, as the thinly-provisioned volume is a | 
|  |    one-to-one mirror of the source device and handles reads and writes to | 
|  |    un-provisioned/not-yet-cloned areas the same way as dm-clone does. | 
|  |  | 
|  |    Still: | 
|  |  | 
|  |    * There is no background copying mechanism, though one could be implemented. | 
|  |  | 
|  |    * Most importantly, we want to support arbitrary block devices as the | 
|  |      destination of the cloning process and not restrict ourselves to | 
|  |      thinly-provisioned volumes. Thin-provisioning has an inherent metadata | 
|  |      overhead, for maintaining the thin volume mappings, which significantly | 
|  |      degrades performance. | 
|  |  | 
|  |    Moreover, cloning a device shouldn't force the use of thin-provisioning. On | 
|  |    the other hand, if we wish to use thin provisioning, we can just use a thin | 
|  |    LV as dm-clone's destination device. |