| David G. Korn, AT&T Labs |
| Joshua P. MacDonald, UC Berkeley |
| Jeffrey C. Mogul, Compaq WRL |
| Internet-Draft Kiem-Phong Vo, AT&T Labs |
| Expires: 09 November 2002 09 November 2001 |
| |
| |
| The VCDIFF Generic Differencing and Compression Data Format |
| |
| draft-korn-vcdiff-06.txt |
| |
| |
| |
| Status of this Memo |
| |
| This document is an Internet-Draft and is in full conformance |
| with all provisions of Section 10 of RFC2026. |
| |
| Internet-Drafts are working documents of the Internet Engineering |
| Task Force (IETF), its areas, and its working groups. Note that |
| other groups may also distribute working documents as |
| Internet-Drafts. |
| |
| Internet-Drafts are draft documents valid for a maximum of six |
| months and may be updated, replaced, or obsoleted by other |
| documents at any time. It is inappropriate to use Internet- |
| Drafts as reference material or to cite them other than as |
| "work in progress." |
| |
| The list of current Internet-Drafts can be accessed at |
| http://www.ietf.org/ietf/1id-abstracts.txt |
| |
| The list of Internet-Draft Shadow Directories can be accessed at |
| http://www.ietf.org/shadow.html. |
| |
| |
| Abstract |
| |
| This memo describes a general, efficient and portable data format |
| suitable for encoding compressed and/or differencing data so that |
| they can be easily transported among computers. |
| |
| |
| Table of Contents: |
| |
| 1. EXECUTIVE SUMMARY ............................................ 2 |
| 2. CONVENTIONS .................................................. 3 |
| 3. DELTA INSTRUCTIONS ........................................... 4 |
| 4. DELTA FILE ORGANIZATION ...................................... 5 |
| 5. DELTA INSTRUCTION ENCODING ................................... 9 |
| 6. DECODING A TARGET WINDOW ..................................... 14 |
| 7. APPLICATION-DEFINED CODE TABLES .............................. 16 |
| 8. PERFORMANCE .................................................. 16 |
| 9. FURTHER ISSUES ............................................... 17 |
| 10. SUMMARY ...................................................... 18 |
| 11. ACKNOWLEDGEMENTS ............................................. 18 |
| 12. SECURITY CONSIDERATIONS ...................................... 18 |
| 13. SOURCE CODE AVAILABILITY ..................................... 18 |
| 14. INTELLECTUAL PROPERTY RIGHTS ................................. 18 |
| 15. IANA CONSIDERATIONS .......................................... 19 |
| 16. REFERENCES ................................................... 19 |
| 17. AUTHOR'S ADDRESS ............................................. 20 |
| |
| |
| 1. EXECUTIVE SUMMARY |
| |
| Compression and differencing techniques can greatly improve storage |
| and transmission of files and file versions. Since files are often |
| transported across machines with distinct architectures and performance |
| characteristics, such data should be encoded in a form that is portable |
| and can be decoded with little or no knowledge of the encoders. |
| This document describes Vcdiff, a compact portable encoding format |
| designed for these purposes. |
| |
| Data differencing is the process of computing a compact and invertible |
| encoding of a "target file" given a "source file". Data compression |
| is similar but without the use of source data. The UNIX utilities diff, |
| compress, and gzip are well-known examples of data differencing and |
| compression tools. For data differencing, the computed encoding is |
| called a "delta file", and, for data compression, it is called |
| a "compressed file". Delta and compressed files are good for storage |
| and transmission as they are often smaller than the originals. |
| |
| Data differencing and data compression are traditionally treated |
| as distinct types of data processing. However, as shown in the Vdelta |
| technique by Korn and Vo [1], compression can be thought of as a special |
| case of differencing in which the source data is empty. The basic idea |
| is to unify the string parsing scheme used in the Lempel-Ziv'77 style |
| compressors [2], and the block-move technique of Tichy [3]. Loosely |
| speaking, this works as follows: |
| |
| a. Concatenate source and target data. |
| b. Parse the data from left to right as in LZ'77 but |
| make sure that a parsed segment starts the target data. |
| c. Start to output when reaching target data. |
| |
| Parsing is based on string matching algorithms such as suffix trees [4] |
| or hashing with different time and space performance characteristics. |
| Vdelta uses a fast string matching algorithm that requires less memory |
| than other techniques [5,6]. However, even with this algorithm, the |
| memory requirement can still be prohibitive for large files. A common |
| way to deal with memory limitation is to partition an input file into |
| chunks called "windows" and process them separately. Here, except for |
| unpublished work by Vo, little has been done on designing effective |
| windowing schemes. Current techniques, including Vdelta, simply use |
| source and target windows with corresponding addresses across source |
| and target files. |
| |
| String matching and windowing algorithms have large influence on the |
| compression rate of delta and compressed files. However, it is desirable |
| to have a portable encoding format that is independent of such algorithms. |
| This enables construction of client-server applications in which a server |
| may serve clients with unknown computing characteristics. Unfortunately, |
| all current differencing and compressing tools, including Vdelta, fall |
| short in this respect. Their storage formats are closely intertwined |
| with the implemented string matching and/or windowing algorithms. |
| |
| The encoding format Vcdiff proposed here addresses the above issues. |
| Vcdiff achieves the below characteristics: |
| |
| Output compactness: |
| The basic encoding format compactly represents compressed or delta |
| files. Applications can further extend the basic encoding format |
| with "secondary encoders" to achieve more compression. |
| |
| Data portability: |
| The basic encoding format is free from machine byte order and |
| word size issues. This allows data to be encoded on one machine |
| and decoded on a different machine with different architecture. |
| |
| Algorithm genericity: |
| The decoding algorithm is independent from string matching and |
| windowing algorithms. This allows competition among implementations |
| of the encoder while keeping the same decoder. |
| |
| Decoding efficiency: |
| Except for secondary encoder issues, the decoding algorithm runs |
| in time proportional to the size of the target file and uses space |
| proportional to the maximal window size. Vcdiff differs from more |
| conventional compressors in that it uses only byte-aligned |
| data, thus avoiding bit-level operations, which improves |
| decoding speed at the slight cost of compression efficiency. |
| |
| The Vcdiff data format and the algorithms for decoding data shall be |
| described next. Since Vcdiff treats compression as a special case of |
| differencing, we shall use the term "delta file" to indicate the |
| compressed output for both cases. |
| |
| |
| 2. CONVENTIONS |
| |
| The basic data unit is a byte. For portability, Vcdiff shall limit |
| a byte to its lower eight bits even on machines with larger bytes. |
| The bits in a byte are ordered from right to left so that the least |
| significant bit (LSB) has value 1, and the most significant bit (MSB), |
| has value 128. |
| |
| For purposes of exposition in this document, we adopt the convention |
| that the LSB is numbered 0, and the MSB is numbered 7. Bit numbers |
| never appear in the encoded format itself. |
| |
| Vcdiff encodes unsigned integer values using a portable variable-sized |
| format (originally introduced in the Sfio library [7]). This encoding |
| treats an integer as a number in base 128. Then, each digit in this |
| representation is encoded in the lower seven bits of a byte. Except for |
| the least significant byte, other bytes have their most significant bit |
| turned on to indicate that there are still more digits in the encoding. |
| The two key properties of this integer encoding that are beneficial |
| to a data compression format are: |
| |
| a. The encoding is portable among systems using 8-bit bytes, and |
| b. Small values are encoded compactly. |
| |
| For example, consider the value 123456789 which can be represented with |
| four 7-bit digits whose values are 58, 111, 26, 21 in order from most |
| to least significant. Below is the 8-bit byte encoding of these digits. |
| Note that the MSBs of 58, 111 and 26 are on. |
| |
| +-------------------------------------------+ |
| | 10111010 | 11101111 | 10011010 | 00010101 | |
| +-------------------------------------------+ |
| MSB+58 MSB+111 MSB+26 0+21 |
| |
| |
| Henceforth, the terms "byte" and "integer" will refer to a byte and an |
| unsigned integer as described. |
| |
| |
| From time to time, algorithms are exhibited to clarify the descriptions |
| of parts of the Vcdiff format. On such occasions, the C language will be |
| used to make precise the algorithms. The C code shown in this |
| document is meant for clarification only, and is not part of the |
| actual specification of the Vcdiff format. |
| |
| In this specification, the key words "MUST", "MUST NOT", |
| "SHOULD", "SHOULD NOT", and "MAY" document are to be interpreted as |
| described in RFC2119 [12]. |
| |
| |
| 3. DELTA INSTRUCTIONS |
| |
| A large target file is partitioned into non-overlapping sections |
| called "target windows". These target windows are processed separately |
| and sequentially based on their order in the target file. |
| |
| A target window T of length t may be compared against some source data |
| segment S of length s. By construction, this source data segment S |
| comes either from the source file, if one is used, or from a part of |
| the target file earlier than T. In this way, during decoding, S is |
| completely known when T is being decoded. |
| |
| The choices of T, t, S and s are made by some window selection algorithm |
| which can greatly affect the size of the encoding. However, as seen later, |
| these choices are encoded so that no knowledge of the window selection |
| algorithm is needed during decoding. |
| |
| Assume that S[j] represents the jth byte in S, and T[k] represents |
| the kth byte in T. Then, for the delta instructions, we treat the data |
| windows S and T as substrings of a superstring U formed by concatenating |
| them like this: |
| |
| S[0]S[1]...S[s-1]T[0]T[1]...T[t-1] |
| |
| The "address" of a byte in S or T is referred to by its location in U. |
| For example, the address of T[k] is s+k. |
| |
| The instructions to encode and direct the reconstruction of a target |
| window are called delta instructions. There are three types: |
| |
| ADD: This instruction has two arguments, a size x and a sequence of |
| x bytes to be copied. |
| COPY: This instruction has two arguments, a size x and an address p |
| in the string U. The arguments specify the substring of U that |
| must be copied. We shall assert that such a substring must be |
| entirely contained in either S or T. |
| RUN: This instruction has two arguments, a size x and a byte b that |
| will be repeated x times. |
| |
| Below are example source and target windows and the delta instructions |
| that encode the target window in terms of the source window. |
| |
| a b c d e f g h i j k l m n o p |
| a b c d w x y z e f g h e f g h e f g h e f g h z z z z |
| |
| COPY 4, 0 |
| ADD 4, w x y z |
| COPY 4, 4 |
| COPY 12, 24 |
| RUN 4, z |
| |
| |
| Thus, the first letter 'a' in the target window is at location 16 |
| in the superstring. Note that the fourth instruction, "COPY 12, 24", |
| copies data from T itself since address 24 is position 8 in T. |
| This instruction also shows that it is fine to overlap the data to be |
| copied with the data being copied from as long as the latter starts |
| earlier. This enables efficient encoding of periodic sequences, |
| i.e., sequences with regularly repeated subsequences. The RUN instruction |
| is a compact way to encode a sequence repeating the same byte even though |
| such a sequence can be thought of as a periodic sequence with period 1. |
| |
| To reconstruct the target window, one simply processes one delta |
| instruction at a time and copy the data either from the source window |
| or the being reconstructed target window based on the type of the |
| instruction and the associated address, if any. |
| |
| |
| 4. DELTA FILE ORGANIZATION |
| |
| A Vcdiff delta file starts with a Header section followed by a sequence |
| of Window sections. The Header section includes magic bytes to identify |
| the file type, and information concerning data processing beyond the |
| basic encoding format. The Window sections encode the target windows. |
| |
| Below is the overall organization of a delta file. The indented items |
| refine the ones immediately above them. An item in square brackets may |
| or may not be present in the file depending on the information encoded |
| in the Indicator byte above it. |
| |
| Header |
| Header1 - byte |
| Header2 - byte |
| Header3 - byte |
| Header4 - byte |
| Hdr_Indicator - byte |
| [Secondary compressor ID] - byte |
| |
| [@@@ Why is compressor ID not an integer? ] |
| [@@@ If we aren't defining any secondary compressors yet, then it seems |
| that defining the [Secondary compressor ID] and the corresponding |
| VCD_DECOMPRESS Hdr_Indicator bit in this draft has no real value. An |
| implementation of this specification won't be able to decode a VCDIFF |
| encoded with this option if it doesn't know about any secondary |
| compressors. It seems that you should specify the bits related to |
| secondary compressors once you have defined the first a secondary |
| compressor. I can imagine a secondary-compressor might want to supply |
| extra information, such as a dictionary of some kind, in which case |
| this speculative treatment wouldn't go far enough.] |
| |
| [Length of code table data] - integer |
| [Code table data] |
| Size of near cache - byte |
| Size of same cache - byte |
| Compressed code table data |
| Window1 |
| Win_Indicator - byte |
| [Source segment size] - integer |
| [Source segment position] - integer |
| The delta encoding of the target window |
| Length of the delta encoding - integer |
| The delta encoding |
| Size of the target window - integer |
| Delta_Indicator - byte |
| Length of data for ADDs and RUNs - integer |
| Length of instructions and sizes - integer |
| Length of addresses for COPYs - integer |
| Data section for ADDs and RUNs - array of bytes |
| Instructions and sizes section - array of bytes |
| Addresses section for COPYs - array of bytes |
| Window2 |
| ... |
| |
| |
| |
| 4.1 The Header Section |
| |
| Each delta file starts with a header section organized as below. |
| Note the convention that square-brackets enclose optional items. |
| |
| Header1 - byte = 0xE6 |
| Header2 - byte = 0xD3 |
| Header3 - byte = 0xD4 |
| |
| HMMM |
| |
| 0xD6 |
| 0xC3 |
| 0xC4 |
| |
| Header4 - byte |
| Hdr_Indicator - byte |
| [Secondary compressor ID] - byte |
| [Length of code table data] - integer |
| [Code table data] |
| |
| The first three Header bytes are the ASCII characters 'V', 'C' and 'D' |
| with their most significant bits turned on (in hexadecimal, the values |
| are 0xE6, 0xD3, and 0xD4). The fourth Header byte is currently set to |
| zero. In the future, it might be used to indicate the version of Vcdiff. |
| |
| The Hdr_Indicator byte shows if there are any initialization data |
| required to aid in the reconstruction of data in the Window sections. |
| This byte MAY have non-zero values for either, both, or neither of |
| the two bits VCD_DECOMPRESS and VCD_CODETABLE below: |
| |
| 7 6 5 4 3 2 1 0 |
| +-+-+-+-+-+-+-+-+ |
| | | | | | | | | | |
| +-+-+-+-+-+-+-+-+ |
| ^ ^ |
| | | |
| | +-- VCD_DECOMPRESS |
| +---- VCD_CODETABLE |
| |
| If bit 0 (VCD_DECOMPRESS) is non-zero, this indicates that a secondary |
| compressor may have been used to further compress certain parts of the |
| delta encoding data as described in Sections 4.3 and 6. In that case, |
| the ID of the secondary compressor is given next. If this bit is zero, |
| the compressor ID byte is not included. |
| |
| [@@@ If we aren't defining any secondary compressors yet, then it seems |
| this bit has no real value yet..] |
| |
| If bit 1 (VCD_CODETABLE) is non-zero, this indicates that an |
| application-defined code table is to be used for decoding the delta |
| instructions. This table itself is compressed. The length of the data |
| comprising this compressed code table and the data follow next. Section 7 |
| discusses application-defined code tables. If this bit is zero, the code |
| table data length and the code table data are not included. |
| |
| If both bits are set, then the compressor ID byte is included |
| before the code table data length and the code table data. |
| |
| |
| 4.2 The Format of a Window Section |
| |
| Each Window section is organized as follows: |
| |
| Win_Indicator - byte |
| [Source segment length] - integer |
| [Source segment position] - integer |
| The delta encoding of the target window |
| |
| |
| Below are the detail of the various items: |
| |
| [@@@ Here, I want to replace the Win_Indicator with a source-count, |
| followed by source-count length/position pairs?] |
| |
| Win_Indicator: |
| This byte is a set of bits, as shown: |
| |
| 7 6 5 4 3 2 1 0 |
| +-+-+-+-+-+-+-+-+ |
| | | | | | | | | | |
| +-+-+-+-+-+-+-+-+ |
| ^ ^ |
| | | |
| | +-- VCD_SOURCE |
| +---- VCD_TARGET |
| |
| |
| If bit 0 (VCD_SOURCE) is non-zero, this indicates that a segment |
| of data from the "source" file was used as the corresponding |
| source window of data to encode the target window. The decoder |
| will use this same source data segment to decode the target window. |
| |
| If bit 1 (VCD_TARGET) is non-zero, this indicates that a segment |
| of data from the "target" file was used as the corresponding |
| source window of data to encode the target window. As above, this |
| same source data segment is used to decode the target window. |
| |
| The Win_Indicator byte MUST NOT have more than one of the bits |
| set (non-zero). It MAY have none of these bits set. |
| |
| If one of these bits is set, the byte is followed by two |
| integers to indicate respectively the length and position of |
| the source data segment in the relevant file. If the |
| indicator byte is zero, the target window was compressed |
| by itself without comparing against another data segment, |
| and these two integers are not included. |
| |
| The delta encoding of the target window: |
| This contains the delta encoding of the target window either |
| in terms of the source data segment (i.e., VCD_SOURCE |
| or VCD_TARGET was set) or by itself if no source window |
| is specified. This data format is discussed next. |
| |
| |
| 4.3 The Delta Encoding of a Target Window |
| |
| The delta encoding of a target window is organized as follows: |
| |
| Length of the delta encoding - integer |
| The delta encoding |
| Length of the target window - integer |
| Delta_Indicator - byte |
| Length of data for ADDs and RUNs - integer |
| Length of instructions section - integer |
| Length of addresses for COPYs - integer |
| Data section for ADDs and RUNs - array of bytes |
| Instructions and sizes section - array of bytes |
| Addresses section for COPYs - array of bytes |
| |
| |
| Length of the delta encoding: |
| This integer gives the total number of remaining bytes that |
| comprise data of the delta encoding for this target window. |
| |
| The delta encoding: |
| This contains the data representing the delta encoding which |
| is described next. |
| |
| Length of the target window: |
| This integer indicates the actual size of the target window |
| after decompression. A decoder can use this value to allocate |
| memory to store the uncompressed data. |
| |
| Delta_Indicator: |
| This byte is a set of bits, as shown: |
| |
| 7 6 5 4 3 2 1 0 |
| +-+-+-+-+-+-+-+-+ |
| | | | | | | | | | |
| +-+-+-+-+-+-+-+-+ |
| ^ ^ ^ |
| | | | |
| | | +-- VCD_DATACOMP |
| | +---- VCD_INSTCOMP |
| +------ VCD_ADDRCOMP |
| |
| VCD_DATACOMP: bit value 1. |
| VCD_INSTCOMP: bit value 2. |
| VCD_ADDRCOMP: bit value 4. |
| |
| As discussed, the delta encoding consists of COPY, ADD and RUN |
| instructions. The ADD and RUN instructions have accompanying |
| unmatched data (that is, data that does not specifically match |
| any data in the source window or in some earlier part of the |
| target window) and the COPY instructions have addresses of where |
| the matches occur. OPTIONALLY, these types of data MAY be further |
| compressed using a secondary compressor. Thus, Vcdiff separates |
| the encoding of the delta instructions into three parts: |
| |
| a. The unmatched data in the ADD and RUN instructions, |
| b. The delta instructions and accompanying sizes, and |
| c. The addresses of the COPY instructions. |
| |
| If the bit VCD_DECOMPRESS (Section 4.1) was on, each of these |
| sections may have been compressed using the specified secondary |
| compressor. The bit positions 0 (VCD_DATACOMP), 1 (VCD_INSTCOMP), |
| and 2 (VCD_ADDRCOMP) respectively indicate, if non-zero, that |
| the corresponding parts are compressed. Then, these parts MUST |
| be decompressed before decoding the delta instructions. |
| |
| Length of data for ADDs and RUNs: |
| This is the length (in bytes) of the section of data storing |
| the unmatched data accompanying the ADD and RUN instructions. |
| |
| Length of instructions section: |
| This is the length (in bytes) of the delta instructions and |
| accompanying sizes. |
| |
| Length of addresses for COPYs: |
| This is the length (in bytes) of the section storing |
| the addresses of the COPY instructions. |
| |
| Data section for ADDs and RUNs: |
| This sequence of bytes encodes the unmatched data for the ADD |
| and RUN instructions. |
| |
| Instructions and sizes section: |
| This sequence of bytes encodes the instructions and their sizes. |
| |
| Addresses section for COPYs: |
| This sequence of bytes encodes the addresses of the COPY |
| instructions. |
| |
| |
| 5. DELTA INSTRUCTION ENCODING |
| |
| The delta instructions described in Section 3 represent the results of |
| string matching. For many data differencing applications in which the |
| changes between source and target data are small, any straightforward |
| representation of these instructions would be adequate. However, for |
| applications including data compression, it is important to encode |
| these instructions well to achieve good compression rates. From our |
| experience, the following observations can be made: |
| |
| a. The addresses in COPY instructions are locations of matches and |
| often occur close by or even exactly equal to one another. This is |
| because data in local regions are often replicated with minor changes. |
| In turn, this means that coding a newly matched address against some |
| set of recently matched addresses can be beneficial. |
| |
| b. The matches are often short in length and separated by small amounts |
| of unmatched data. That is, the lengths of COPY and ADD instructions |
| are often small. This is particularly true of binary data such as |
| executable files or structured data such as HTML or XML. In such cases, |
| compression can be improved by combining the encoding of the sizes |
| and the instruction types as well as combining the encoding of adjacent |
| delta instructions with sufficiently small data sizes. |
| |
| The below subsections discuss how the Vcdiff data format provides |
| mechanisms enabling encoders to use the above observations to improve |
| compression rates. |
| |
| |
| 5.1 Address Encoding Modes of COPY Instructions |
| |
| As mentioned earlier, addresses of COPY instructions often occur close |
| to one another or are exactly equal. To take advantage of this phenomenon |
| and encode addresses of COPY instructions more efficiently, the Vcdiff |
| data format supports the use of two different types of address caches. |
| Both the encoder and decoder maintain these caches, so that decoder's |
| caches remain synchronized with the encoder's caches. |
| |
| a. A "near" cache is an array with "s_near" slots, each containing an |
| address used for encoding addresses nearby to previously encoded |
| addresses (in the positive direction only). The near cache also |
| maintains a "next_slot" index to the near cache. New entries to the |
| near cache are always inserted in the next_slot index, which maintains |
| a circular buffer of the s_near most recent addresses. |
| |
| b. A "same" cache is an array with "s_same" multiple of 256 slots, each |
| containing an address. The same cache maintains a hash table of recent |
| addresses used for repeated encoding of the exact same address. |
| |
| |
| By default, the parameters s_near and s_same are respectively set to 4 |
| and 3. An encoder MAY modify these values, but then it MUST encode the |
| new values in the encoding itself, as discussed in Section 7, so that |
| the decoder can properly set up its own caches. |
| |
| At the start of processing a target window, an implementation |
| (encoder or decoder) initializes all of the slots in both caches |
| to zero. The next_slot pointer of the near cache is set |
| to point to slot zero. |
| |
| Each time a COPY instruction is processed by the encoder or |
| decoder, the implementation's caches are updated as follows, where |
| "addr" is the address in the COPY instruction. |
| |
| a. The slot in the near cache referenced by the next_slot |
| index is set to addr. The next_slot index is then incremented |
| modulo s_near. |
| |
| b. The slot in the same cache whose index is addr%(s_same*256) |
| is set to addr. [We use the C notations of % for modulo and |
| * for multiplication.] |
| |
| |
| 5.2 Example code for maintaining caches |
| |
| To make clear the above description, below are example cache data |
| structures and algorithms to initialize and update them: |
| |
| typedef struct _cache_s |
| { |
| int* near; /* array of size s_near */ |
| int s_near; |
| int next_slot; /* the circular index for near */ |
| int* same; /* array of size s_same*256 */ |
| int s_same; |
| } Cache_t; |
| |
| cache_init(Cache_t* ka) |
| { |
| int i; |
| |
| ka->next_slot = 0; |
| for(i = 0; i < ka->s_near; ++i) |
| ka->near[i] = 0; |
| |
| for(i = 0; i < ka->s_same*256; ++i) |
| ka->same[i] = 0; |
| } |
| |
| cache_update(Cache_t* ka, int addr) |
| { |
| if(ka->s_near > 0) |
| { ka->near[ka->next_slot] = addr; |
| ka->next_slot = (ka->next_slot + 1) % ka->s_near; |
| } |
| |
| if(ka->s_same > 0) |
| ka->same[addr % (ka->s_same*256)] = addr; |
| } |
| |
| |
| 5.3 Encoding of COPY instruction addresses |
| |
| The address of a COPY instruction is encoded using different modes |
| depending on the type of cached address used, if any. |
| |
| Let "addr" be the address of a COPY instruction to be decoded and "here" |
| be the current location in the target data (i.e., the start of the data |
| about to be encoded or decoded). Let near[j] be the jth element in |
| the near cache, and same[k] be the kth element in the same cache. |
| Below are the possible address modes: |
| |
| VCD_SELF: This mode has value 0. The address was encoded by itself |
| as an integer. |
| |
| VCD_HERE: This mode has value 1. The address was encoded as |
| the integer value "here - addr". |
| |
| Near modes: The "near modes" are in the range [2,s_near+1]. Let m |
| be the mode of the address encoding. The address was encoded |
| as the integer value "addr - near[m-2]". |
| |
| Same modes: The "same modes" are in the range |
| [s_near+2,s_near+s_same+1]. Let m be the mode of the encoding. |
| The address was encoded as a single byte b such that |
| "addr == same[(m - (s_near+2))*256 + b]". |
| |
| |
| 5.3 Example code for encoding and decoding of COPY instruction addresses |
| |
| We show example algorithms below to demonstrate use of address modes more |
| clearly. The encoder has freedom to choose address modes, the sample |
| addr_encode() algorithm merely shows one way of picking the address |
| mode. The decoding algorithm addr_decode() will uniquely decode addresses |
| regardless of the encoder's algorithm choice. |
| |
| Note that the address caches are updated immediately after an address is |
| encoded or decoded. In this way, the decoder is always synchronized with |
| the encoder. |
| |
| int addr_encode(Cache_t* ka, int addr, int here, int* mode) |
| { |
| int i, d, bestd, bestm; |
| |
| /* Attempt to find the address mode that yields the |
| * smallest integer value for "d", the encoded address |
| * value, thereby minimizing the encoded size of the |
| * address. */ |
| |
| bestd = addr; bestm = VCD_SELF; /* VCD_SELF == 0 */ |
| |
| if((d = here-addr) < bestd) |
| { bestd = d; bestm = VCD_HERE; } /* VCD_HERE == 1 */ |
| |
| for(i = 0; i < ka->s_near; ++i) |
| if((d = addr - ka->near[i]) >= 0 && d < bestd) |
| { bestd = d; bestm = i+2; } |
| |
| if(ka->s_same > 0 && ka->same[d = addr%(ka->s_same*256)] == addr) |
| { bestd = d%256; bestm = ka->s_near + 2 + d/256; } |
| |
| cache_update(ka,addr); |
| |
| *mode = bestm; /* this returns the address encoding mode */ |
| return bestd; /* this returns the encoded address */ |
| } |
| |
| Note that the addr_encode() algorithm chooses the best address mode using a |
| local optimization, but that may not lead to the best encoding efficiency |
| because different modes lead to different instruction encodings, as described below. |
| |
| The functions addrint() and addrbyte() used in addr_decode() obtain from |
| the "Addresses section for COPYs" (Section 4.3) an integer or a byte, |
| respectively. These utilities will not be described here. We simply |
| recall that an integer is represented as a compact variable-sized string |
| of bytes as described in Section 2 (i.e., base 128). |
| |
| int addr_decode(Cache_t* ka, int here, int mode) |
| { int addr, m; |
| |
| if(mode == VCD_SELF) |
| addr = addrint(); |
| else if(mode == VCD_HERE) |
| addr = here - addrint(); |
| else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */ |
| addr = ka->near[m] + addrint(); |
| else /* same cache */ |
| { m = mode - (2 + ka->s_near); |
| addr = ka->same[m*256 + addrbyte()]; |
| } |
| |
| cache_update(ka, addr); |
| |
| return addr; |
| } |
| |
| |
| 5.4 Instruction Codes |
| |
| As noted, the data sizes associated with delta instructions are often |
| small. Thus, compression efficiency can be improved by combining the sizes |
| and instruction types in a single encoding, as well by combining certain |
| pairs of adjacent delta instructions. Effective choices of when to perform |
| such combinations depend on many factors including the data being processed |
| and the string matching algorithm in use. For example, if many COPY |
| instructions have the same data sizes, it may be worth to encode these |
| instructions more compactly than others. |
| |
| The Vcdiff data format is designed so that a decoder does not need to be |
| aware of the choices made in encoding algorithms. This is achieved with the |
| notion of an "instruction code table" containing 256 entries. Each entry |
| defines either a single delta instruction or a pair of instructions that |
| have been combined. Note that the code table itself only exists in main |
| memory, not in the delta file (unless using an application-defined code |
| table, described in Section 7). The encoded data simply includes the index |
| of each instruction and, since there are only 256 indices, each index |
| can be represented as a single byte. |
| |
| Each instruction code entry contains six fields, each of which |
| is a single byte with unsigned value: |
| |
| +-----------------------------------------------+ |
| | inst1 | size1 | mode1 | inst2 | size2 | mode2 | |
| +-----------------------------------------------+ |
| |
| @@@ could be more compact |
| |
| Each triple (inst,size,mode) defines a delta instruction. The meanings |
| of these fields are as follows: |
| |
| inst: An "inst" field can have one of the four values: NOOP (0), ADD (1), |
| RUN (2) or COPY (3) to indicate the instruction types. NOOP means |
| that no instruction is specified. In this case, both the corresponding |
| size and mode fields will be zero. |
| |
| size: A "size" field is zero or positive. A value zero means that the |
| size associated with the instruction is encoded separately as |
| an integer in the "Instructions and sizes section" (Section 6). |
| A positive value for "size" defines the actual data size. |
| Note that since the size is restricted to a byte, the maximum |
| value for any instruction with size implicitly defined in the code |
| table is 255. |
| |
| mode: A "mode" field is significant only when the associated delta |
| instruction is a COPY. It defines the mode used to encode the |
| associated addresses. For other instructions, this is always zero. |
| |
| |
| 5.5 The Code Table |
| |
| Following the discussions on address modes and instruction code tables, |
| we define a "Code Table" to have the data below: |
| |
| s_near: the size of the near cache, |
| s_same: the size of the same cache, |
| i_code: the 256-entry instruction code table. |
| |
| Vcdiff itself defines a "default code table" in which s_near is 4 |
| and s_same is 3. Thus, there are 9 address modes for a COPY instruction. |
| The first two are VCD_SELF (0) and VCD_HERE (1). Modes 2, 3, 4 and 5 |
| are for addresses coded against the near cache. And, modes 6, 7 and 8 |
| are for addresses coded against the same cache. |
| |
| The default instruction code table is depicted below, in a compact |
| representation that we use only for descriptive purposes. See section 7 |
| for the specification of how an instruction code table is represented |
| in the Vcdiff encoding format. In the depiction, a zero value for |
| size indicates that the size is separately coded. The mode of non-COPY |
| instructions is represented as 0 even though they are not used. |
| |
| |
| TYPE SIZE MODE TYPE SIZE MODE INDEX |
| --------------------------------------------------------------- |
| 1. RUN 0 0 NOOP 0 0 0 |
| 2. ADD 0, [1,17] 0 NOOP 0 0 [1,18] |
| 3. COPY 0, [4,18] 0 NOOP 0 0 [19,34] |
| 4. COPY 0, [4,18] 1 NOOP 0 0 [35,50] |
| 5. COPY 0, [4,18] 2 NOOP 0 0 [51,66] |
| 6. COPY 0, [4,18] 3 NOOP 0 0 [67,82] |
| 7. COPY 0, [4,18] 4 NOOP 0 0 [83,98] |
| 8. COPY 0, [4,18] 5 NOOP 0 0 [99,114] |
| 9. COPY 0, [4,18] 6 NOOP 0 0 [115,130] |
| 10. COPY 0, [4,18] 7 NOOP 0 0 [131,146] |
| 11. COPY 0, [4,18] 8 NOOP 0 0 [147,162] |
| 12. ADD [1,4] 0 COPY [4,6] 0 [163,174] |
| 13. ADD [1,4] 0 COPY [4,6] 1 [175,186] |
| 14. ADD [1,4] 0 COPY [4,6] 2 [187,198] |
| 15. ADD [1,4] 0 COPY [4,6] 3 [199,210] |
| 16. ADD [1,4] 0 COPY [4,6] 4 [211,222] |
| 17. ADD [1,4] 0 COPY [4,6] 5 [223,234] |
| 18. ADD [1,4] 0 COPY 4 6 [235,238] |
| 19. ADD [1,4] 0 COPY 4 7 [239,242] |
| 20. ADD [1,4] 0 COPY 4 8 [243,246] |
| 21. COPY 4 [0,8] ADD 1 0 [247,255] |
| --------------------------------------------------------------- |
| |
| In the above depiction, each numbered line represents one or more |
| entries in the actual instruction code table (recall that an entry in |
| the instruction code table may represent up to two combined delta |
| instructions.) The last column ("INDEX") shows which index value or |
| range of index values of the entries covered by that line. The notation |
| [i,j] means values from i through j, inclusive. The first 6 columns of |
| a line in the depiction describe the pairs of instructions used for |
| the corresponding index value(s). |
| |
| If a line in the depiction includes a column entry using the [i,j] |
| notation, this means that the line is instantiated for each value |
| in the range from i to j, inclusive. The notation "0, [i,j]" means |
| that the line is instantiated for the value 0 and for each value |
| in the range from i to j, inclusive. |
| |
| If a line in the depiction includes more than one entry using the [i,j] |
| notation, implying a "nested loop" to convert the line to a range of |
| table entries, the first such [i,j] range specifies the outer loop, |
| and the second specifies the inner loop. |
| |
| The below examples should make clear the above description: |
| |
| Line 1 shows the single RUN instruction with index 0. As the size field |
| is 0, this RUN instruction always has its actual size encoded separately. |
| |
| Line 2 shows the 18 single ADD instructions. The ADD instruction with |
| size field 0 (i.e., the actual size is coded separately) has index 1. |
| ADD instructions with sizes from 1 to 17 use code indices 2 to 18 and |
| their sizes are as given (so they will not be separately encoded.) |
| |
| Following the single ADD instructions are the single COPY instructions |
| ordered by their address encoding modes. For example, line 11 shows the |
| COPY instructions with mode 8, i.e., the last of the same cache. |
| In this case, the COPY instruction with size field 0 has index 147. |
| Again, the actual size of this instruction will be coded separately. |
| |
| Lines 12 to 21 show the pairs of instructions that are combined together. |
| For example, line 12 depicts the 12 entries in which an ADD instruction |
| is combined with an immediately following COPY instruction. The entries |
| with indices 163, 164, 165 represent the pairs in which the ADD |
| instructions all have size 1 while the COPY instructions has mode |
| 0 (VCD_SELF) and sizes 4, 5 and 6 respectively. |
| |
| The last line, line 21, shows the eight instruction pairs where the first |
| instruction is a COPY and the second is an ADD. In this case, all COPY |
| instructions have size 4 with mode ranging from 0 to 8 and all the ADD |
| instructions have size 1. Thus, the entry with largest index 255 |
| combines a COPY instruction of size 4 and mode 8 with an ADD instruction |
| of size 1. |
| |
| The choice of the minimum size 4 for COPY instructions in the default code |
| table was made from experiments that showed that excluding small matches |
| (less then 4 bytes long) improved the compression rates. |
| |
| |
| 6. DECODING A TARGET WINDOW |
| |
| Section 4.3 discusses that the delta instructions and associated data |
| are encoded in three arrays of bytes: |
| |
| Data section for ADDs and RUNs, |
| Instructions and sizes section, and |
| Addresses section for COPYs. |
| |
| |
| Further, these data sections may have been further compressed by some |
| secondary compressor. Assuming that any such compressed data has been |
| decompressed so that we now have three arrays: |
| |
| inst: bytes coding the instructions and sizes. |
| data: unmatched data associated with ADDs and RUNs. |
| addr: bytes coding the addresses of COPYs. |
| |
| These arrays are organized as follows: |
| |
| inst: |
| a sequence of (index, [size1], [size2]) tuples, where "index" |
| is an index into the instruction code table, and size1 and size2 |
| are integers that MAY or MAY NOT be included in the tuple as |
| follows. The entry with the given "index" in the instruction |
| code table potentially defines two delta instructions. If the |
| first delta instruction is not a VCD_NOOP and its size is zero, |
| then size1 MUST be present. Otherwise, size1 MUST be omitted and |
| the size of the instruction (if it is not VCD_NOOP) is as defined |
| in the table. The presence or absence of size2 is defined |
| similarly with respect to the second delta instruction. |
| |
| data: |
| a sequence of data values, encoded as bytes. |
| |
| addr: |
| a sequence of address values. Addresses are normally encoded as |
| integers as described in Section 2 (i.e., base 128). |
| Since the same cache emits addresses in the range [0,255], |
| however, same cache addresses are always encoded as a |
| single byte. |
| |
| To summarize, each tuple in the "inst" array includes an index to some |
| entry in the instruction code table that determines: |
| |
| a. Whether one or two instructions were encoded and their types. |
| |
| b. If the instructions have their sizes encoded separately, these |
| sizes will follow, in order, in the tuple. |
| |
| c. If the instructions have accompanying data, i.e., ADDs or RUNs, |
| their data will be in the array "data". |
| |
| d. Similarly, if the instructions are COPYs, the coded addresses are |
| found in the array "addr". |
| |
| The decoding procedure simply processes the arrays by reading one code |
| index at a time, looking up the corresponding instruction code entry, |
| then consuming the respective sizes, data and addresses following the |
| directions in this entry. In other words, the decoder maintains an implicit |
| next-element pointer for each array; "consuming" an instruction tuple, |
| data, or address value implies incrementing the associated pointer. |
| |
| For example, if during the processing of the target window, the next |
| unconsumed tuple in the inst array has index value 19, then the first |
| instruction is a COPY, whose size is found as the immediately following |
| integer in the inst array. Since the mode of this COPY instruction is |
| VCD_SELF, the corresponding address is found by consuming the next |
| integer in the addr array. The data array is left intact. As the second |
| instruction for code index 19 is a NOOP, this tuple is finished. |
| |
| |
| 7. APPLICATION-DEFINED CODE TABLES |
| |
| Although the default code table used in Vcdiff is good for general |
| purpose encoders, there are times when other code tables may perform |
| better. For example, to code a file with many identical segments of data, |
| it may be advantageous to have a COPY instruction with the specific size |
| of these data segments so that the instruction can be encoded in a single |
| byte. Such a special code table MUST then be encoded in the delta file |
| so that the decoder can reconstruct it before decoding the data. |
| |
| Vcdiff allows an application-defined code table to be specified |
| in a delta file with the following data: |
| |
| Size of near cache - byte |
| Size of same cache - byte |
| Compressed code table data |
| |
| The "compressed code table data" encodes the delta between the default |
| code table (source) and the new code table (target) in the same manner as |
| described in Section 4.3 for encoding a target window in terms of a |
| source window. This delta is computed using the following steps: |
| |
| a. Convert the new instruction code table into a string, "code", of |
| 1536 bytes using the below steps in order: |
| |
| i. Add in order the 256 bytes representing the types of the first |
| instructions in the instruction pairs. |
| ii. Add in order the 256 bytes representing the types of the second |
| instructions in the instruction pairs. |
| iii. Add in order the 256 bytes representing the sizes of the first |
| instructions in the instruction pairs. |
| iv. Add in order the 256 bytes representing the sizes of the second |
| instructions in the instruction pairs. |
| v. Add in order the 256 bytes representing the modes of the first |
| instructions in the instruction pairs. |
| vi. Add in order the 256 bytes representing the modes of the second |
| instructions in the instruction pairs. |
| |
| b. Similarly, convert the default instruction code table into |
| a string "dflt". |
| |
| c. Treat the string "code" as a target window and "dflt" as the |
| corresponding source data and apply an encoding algorithm to |
| compute the delta encoding of "code" in terms of "dflt". |
| This computation MUST use the default code table for encoding |
| the delta instructions. |
| |
| The decoder can then reverse the above steps to decode the compressed |
| table data using the method of Section 6, employing the default code |
| table, to generate the new code table. Note that the decoder does not |
| need to know anything about the details of the encoding algorithm used |
| in step (c). The decoder is still able to decode the new code table |
| because the Vcdiff format is independent from the choice of encoding |
| algorithm, and because the encoder in step (c) uses the known, default |
| code table. |
| |
| |
| 8. PERFORMANCE |
| |
| The encoding format is compact. For compression only, using the LZ-77 |
| string parsing strategy and without any secondary compressors, the typical |
| compression rate is better than Unix compress and close to gzip. For |
| differencing, the data format is better than all known methods in |
| terms of its stated goal, which is primarily decoding speed and |
| encoding efficiency. |
| |
| We compare the performance of compress, gzip and Vcdiff using the |
| archives of three versions of the Gnu C compiler, gcc-2.95.1.tar, |
| gcc-2.95.2.tar and gcc-2.95.3.tar. The experiments were done on an |
| SGI-MIPS3, 400MHZ. Gzip was used at its default compression level. |
| Vcdiff timings were done using the Vcodex/Vcdiff software (Section 13). |
| As string and window matching typically dominates the computation during |
| compression, the Vcdiff compression times were directly due to the |
| algorithms used in the Vcodex/Vcdiff software. However, the decompression |
| times should be generic and representative of any good implementation |
| of the Vcdiff data format. Timing was done by running each program |
| three times and taking the average of the total cpu+system times. |
| |
| Below are the different Vcdiff runs: |
| |
| Vcdiff: vcdiff is used as compressor only. |
| |
| Vcdiff-d: vcdiff is used as a differencer only. That is, it only |
| compares target data against source data. Since the files |
| involved are large, they are broken into windows. In this |
| case, each target window starting at some file offset in |
| the target file is compared against a source window with |
| the same file offset (in the source file). The source |
| window is also slightly larger than the target window |
| to increase matching opportunities. The -d option also gives |
| a hint to the string matching algorithm of Vcdiff that |
| the two files are very similar with long stretches of matches. |
| The algorithm takes advantage of this to minimize its |
| processing of source data and save time. |
| |
| Vcdiff-dc: This is similar to Vcdiff-d but vcdiff can also compare |
| target data against target data as applicable. Thus, vcdiff |
| both computes differences and compresses data. The windowing |
| algorithm is the same as above. However, the above hint is |
| recinded in this case. |
| |
| Vcdiff-dcs: This is similar to Vcdiff-dc but the windowing algorithm |
| uses a content-based heuristic to select source data segments |
| that are more likely to match with a given target window. |
| Thus, the source data segment selected for a target window |
| often will not be aligned with the file offsets of this |
| target window. |
| |
| |
| gcc-2.95.1 gcc-2.95.2 compression decompression |
| raw size 55746560 55797760 |
| compress - 19939390 13.85s 7.09s |
| gzip - 12973443 42.99s 5.35s |
| Vcdiff - 15358786 20.04s 4.65s |
| Vcdiff-d - 100971 10.93s 1.92s |
| Vcdiff-dc - 97246 20.03s 1.84s |
| Vcdiff-dcs - 256445 44.81s 1.84s |
| |
| TABLE 1. Compressing gcc-2.95.2.tar given gcc-2.95.1 |
| |
| |
| TABLE 1 shows the raw sizes of gcc-2.95.1.tar and gcc-2.95.2.tar and the |
| sizes of the compressed results. As a pure compressor, the compression |
| rate for Vcdiff is worse than gzip and better than compress. The last |
| three rows shows that when two file versions are very similar, differencing |
| can have dramatically good compression rates. Vcdiff-d and Vcdiff-dc use |
| the same simple window selection method but Vcdiff-dc also does compression |
| so its output is slightly smaller. Vcdiff-dcs uses a heuristic based on |
| data content to search for source data that likely will match a given target |
| window. Although it does a good job, the heuristic did not always find the |
| best matches which are given by the simple algorithm of Vcdiff-d. As a |
| result, the output size is slightly larger. Note also that there is a large |
| cost in computing matching windows this way. Finally, the compression times |
| of Vcdiff-d is nearly half of that of Vcdiff-dc. It is tempting to conclude |
| that the compression feature causes the additional time in Vcdiff-dc |
| relative to Vcdiff-d. However, this is not the case. The hint given to |
| the Vcdiff string matching algorithm that the two files are likely to |
| have very long stretches of matches helps the algorithm to minimize |
| processing of the "source data", thus saving half the time. However, as we |
| shall see below when this hint is wrong, the result is even longer time. |
| |
| |
| gcc-2.95.2 gcc-2.95.3 compression decompression |
| raw size 55797760 55787520 |
| compress - 19939453 13.54s 7.00s |
| gzip - 12998097 42.63s 5.62s |
| Vcdiff - 15371737 20.09s 4.74s |
| Vcdiff-d - 26383849 71.41s 6.41s |
| Vcdiff-dc - 14461203 42.48s 4.82s |
| Vcdiff-dcs - 1248543 61.18s 1.99s |
| |
| TABLE 2. Compressing gcc-2.95.3.tar given gcc-2.95.2 |
| |
| |
| TABLE 2 shows the raw sizes of gcc-2.95.2.tar and gcc-2.95.3.tar and |
| the sizes of the compressed results. In this case, the tar file of |
| gcc-2.95.3 is rearranged in a way that makes the straightforward method |
| of matching file offsets for source and target windows fail. As a |
| result, Vcdiff-d performs rather dismally both in time and output size. |
| The large time for Vcdiff-d is directly due to fact that the string |
| matching algorithm has to work much harder to find matches when the hint |
| that two files have long matching stretches fails to hold. On the other |
| hand, Vcdiff-dc does both differencing and compression resulting in good |
| output size. Finally, the window searching heuristic used in Vcdiff-dcs is |
| effective in finding the right matching source windows for target windows |
| resulting a small output size. This shows why the data format needs to |
| have a way to specify matching windows to gain performance. Finally, |
| we note that the decoding times are always good regardless of how |
| the string matching or window searching algorithms perform. |
| |
| |
| 9. FURTHER ISSUES |
| |
| This document does not address a few issues: |
| |
| Secondary compressors: |
| As discussed in Section 4.3, certain sections in the delta encoding |
| of a window may be further compressed by a secondary compressor. |
| In our experience, the basic Vcdiff format is adequate for most |
| purposes so that secondary compressors are seldom needed. In |
| particular, for normal use of data differencing where the files to |
| be compared have long stretches of matches, much of the gain in |
| compression rate is already achieved by normal string matching. |
| Thus, the use of secondary compressors is seldom needed in this case. |
| However, for applications beyond differencing of such nearly identical |
| files, secondary compressors may be needed to achieve maximal |
| compressed results. |
| |
| Therefore, we recommend to leave the Vcdiff data format defined |
| as in this document so that the use of secondary compressors |
| can be implemented when they become needed in the future. |
| The formats of the compressed data via such compressors or any |
| compressors that may be defined in the future are left open to |
| their implementations. These could include Huffman encoding, |
| arithmetic encoding, and splay tree encoding [8,9]. |
| |
| Large file system vs. small file system: |
| As discussed in Section 4, a target window in a large file may be |
| compared against some source window in another file or in the same |
| file (from some earlier part). In that case, the file offset of the |
| source window is specified as a variable-sized integer in the delta |
| encoding. There is a possibility that the encoding was computed on |
| a system supporting much larger files than in a system where |
| the data may be decoded (e.g., 64-bit file systems vs. 32-bit file |
| systems). In that case, some target data may not be recoverable. |
| This problem could afflict any compression format, and ought |
| to be resolved with a generic negotiation mechanism in the |
| appropriate protocol(s). |
| |
| |
| 10. SUMMARY |
| |
| We have described Vcdiff, a general and portable encoding format for |
| compression and differencing. The format is good in that it allows |
| implementing a decoder without knowledge of the encoders. Further, |
| ignoring the use of secondary compressors not defined within the format, |
| the decoding algorithms runs in linear time and requires working space |
| proportional to window sizes. |
| |
| |
| |
| 11. ACKNOWLEDGEMENTS |
| |
| Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff |
| who provided much encouragement to publicize Vcdiff. In particular, Jeff |
| helped clarifying the description of the data format presented here. |
| |
| |
| |
| 12. SECURITY CONSIDERATIONS |
| |
| Vcdiff only provides a format to encode compressed and differenced data. |
| It does not address any issues concerning how such data are, in fact, |
| stored in a given file system or the run-time memory of a computer system. |
| Therefore, we do not anticipate any security issues with respect to Vcdiff. |
| |
| |
| |
| 13. SOURCE CODE AVAILABILITY |
| |
| Vcdiff is implemented as a data transforming method in Phong Vo's |
| Vcodex library. AT&T Corp. has made the source code for Vcodex available |
| for anyone to use to transmit data via HTTP/1.1 Delta Encoding [10,11]. |
| The source code and according license is accessible at the below URL: |
| |
| http://www.research.att.com/sw/tools |
| |
| |
| 14. INTELLECTUAL PROPERTY RIGHTS |
| |
| The IETF has been notified of intellectual property rights claimed in |
| regard to some or all of the specification contained in this |
| document. For more information consult the online list of claimed |
| rights, at <http://www.ietf.org/ipr.html>. |
| |
| The IETF takes no position regarding the validity or scope of any |
| intellectual property or other rights that might be claimed to |
| pertain to the implementation or use of the technology described in |
| this document or the extent to which any license under such rights |
| might or might not be available; neither does it represent that it |
| has made any effort to identify any such rights. Information on the |
| IETF's procedures with respect to rights in standards-track and |
| standards-related documentation can be found in BCP-11. Copies of |
| claims of rights made available for publication and any assurances of |
| licenses to be made available, or the result of an attempt made to |
| obtain a general license or permission for the use of such |
| proprietary rights by implementors or users of this specification can |
| be obtained from the IETF Secretariat. |
| |
| |
| |
| 15. IANA CONSIDERATIONS |
| |
| The Internet Assigned Numbers Authority (IANA) administers the number |
| space for Secondary Compressor ID values. Values and their meaning |
| must be documented in an RFC or other peer-reviewed, permanent, and |
| readily available reference, in sufficient detail so that |
| interoperability between independent implementations is possible. |
| Subject to these constraints, name assignments are First Come, First |
| Served - see RFC2434 [13]. Legal ID values are in the range 1..255. |
| |
| This document does not define any values in this number space. |
| |
| |
| 16. REFERENCES |
| |
| [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression, |
| Practical Reusable Unix Software, Editor B. Krishnamurthy, |
| John Wiley & Sons, Inc., 1995. |
| |
| [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data |
| Compression, IEEE Trans. on Information Theory, 23(3):337-343, 1977. |
| |
| [3] W. Tichy, The String-to-String Correction Problem with Block Moves, |
| ACM Transactions on Computer Systems, 2(4):309-321, November 1984. |
| |
| [4] E.M. McCreight, A Space-Economical Suffix Tree Construction |
| Algorithm, Journal of the ACM, 23:262-272, 1976. |
| |
| [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms, |
| IEEE Software Configuration and Maintenance Workshop, 1996. |
| |
| [6] J.J. Hunt, K.P. Vo, W. Tichy, Delta Algorithms: An Empirical Analysis, |
| ACM Trans. on Software Engineering and Methodology, 7:192-214, 1998. |
| |
| [7] D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library, |
| Proc. of the Summer '91 Usenix Conference, 1991. |
| |
| [8] D. W. Jones, Application of Splay Trees to Data Compression, |
| CACM, 31(8):996:1007. |
| |
| [9] M. Nelson, J. Gailly, The Data Compression Book, ISBN 1-55851-434-1, |
| M&T Books, New York, NY, 1995. |
| |
| [10] J.C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy, |
| Potential benefits of delta encoding and data compression for HTTP, |
| SIGCOMM '97, Cannes, France, 1997. |
| |
| [11] J.C. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann, |
| Y. Goland, and A. Van Hoff, Delta Encoding in HTTP, |
| IETF, draft-mogul-http-delta-10, 2001. |
| |
| [12] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, |
| RFC 2119, March 1997. |
| |
| [13] T. Narten, H. Alvestrand, Guidelines for Writing an IANA |
| Considerations Section in RFCs, RFC2434, October 1998. |
| |
| |
| |
| 17. AUTHOR'S ADDRESS |
| |
| Kiem-Phong Vo (main contact) |
| AT&T Labs, Room D223 |
| 180 Park Avenue |
| Florham Park, NJ 07932 |
| Email: [email protected] |
| Phone: 1 973 360 8630 |
| |
| David G. Korn |
| AT&T Labs, Room D237 |
| 180 Park Avenue |
| Florham Park, NJ 07932 |
| Email: [email protected] |
| Phone: 1 973 360 8602 |
| |
| Jeffrey C. Mogul |
| Western Research Laboratory |
| Compaq Computer Corporation |
| 250 University Avenue |
| Palo Alto, California, 94305, U.S.A. |
| Email: [email protected] |
| Phone: 1 650 617 3304 (email preferred) |
| |
| Joshua P. MacDonald |
| Computer Science Division |
| University of California, Berkeley |
| 345 Soda Hall |
| Berkeley, CA 94720 |
| Email: [email protected] |