s2storage/README.md - platform/packages/modules/GeoTZ - Git at Google

 This directory contains files related to storage of TZ S2 data files used for offline geolocation
 time zone detection.

 High level file structure
 =========================

   * `src/readonly/` - host + device code for reading TZ S2 data files
   * `src/write/`    - host code for writing TZ S2 data files
   * `src/test/`     - host tests for readonly/ and write/ code
   * `tools/`        - host tooling to support generation / debugging / testing of TZ S2 data
                       files.

 Block file format information
 =============================

 A "block file" is a general-purpose file format containing a small amount of header information,
 blocks of data, and metadata about those blocks. All types are big-endian / network byte ordered.

 Blocks are assigned contiguous IDs which are numbered from zero.

 1. The file header has a type-identifying magic, and a version.
 2. Next are the block infos, which hold metadata about all blocks in the file such as their ID,
 a type ID, (optional) arbitrary "extra" bytes, and size / offset information for the block.
 3. Lastly, come the blocks of data themselves. Blocks can be zero-length, in which case they take up
 no space in the file.

 Packed tables
 =============

 Packed tables are a way of arranging block data to store tables of key-ordered key / value pairs in
 a compact way. Each entry in the table takes up a fixed number of bytes.

 Packed tables may contain some (optional) shared information that applies to all records in the
 table. Then they contain one or more fixed length `{key}`/`{value}` records of `{R}` bits sorted by
 `{key}`.  The table data is easily memory mapped and each record can be randomly accessed by
 `{key}`.

 TZ S2 data file format information
 ==================================

 The TZ S2 data file is a type of block file that uses packed tables. It overlays additional rules,
 block types, etc. on top of the general block file format.

 High level description
 ----------------------

 The TZ S2 data file format is intended to be language neutral (e.g. Java or C code could easily be
 written to read it with minimal dependencies), providing a good balance between file size,
 extensibility and memory usage at runtime.

 It is designed to be flexible and parameterized / self describing: the header block contains format
 information needed to locate and interpret the data storage blocks.

 The file stores time zone geolocation data at a single S2 level. Logically, the data consists of:

 ```
 {start S2 cell ID (inclusive)}, {end S2 cell ID (exclusive)}, {time zone IDs}
 ```

 The main usecase of the file is to lookup the time zone ID(s) (if any) for a given S2 cell ID.

 General storage approach
 ------------------------

 To keep the file size to a minimum, the format avoids storing 64-bit S2 cell IDs directly. Instead,
 the logical input data is mapped to a data structure that allows the individual S2 cell IDs to be
 stored using only a subset of the bits needed to store a full S2 cell ID.

 Each logical S2 range data described above is subdivided into ranges with a common S2 cell ID
 prefix. Any ranges in the source data that span a prefix are split into smaller ranges to preserve
 the "same prefix" property. All ranges with a common prefix can be stored in one "suffix table".

 The time zone ID strings are also only stored once and are referenced indirectly, avoiding repeated
 storage of common strings.

 TZ S2 data lookup
 -----------------

 Suffix table block IDs are calculated by taking the prefix of the S2 cell ID being sought and
 applying a fixed offset. The block info and block data for the cell's suffix table can be accessed
 directly using the block's ID.

 Specifically:

 The `{prefix}` is computed by extracting the first `{X}` bits of the S2 cell ID. The `{prefix}` is
 used to obtain the `{block ID}` of the block used to store the suffix table. The `{block ID}` is
 calculated by adding a fixed offset (obtained from the header block) to the cell ID `{prefix}`.

 The `{block ID}` is first used to look lookup the block info. If the length of the block with
 `{block ID}` is zero, the lookup stops at this point as it means there are no ranges for `{prefix}`.

 When the `{block ID}` block is non-zero length, the block is interpreted as a packed table
 (described above) which stores the suffix table's entries.

 The `{suffix}`, the final `{Y}` bits of the search S2 cell ID, is used to seek for a record
 containing the s2 range holding that cell ID, if any. The `{suffix}` will match either no records or
 only one record in the table.

 For more information on suffix table storage, see the Suffix Table Blocks section below.

 The Header Block
 ----------------

 The header block is always required to be block zero and have the expected type
 ID (1).

 The header contains format information needed to address the suffix table blocks
 and the block format information needed to interpret those blocks. It also
 contains information shared by all blocks such as the TZ ID sets.

 TZ ID Sets storage
 ------------------

 Sets of one or more time zone IDs are referenced by every range stored in the TZ S2 data file.

 Individual time zone IDs are strings like "America/Los_Angeles" that should only be stored once to
 conserve space.

 Further, the time zone IDs are referenced as sets, e.g. one cell range may reference
 "Europe/London", another may reference "Europe/Paris" and another may reference
 both "Europe/London" and "Europe/Paris".

 It is important to keep the number of bits used in each suffix table entry to a
 minimum because there are hundreds of thousands of ranges globally and hundreds
 of distinct sets. To do this we make use of "banking", which leverages
 properties of the data.

 For example:

 1. Several ranges with S2 cell IDs close together may reference the same set - e.g. there
 will be several range entries that reference "Europe/London".
 2. There is unlikely to a single S2 cell that needs to reference both "America/Los_Angeles" and
 "Europe/London", since the areas covered by those time zone IDs are geographically separate.

 Consequently:

 Every time zone ID string referenced in the file is assigned a numeric ID. The string is stored once
 in the file in the header block. All references to time zone IDs are made via the numeric ID.

 e.g.
 ```
 1: Europe/London
 2: Europe/Paris
 3: ...
 ```

 Every TZ S2 data file has one or more "set banks". Each bank contains an array of `{set of time zone
 IDs}`.

 A bank may contain many sets, which various combinations of TZ IDs:
 ```
 1: {1}   - meaning the set {"Europe/London"}
 2: {2}   - meaning the set {"Europe/Paris"}
 3: {1,2} - meaning the set {"Europe/London", "Europe/Paris"}
 ...
 ```

 Via this indirection and banking, each range entry can address a set of time zone ID strings using
 only a numeric bank ID and a numeric set ID. The bank ID is the same for all entries in suffix
 table, so this means that it can be stored once per table and only the (small) `{TZ IDs set ID}`
 needs to be stored with each entry.

 Suffix Table Blocks
 -------------------

 The suffix table block is a packed table with shared information and one or more entries.

 The shared information consists of:

 ```
 {TZ IDs set bank}       - the bank ID of the TZ IDs set bank used when looking up time zone set IDs
                           referenced by all table entries.
 ```

 Each record in the suffix table logically holds entries consisting of:
 ```
 {start S2 cell ID (inclusive)}, {end S2 cell ID (exclusive)}, {time zone IDs}`
 ```

 As with any packed table, each record in the packed table has a fixed width of `{R}` bits. The first
 `{M}` bits of every record are used to store the (ordered) `{key}`.

 The `{key}` for an entry contains only the `{suffix}` bits from the `{start S2 cell ID
 (inclusive)}`. To reconstruct the `{start S2 cell ID (inclusive)}` it's only necessary to know
 the `{prefix}` for the table and the `{key}`.

 The remaining (`{R-M}`) bits are used to store the ``{value}``. The ``{value}`` is further
 sub-divided into two: the `{end S2 cell ID offset}` and the `{TZ IDs set ID}`.

 The `{end S2 cell ID offset}` is a transformation of the `{end S2 cell ID (exclusive)}`. The `{end
 S2 cell ID}` can be calculated by adding the `{end S2 cell ID offset}` to the `{start S2 cell ID
 (inclusive)}`.

 When searching for an S2 cell ID, the prefix is used to locate the correct suffix table. The suffix
 bits from the S2 cell ID can be extracted. Since all data in the table is held at a single S2 level,
 the suffix bits can be used to binary search the suffix table entries by looking for an entry
 containing the suffix bits, i.e. by comparing the suffix bits against the `{key}` and `{key}` +
 `{end S2 cell ID offset}` value.

 If an entry is found, the `{TZ set ID}` indirectly leads to the `{time zone IDs}` for the range. For
 more information see TZ ID Sets storage above.
	This directory contains files related to storage of TZ S2 data files used for offline geolocation
	time zone detection.

	High level file structure
	=========================

	* `src/readonly/` - host + device code for reading TZ S2 data files
	* `src/write/` - host code for writing TZ S2 data files
	* `src/test/` - host tests for readonly/ and write/ code
	* `tools/` - host tooling to support generation / debugging / testing of TZ S2 data
	files.

	Block file format information
	=============================

	A "block file" is a general-purpose file format containing a small amount of header information,
	blocks of data, and metadata about those blocks. All types are big-endian / network byte ordered.

	Blocks are assigned contiguous IDs which are numbered from zero.

	1. The file header has a type-identifying magic, and a version.
	2. Next are the block infos, which hold metadata about all blocks in the file such as their ID,
	a type ID, (optional) arbitrary "extra" bytes, and size / offset information for the block.
	3. Lastly, come the blocks of data themselves. Blocks can be zero-length, in which case they take up
	no space in the file.

	Packed tables
	=============

	Packed tables are a way of arranging block data to store tables of key-ordered key / value pairs in
	a compact way. Each entry in the table takes up a fixed number of bytes.

	Packed tables may contain some (optional) shared information that applies to all records in the
	table. Then they contain one or more fixed length `{key}`/`{value}` records of `{R}` bits sorted by
	`{key}`. The table data is easily memory mapped and each record can be randomly accessed by
	`{key}`.

	TZ S2 data file format information
	==================================

	The TZ S2 data file is a type of block file that uses packed tables. It overlays additional rules,
	block types, etc. on top of the general block file format.

	High level description
	----------------------

	The TZ S2 data file format is intended to be language neutral (e.g. Java or C code could easily be
	written to read it with minimal dependencies), providing a good balance between file size,
	extensibility and memory usage at runtime.

	It is designed to be flexible and parameterized / self describing: the header block contains format
	information needed to locate and interpret the data storage blocks.

	The file stores time zone geolocation data at a single S2 level. Logically, the data consists of:

	```
	{start S2 cell ID (inclusive)}, {end S2 cell ID (exclusive)}, {time zone IDs}
	```

	The main usecase of the file is to lookup the time zone ID(s) (if any) for a given S2 cell ID.

	General storage approach
	------------------------

	To keep the file size to a minimum, the format avoids storing 64-bit S2 cell IDs directly. Instead,
	the logical input data is mapped to a data structure that allows the individual S2 cell IDs to be
	stored using only a subset of the bits needed to store a full S2 cell ID.

	Each logical S2 range data described above is subdivided into ranges with a common S2 cell ID
	prefix. Any ranges in the source data that span a prefix are split into smaller ranges to preserve
	the "same prefix" property. All ranges with a common prefix can be stored in one "suffix table".

	The time zone ID strings are also only stored once and are referenced indirectly, avoiding repeated
	storage of common strings.

	TZ S2 data lookup
	-----------------

	Suffix table block IDs are calculated by taking the prefix of the S2 cell ID being sought and
	applying a fixed offset. The block info and block data for the cell's suffix table can be accessed
	directly using the block's ID.

	Specifically:

	The `{prefix}` is computed by extracting the first `{X}` bits of the S2 cell ID. The `{prefix}` is
	used to obtain the `{block ID}` of the block used to store the suffix table. The `{block ID}` is
	calculated by adding a fixed offset (obtained from the header block) to the cell ID `{prefix}`.

	The `{block ID}` is first used to look lookup the block info. If the length of the block with
	`{block ID}` is zero, the lookup stops at this point as it means there are no ranges for `{prefix}`.

	When the `{block ID}` block is non-zero length, the block is interpreted as a packed table
	(described above) which stores the suffix table's entries.

	The `{suffix}`, the final `{Y}` bits of the search S2 cell ID, is used to seek for a record
	containing the s2 range holding that cell ID, if any. The `{suffix}` will match either no records or
	only one record in the table.

	For more information on suffix table storage, see the Suffix Table Blocks section below.

	The Header Block
	----------------

	The header block is always required to be block zero and have the expected type
	ID (1).

	The header contains format information needed to address the suffix table blocks
	and the block format information needed to interpret those blocks. It also
	contains information shared by all blocks such as the TZ ID sets.

	TZ ID Sets storage
	------------------

	Sets of one or more time zone IDs are referenced by every range stored in the TZ S2 data file.

	Individual time zone IDs are strings like "America/Los_Angeles" that should only be stored once to
	conserve space.

	Further, the time zone IDs are referenced as sets, e.g. one cell range may reference
	"Europe/London", another may reference "Europe/Paris" and another may reference
	both "Europe/London" and "Europe/Paris".

	It is important to keep the number of bits used in each suffix table entry to a
	minimum because there are hundreds of thousands of ranges globally and hundreds
	of distinct sets. To do this we make use of "banking", which leverages
	properties of the data.

	For example:

	1. Several ranges with S2 cell IDs close together may reference the same set - e.g. there
	will be several range entries that reference "Europe/London".
	2. There is unlikely to a single S2 cell that needs to reference both "America/Los_Angeles" and
	"Europe/London", since the areas covered by those time zone IDs are geographically separate.

	Consequently:

	Every time zone ID string referenced in the file is assigned a numeric ID. The string is stored once
	in the file in the header block. All references to time zone IDs are made via the numeric ID.

	e.g.
	```
	1: Europe/London
	2: Europe/Paris
	3: ...
	```

	Every TZ S2 data file has one or more "set banks". Each bank contains an array of `{set of time zone
	IDs}`.

	A bank may contain many sets, which various combinations of TZ IDs:
	```
	1: {1} - meaning the set {"Europe/London"}
	2: {2} - meaning the set {"Europe/Paris"}
	3: {1,2} - meaning the set {"Europe/London", "Europe/Paris"}
	...
	```

	Via this indirection and banking, each range entry can address a set of time zone ID strings using
	only a numeric bank ID and a numeric set ID. The bank ID is the same for all entries in suffix
	table, so this means that it can be stored once per table and only the (small) `{TZ IDs set ID}`
	needs to be stored with each entry.

	Suffix Table Blocks
	-------------------

	The suffix table block is a packed table with shared information and one or more entries.

	The shared information consists of:

	```
	{TZ IDs set bank} - the bank ID of the TZ IDs set bank used when looking up time zone set IDs
	referenced by all table entries.
	```

	Each record in the suffix table logically holds entries consisting of:
	```
	{start S2 cell ID (inclusive)}, {end S2 cell ID (exclusive)}, {time zone IDs}`
	```

	As with any packed table, each record in the packed table has a fixed width of `{R}` bits. The first
	`{M}` bits of every record are used to store the (ordered) `{key}`.

	The `{key}` for an entry contains only the `{suffix}` bits from the `{start S2 cell ID
	(inclusive)}`. To reconstruct the `{start S2 cell ID (inclusive)}` it's only necessary to know
	the `{prefix}` for the table and the `{key}`.

	The remaining (`{R-M}`) bits are used to store the ``{value}``. The ``{value}`` is further
	sub-divided into two: the `{end S2 cell ID offset}` and the `{TZ IDs set ID}`.

	The `{end S2 cell ID offset}` is a transformation of the `{end S2 cell ID (exclusive)}`. The `{end
	S2 cell ID}` can be calculated by adding the `{end S2 cell ID offset}` to the `{start S2 cell ID
	(inclusive)}`.

	When searching for an S2 cell ID, the prefix is used to locate the correct suffix table. The suffix
	bits from the S2 cell ID can be extracted. Since all data in the table is held at a single S2 level,
	the suffix bits can be used to binary search the suffix table entries by looking for an entry
	containing the suffix bits, i.e. by comparing the suffix bits against the `{key}` and `{key}` +
	`{end S2 cell ID offset}` value.

	If an entry is found, the `{TZ set ID}` indirectly leads to the `{time zone IDs}` for the range. For
	more information see TZ ID Sets storage above.