Notes on `sharded-slab`'s implementation and design.

# Design

The sharded slab's design is strongly inspired by the ideas presented by
Leijen, Zorn, and de Moura in [Mimalloc: Free List Sharding in
Action][mimalloc]. In this report, the authors present a novel design for a
memory allocator based on a concept of _free list sharding_.

Memory allocators must keep track of what memory regions are not currently
allocated ("free") in order to provide them to future allocation requests.
The term [_free list_][freelist] refers to a technique for performing this
bookkeeping, where each free block stores a pointer to the next free block,
forming a linked list. The memory allocator keeps a pointer to the most
recently freed block, the _head_ of the free list. To allocate more memory,
the allocator pops from the free list by setting the head pointer to the
next free block of the current head block, and returning the previous head.
To deallocate a block, the block is pushed to the free list by setting its
first word to the current head pointer, and the head pointer is set to point
to the deallocated block. Most implementations of slab allocators backed by
arrays or vectors use a similar technique, where pointers are replaced by
indices into the backing array.
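
For illustration, here is a minimal sketch of that index-based free list
technique. The names (`Slab`, `Slot`, `free_head`) are hypothetical and do
not correspond to `sharded-slab`'s actual types:

```rust
/// A toy slab whose free list is threaded through the backing vector
/// itself: each free slot stores the index of the next free slot.
struct Slab<T> {
    slots: Vec<Slot<T>>,
    /// Index of the most recently freed slot, if any.
    free_head: Option<usize>,
}

enum Slot<T> {
    Occupied(T),
    Free { next: Option<usize> },
}

impl<T> Slab<T> {
    fn insert(&mut self, value: T) -> usize {
        match self.free_head {
            // Pop from the free list: the popped slot's `next` becomes
            // the new head, and the slot itself is reused.
            Some(idx) => {
                if let Slot::Free { next } = self.slots[idx] {
                    self.free_head = next;
                }
                self.slots[idx] = Slot::Occupied(value);
                idx
            }
            // Free list is empty: grow the backing vector instead.
            None => {
                self.slots.push(Slot::Occupied(value));
                self.slots.len() - 1
            }
        }
    }

    fn remove(&mut self, idx: usize) {
        // Push onto the free list: the freed slot points at the old
        // head and becomes the new head.
        self.slots[idx] = Slot::Free { next: self.free_head };
        self.free_head = Some(idx);
    }
}
```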

When allocations and deallocations can occur concurrently across threads,
they must synchronize accesses to the free list, either by putting the
entire allocator state inside of a lock, or by using atomic operations to
treat the free list as a lock-free structure (such as a [Treiber stack]). In
both cases, there is a significant performance cost — even when the free
list is lock-free, it is likely that a noticeable amount of time will be
spent in compare-and-swap loops. Ideally, the global synchronization point
created by the single global free list could be avoided as much as possible.
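
To make that cost concrete, here is a hedged sketch of a Treiber-style push
onto a lock-free, index-based free list. The names (`push_free`,
`next_links`) are illustrative, not `sharded-slab`'s API, and a production
version must also defend against the ABA problem:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Push freed slot `idx` onto a shared free list. `head` is the list
/// head; `next_links[i]` holds the next-pointer of slot `i`.
fn push_free(head: &AtomicUsize, next_links: &[AtomicUsize], idx: usize) {
    let mut current = head.load(Ordering::Relaxed);
    loop {
        // Link the freed slot to the current head...
        next_links[idx].store(current, Ordering::Relaxed);
        // ...then try to publish it as the new head. Every lost race
        // costs another trip around this compare-and-swap loop, which
        // is the contention cost described above.
        match head.compare_exchange_weak(current, idx, Ordering::Release, Ordering::Relaxed) {
            Ok(_) => return,
            Err(actual) => current = actual,
        }
    }
}
```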

The approach presented by Leijen, Zorn, and de Moura is to introduce
sharding and thus increase the granularity of synchronization significantly.
In mimalloc, the heap is _sharded_ so that each thread has its own
thread-local heap. Objects are always allocated from the local heap of the
thread where the allocation is performed. Because allocations are always
done from a thread's local heap, they need not be synchronized.

However, since objects can move between threads before being deallocated,
_deallocations_ may still occur concurrently. Therefore, Leijen et al.
introduce a concept of _local_ and _global_ free lists. When an object is
deallocated on the same thread it was originally allocated on, it is placed
on the local free list; if it is deallocated on another thread, it goes on
the global free list for the heap of the thread from which it originated. To
allocate, the local free list is used first; if it is empty, the entire
global free list is popped onto the local free list. Since the local free
list is only ever accessed by the thread it belongs to, it does not require
synchronization at all, and because the global free list is popped from
infrequently, the cost of synchronization has a reduced impact. A majority
of allocations can occur without any synchronization at all, and
deallocations only require synchronization when an object has left its
parent thread (a relatively uncommon case).
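
A hedged sketch of that allocation path, using hypothetical names
(`local_head`, `global_head`, `next_links`); the crate described below
tracks these heads per page rather than per heap, but the
pop-local-else-drain-global logic is the core idea:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Sentinel index meaning "empty list".
const NIL: usize = usize::MAX;

/// Illustrative only: one thread's heap shard, with an unsynchronized
/// local free list and an atomic global (remote) free list.
struct Heap {
    /// Touched only by the owning thread, so no synchronization.
    local_head: usize,
    /// Pushed to by *other* threads when they free this heap's objects.
    global_head: AtomicUsize,
    /// Per-slot next-pointers. (A real implementation would store these
    /// atomically so that remote threads can push safely.)
    next_links: Vec<usize>,
}

impl Heap {
    /// Pop a free slot, preferring the unsynchronized local list.
    fn pop_free(&mut self) -> Option<usize> {
        if self.local_head == NIL {
            // Local list is empty: steal the *entire* global list with a
            // single atomic swap. This is the only synchronized step on
            // the allocation path, and it runs infrequently.
            self.local_head = self.global_head.swap(NIL, Ordering::Acquire);
        }
        if self.local_head == NIL {
            return None; // both lists are empty
        }
        let idx = self.local_head;
        self.local_head = self.next_links[idx];
        Some(idx)
    }

    /// Free a slot from the owning thread: no synchronization at all.
    fn free_local(&mut self, idx: usize) {
        self.next_links[idx] = self.local_head;
        self.local_head = idx;
    }
}
```

A deallocation from another thread would instead push onto `global_head`
with a compare-and-swap loop like the one sketched earlier.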

[mimalloc]: https://www.microsoft.com/en-us/research/uploads/prod/2019/06/mimalloc-tr-v1.pdf
[freelist]: https://en.wikipedia.org/wiki/Free_list
[Treiber stack]: https://en.wikipedia.org/wiki/Treiber_stack

# Implementation

A slab is represented as an array of [`MAX_THREADS`] _shards_. A shard
consists of a vector of one or more _pages_ plus associated metadata.
Finally, a page consists of an array of _slots_, plus the head indices for
the local and remote free lists.

```text
┌─────────────┐
│ shard 1     │
│             │    ┌─────────────┐        ┌────────┐
│ pages───────┼───▶│ page 1      │        │        │
├─────────────┤    ├─────────────┤  ┌────▶│  next──┼─┐
│ shard 2     │    │ page 2      │  │     ├────────┤ │
├─────────────┤    │             │  │     │XXXXXXXX│ │
│ shard 3     │    │ local_head──┼──┘     ├────────┤ │
└─────────────┘    │ remote_head─┼──┐     │        │◀┘
      ...          ├─────────────┤  │     │  next──┼─┐
┌─────────────┐    │ page 3      │  │     ├────────┤ │
│ shard n     │    └─────────────┘  │     │XXXXXXXX│ │
└─────────────┘          ...        │     ├────────┤ │
                   ┌─────────────┐  │     │XXXXXXXX│ │
                   │ page n      │  │     ├────────┤ │
                   └─────────────┘  │     │        │◀┘
                                    └────▶│  next──┼───▶ ...
                                          ├────────┤
                                          │XXXXXXXX│
                                          └────────┘
```
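
In code, that layout corresponds roughly to the following shapes. This is a
hedged sketch with hypothetical field types, not `sharded-slab`'s actual
definitions (in particular, the real crate wraps these fields in the
appropriate atomic and interior-mutability types):

```rust
use std::sync::atomic::AtomicUsize;

/// One shard per thread, up to `MAX_THREADS`.
struct Slab {
    shards: Box<[Shard]>,
}

struct Shard {
    /// Each page is twice the size of the page before it.
    pages: Vec<Page>,
}

struct Page {
    /// Head of the local free list; only the owning thread touches it.
    local_head: usize,
    /// Head of the remote free list; other threads push freed slots here.
    remote_head: AtomicUsize,
    slots: Box<[Slot]>,
}

struct Slot {
    /// When free, the index of the next free slot in the page.
    next: usize,
    // ...plus the stored value and per-slot metadata.
}
```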

The size of the first page in a shard is always a power of two, and every
subsequent page added after the first is twice as large as the page that
precedes it.

```text

  pg.
┌───┐   ┌─┬─┐
│ 0 │───▶ │ │
├───┤   ├─┼─┼─┬─┐
│ 1 │───▶ │ │ │ │
├───┤   ├─┼─┼─┼─┼─┬─┬─┬─┐
│ 2 │───▶ │ │ │ │ │ │ │ │
├───┤   ├─┼─┼─┼─┼─┼─┼─┼─┼─┬─┬─┬─┬─┬─┬─┬─┐
│ 3 │───▶ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
└───┘   └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
```
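
Equivalently, the capacity of the page at index `i` is the initial page
size doubled `i` times. A one-line sketch, assuming an illustrative
`INITIAL_PAGE_SIZE` of 32:

```rust
const INITIAL_PAGE_SIZE: usize = 32; // illustrative value

/// Capacity of the page at index `i`: the initial size doubled `i` times.
fn page_size(i: usize) -> usize {
    INITIAL_PAGE_SIZE << i
}
```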

When searching for a free slot, the smallest page is searched first, and if
it is full, the search proceeds to the next page until either a free slot is
found or all available pages have been searched. If all available pages have
been searched and the maximum number of pages has not yet been reached, a
new page is then allocated.
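
Continuing the hypothetical shapes sketched earlier, the search might look
like the following (assuming a `Page::pop_free` method and a
`Page::with_capacity` constructor, neither of which is the crate's real
API):

```rust
/// Illustrative only: find a free slot, searching pages smallest-first
/// and growing the shard by one doubled page when every page is full.
fn alloc_slot(shard: &mut Shard, max_pages: usize) -> Option<usize> {
    for page in shard.pages.iter_mut() {
        if let Some(slot) = page.pop_free() {
            return Some(slot);
        }
    }
    // Every existing page is full: allocate a new page twice the size
    // of the previous one, unless the page limit has been reached.
    if shard.pages.len() < max_pages {
        let size = INITIAL_PAGE_SIZE << shard.pages.len();
        shard.pages.push(Page::with_capacity(size));
        return shard.pages.last_mut().and_then(Page::pop_free);
    }
    None
}
```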

Since every page is twice as large as the previous page, and all page sizes
are powers of two, we can determine the page index that contains a given
address by shifting the address down by the smallest page size and looking
at how many twos places are necessary to represent that number. This tells
us what power-of-two page size the address fits inside of. We can determine
the number of twos places by counting the number of leading zeros (unused
twos places) in the number's binary representation, and subtracting that
count from the total number of bits in a word.

The formula for determining the page number that contains an offset is thus:

```rust,ignore
WIDTH - ((offset + INITIAL_PAGE_SIZE) >> INDEX_SHIFT).leading_zeros()
```

where `WIDTH` is the number of bits in a `usize`, and `INDEX_SHIFT` is:

```rust,ignore
INITIAL_PAGE_SIZE.trailing_zeros() + 1
```
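
As a hedged, self-contained check of the formula, with an illustrative
`INITIAL_PAGE_SIZE` of 32 (so that page 0 holds offsets `0..32`, page 1
holds `32..96`, page 2 holds `96..224`, and so on):

```rust
const WIDTH: u32 = usize::BITS;
const INITIAL_PAGE_SIZE: usize = 32; // illustrative value
const INDEX_SHIFT: u32 = INITIAL_PAGE_SIZE.trailing_zeros() + 1;

/// Page index containing the slot at `offset`, per the formula above.
fn page_index(offset: usize) -> u32 {
    WIDTH - ((offset + INITIAL_PAGE_SIZE) >> INDEX_SHIFT).leading_zeros()
}

fn main() {
    assert_eq!(page_index(0), 0); // first slot of page 0
    assert_eq!(page_index(31), 0); // last slot of page 0 (32 slots)
    assert_eq!(page_index(32), 1); // first slot of page 1 (64 slots)
    assert_eq!(page_index(95), 1); // last slot of page 1
    assert_eq!(page_index(96), 2); // first slot of page 2 (128 slots)
}
```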

[`MAX_THREADS`]: https://docs.rs/sharded-slab/latest/sharded_slab/trait.Config.html#associatedconstant.MAX_THREADS