vendor/tendril-0.4.3/README.md - toolchain/rustc - Git at Google

 # tendril

 **Warning**: This library is at a very early stage of development, and it
 contains a substantial amount of `unsafe` code. Use at your own risk!

 [![Build Status](https://github.com/servo/tendril/workflows/CI/badge.svg)](https://github.com/servo/tendril/actions)

 [API Documentation](https://doc.servo.org/tendril/index.html)

 ## Introduction

 `Tendril` is a compact string/buffer type, optimized for zero-copy parsing.
 Tendrils have the semantics of owned strings, but are sometimes views into
 shared buffers. When you mutate a tendril, an owned copy is made if necessary.
 Further mutations occur in-place until the string becomes shared, e.g. with
 `clone()` or `subtendril()`.

 Buffer sharing is accomplished through thread-local (non-atomic) reference
 counting, which has very low overhead. The Rust type system will prevent
 you at compile time from sending a tendril between threads. (See below
 for thoughts on relaxing this restriction.)

 Whereas `String` allocates in the heap for any non-empty string, `Tendril` can
 store small strings (up to 8 bytes) in-line, without a heap allocation.
 `Tendril` is also smaller than `String` on 64-bit platforms — 16 bytes versus
 24. `Option<Tendril>` is the same size as `Tendril`, thanks to
 [`NonZero`][NonZero].

 The maximum length of a tendril is 4 GB. The library will panic if you attempt
 to go over the limit.

 ## Formats and encoding

 `Tendril` uses
 [phantom types](https://doc.rust-lang.org/stable/rust-by-example/generics/phantom.html)
 to track a buffer's format. This determines at compile time which
 operations are available on a given tendril. For example, `Tendril<UTF8>` and
 `Tendril<Bytes>` can be borrowed as `&str` and `&[u8]` respectively.

 `Tendril` also integrates with
 [rust-encoding](https://github.com/lifthrasiir/rust-encoding) and has
 preliminary support for [WTF-8][] buffers.

 ## Plans for the future

 ### Ropes

 [html5ever][] will use `Tendril` as a zero-copy text representation. It would
 be good to preserve this all the way through to Servo's DOM. This would reduce
 memory consumption, and possibly speed up text shaping and painting. However,
 DOM text may conceivably be larger than 4 GB, and will anyway not be contiguous
 in memory around e.g. a character entity reference.

 *Solution:* Build a **[rope][] on top of these strings** and use that as
 Servo's representation of DOM text. We can perhaps do text shaping and/or
 painting in parallel for different chunks of a rope. html5ever can additionally
 use this rope type as a replacement for `BufferQueue`.

 Because the underlying buffers are reference-counted, the bulk of this rope
 is already a [persistent data structure][]. Consider what happens when
 appending two ropes to get a "new" rope. A vector-backed rope would copy a
 vector of small structs, one for each chunk, and would bump the corresponding
 refcounts. But it would not copy any of the string data.

 If we want more sharing, then a [2-3 finger tree][] could be a good choice.
 We would probably stick with `VecDeque` for ropes under a certain size.

 ### UTF-16 compatibility

 SpiderMonkey expects text to be in UCS-2 format for the most part. The
 semantics of JavaScript strings are difficult to implement on UTF-8. This also
 applies to HTML parsing via `document.write`. Also, passing SpiderMonkey a
 string that isn't contiguous in memory will incur additional overhead and
 complexity, if not a full copy.

 *Solution:* Use **WTF-8 in parsing** and in the DOM. Servo will **convert to
 contiguous UTF-16 when necessary**.  The conversion can easily be parallelized,
 if we find a practical need to convert huge chunks of text all at once.

 ### Source span information

 Some html5ever API consumers want to know the originating location in the HTML
 source file(s) of each token or parse error. An example application would be a
 command-line HTML validator with diagnostic output similar to `rustc`'s.

 *Solution:* Accept **some metadata along with each input string**. The type of
 metadata is chosen by the API consumer; it defaults to `()`, which has size
 zero. For any non-inline string, we can provide the associated metadata as well
 as a byte offset.

 [NonZero]: https://doc.rust-lang.org/core/nonzero/struct.NonZero.html
 [html5ever]: https://github.com/servo/html5ever
 [WTF-8]: https://simonsapin.github.io/wtf-8/
 [rope]: https://en.wikipedia.org/wiki/Rope_%28data_structure%29
 [persistent data structure]: https://en.wikipedia.org/wiki/Persistent_data_structure
 [2-3 finger tree]: https://www.staff.city.ac.uk/~ross/papers/FingerTree.html
	# tendril

	Warning: This library is at a very early stage of development, and it
	contains a substantial amount of `unsafe` code. Use at your own risk!

	[![Build Status](https://github.com/servo/tendril/workflows/CI/badge.svg)](https://github.com/servo/tendril/actions)

	[API Documentation](https://doc.servo.org/tendril/index.html)

	## Introduction

	`Tendril` is a compact string/buffer type, optimized for zero-copy parsing.
	Tendrils have the semantics of owned strings, but are sometimes views into
	shared buffers. When you mutate a tendril, an owned copy is made if necessary.
	Further mutations occur in-place until the string becomes shared, e.g. with
	`clone()` or `subtendril()`.

	Buffer sharing is accomplished through thread-local (non-atomic) reference
	counting, which has very low overhead. The Rust type system will prevent
	you at compile time from sending a tendril between threads. (See below
	for thoughts on relaxing this restriction.)

	Whereas `String` allocates in the heap for any non-empty string, `Tendril` can
	store small strings (up to 8 bytes) in-line, without a heap allocation.
	`Tendril` is also smaller than `String` on 64-bit platforms — 16 bytes versus
	24. `Option<Tendril>` is the same size as `Tendril`, thanks to
	[`NonZero`][NonZero].

	The maximum length of a tendril is 4 GB. The library will panic if you attempt
	to go over the limit.

	## Formats and encoding

	`Tendril` uses
	[phantom types](https://doc.rust-lang.org/stable/rust-by-example/generics/phantom.html)
	to track a buffer's format. This determines at compile time which
	operations are available on a given tendril. For example, `Tendril<UTF8>` and
	`Tendril<Bytes>` can be borrowed as `&str` and `&[u8]` respectively.

	`Tendril` also integrates with
	[rust-encoding](https://github.com/lifthrasiir/rust-encoding) and has
	preliminary support for [WTF-8][] buffers.

	## Plans for the future

	### Ropes

	[html5ever][] will use `Tendril` as a zero-copy text representation. It would
	be good to preserve this all the way through to Servo's DOM. This would reduce
	memory consumption, and possibly speed up text shaping and painting. However,
	DOM text may conceivably be larger than 4 GB, and will anyway not be contiguous
	in memory around e.g. a character entity reference.

	Solution: Build a [rope][] on top of these strings and use that as
	Servo's representation of DOM text. We can perhaps do text shaping and/or
	painting in parallel for different chunks of a rope. html5ever can additionally
	use this rope type as a replacement for `BufferQueue`.

	Because the underlying buffers are reference-counted, the bulk of this rope
	is already a [persistent data structure][]. Consider what happens when
	appending two ropes to get a "new" rope. A vector-backed rope would copy a
	vector of small structs, one for each chunk, and would bump the corresponding
	refcounts. But it would not copy any of the string data.

	If we want more sharing, then a [2-3 finger tree][] could be a good choice.
	We would probably stick with `VecDeque` for ropes under a certain size.

	### UTF-16 compatibility

	SpiderMonkey expects text to be in UCS-2 format for the most part. The
	semantics of JavaScript strings are difficult to implement on UTF-8. This also
	applies to HTML parsing via `document.write`. Also, passing SpiderMonkey a
	string that isn't contiguous in memory will incur additional overhead and
	complexity, if not a full copy.

	Solution: Use WTF-8 in parsing and in the DOM. Servo will **convert to
	contiguous UTF-16 when necessary**. The conversion can easily be parallelized,
	if we find a practical need to convert huge chunks of text all at once.

	### Source span information

	Some html5ever API consumers want to know the originating location in the HTML
	source file(s) of each token or parse error. An example application would be a
	command-line HTML validator with diagnostic output similar to `rustc`'s.

	Solution: Accept some metadata along with each input string. The type of
	metadata is chosen by the API consumer; it defaults to `()`, which has size
	zero. For any non-inline string, we can provide the associated metadata as well
	as a byte offset.

	[NonZero]: https://doc.rust-lang.org/core/nonzero/struct.NonZero.html
	[html5ever]: https://github.com/servo/html5ever
	[WTF-8]: https://simonsapin.github.io/wtf-8/
	[rope]: https://en.wikipedia.org/wiki/Rope_%28data_structure%29
	[persistent data structure]: https://en.wikipedia.org/wiki/Persistent_data_structure
	[2-3 finger tree]: https://www.staff.city.ac.uk/~ross/papers/FingerTree.html