regex-automata
==============
A low level regular expression library that uses deterministic finite automata.
It supports a rich syntax with Unicode support, has extensive options for
configuring the best space vs time trade off for your use case and provides
support for cheap deserialization of automata for use in `no_std` environments.

[Build status](https://github.com/BurntSushi/regex-automata/actions)
[crates.io](https://crates.io/crates/regex-automata)

Dual-licensed under MIT or the [UNLICENSE](https://unlicense.org/).


### Documentation

https://docs.rs/regex-automata


### Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
regex-automata = "0.1"
```

and this to your crate root (if you're using Rust 2015):

```rust
extern crate regex_automata;
```

### Example: basic regex searching

This example shows how to compile a regex using the default configuration
and then use it to find matches in a byte string:

```rust
use regex_automata::Regex;

let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
let text = b"2018-12-24 2016-10-08";
let matches: Vec<(usize, usize)> = re.find_iter(text).collect();
assert_eq!(matches, vec![(0, 10), (11, 21)]);
```

For more examples and information about the various knobs that can be turned,
please see the [docs](https://docs.rs/regex-automata).


### Support for `no_std`

This crate comes with a `std` feature that is enabled by default. When the
`std` feature is enabled, the API of this crate will include the facilities
necessary for compiling, serializing, deserializing and searching with regular
expressions. When the `std` feature is disabled, the API of this crate will
shrink such that it only includes the facilities necessary for deserializing
and searching with regular expressions.

The intended workflow for `no_std` environments is thus as follows:

* Write a program with the `std` feature that compiles and serializes a
  regular expression. Serialization should only happen after first converting
  the DFAs to use a fixed size state identifier instead of the default `usize`.
  You may also need to serialize both little and big endian versions of each
  DFA. (So that's 4 DFAs in total for each regex.)
* In your `no_std` environment, follow the examples above for deserializing
  your previously serialized DFAs into regexes. You can then search with them
  as you would any regex.

Deserialization can happen anywhere. For example, with bytes embedded into a
binary or with a file memory mapped at runtime.

Note that the
[`ucd-generate`](https://github.com/BurntSushi/ucd-generate)
tool will do the first step for you with its `dfa` or `regex` sub-commands.


### Cargo features

* `std` - **Enabled** by default. This enables the ability to compile finite
  automata. This requires the `regex-syntax` dependency. Without this feature
  enabled, finite automata can only be used for searching (using the approach
  described above).
* `transducer` - **Disabled** by default. This provides implementations of the
  `Automaton` trait found in the `fst` crate. This permits using finite
  automata generated by this crate to search finite state transducers. This
  requires the `fst` dependency.
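For the deserialize-and-search-only configuration described above, the dependency declaration would look like this (the version number is taken from the usage section; this is the standard Cargo mechanism for disabling a default feature):

```toml
[dependencies]
regex-automata = { version = "0.1", default-features = false }
```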


### Differences with the regex crate

The main goal of the [`regex`](https://docs.rs/regex) crate is to serve as a
general purpose regular expression engine. It aims to automatically balance low
compile times, fast search times and low memory usage, while also providing
a convenient API for users. In contrast, this crate provides a lower level
regular expression interface that is a bit less convenient while providing more
explicit control over memory usage and search times.

Here are some specific negative differences:

* **Compilation can take an exponential amount of time and space** in the size
  of the regex pattern. While most patterns do not exhibit worst case
  exponential time, such patterns do exist. For example, `[01]*1[01]{N}` will
  build a DFA with `2^(N+1)` states. For this reason, untrusted patterns should
  not be compiled with this library. (In the future, the API may expose an
  option to return an error if the DFA gets too big.)
* This crate does not support sub-match extraction, which can be achieved with
  the regex crate's "captures" API. This may be added in the future, but is
  unlikely.
* While the regex crate doesn't necessarily sport fast compilation times, the
  regexes in this crate are almost universally slow to compile, especially when
  they contain large Unicode character classes. For example, on my system,
  compiling `\w{3}` with byte classes enabled takes just over 1 second and
  almost 5MB of memory! (Compiling a sparse regex takes about the same time
  but only uses about 500KB of memory.) Conversely, compiling the same regex
  without Unicode support, e.g., `(?-u)\w{3}`, takes under 1 millisecond and
  less than 5KB of memory. For this reason, you should only use Unicode
  character classes if you absolutely need them!
* This crate does not support regex sets.
* This crate does not support zero-width assertions such as `^`, `$`, `\b` or
  `\B`.
* As a lower level crate, this library does not do literal optimizations. In
  exchange, you get predictable performance regardless of input. The
  philosophy here is that literal optimizations should be applied at a higher
  level, although there is no easy support for this in the ecosystem yet.
* There is no `&str` API like in the regex crate. In this crate, all APIs
  operate on `&[u8]`. By default, match indices are guaranteed to fall on
  UTF-8 boundaries, unless `RegexBuilder::allow_invalid_utf8` is enabled.

With some of the downsides out of the way, here are some positive differences:

* Both dense and sparse DFAs can be serialized to raw bytes, and then cheaply
  deserialized. Deserialization always takes constant time since searching can
  be performed directly on the raw serialized bytes of a DFA.
* This crate was specifically designed so that the searching phase of a DFA has
  minimal runtime requirements, and can therefore be used in `no_std`
  environments. While `no_std` environments cannot compile regexes, they can
  deserialize pre-compiled regexes.
* Since this crate builds DFAs ahead of time, it will generally out-perform
  the `regex` crate on equivalent tasks. The performance difference is likely
  not large. However, because of a complex set of optimizations in the regex
  crate (like literal optimizations), an accurate performance comparison may be
  difficult to do.
* Sparse DFAs provide a way to build a DFA ahead of time that sacrifices search
  performance a small amount, but uses much less storage space. Potentially
  even less than what the regex crate uses.
* This crate exposes DFAs directly, such as `DenseDFA` and `SparseDFA`,
  which enables one to do less work in some cases. For example, if you only
  need the end of a match and not the start of a match, then you can use a DFA
  directly without building a `Regex`, which always requires a second DFA to
  find the start of a match.
* Aside from choosing between dense and sparse DFAs, there are several options
  for configuring the space usage vs search time trade off. These range from
  choosing a smaller state identifier representation and premultiplying state
  identifiers to splitting a DFA's alphabet into equivalence classes. Finally,
  DFA minimization is also provided, but can increase compilation times
  dramatically.


### Future work

* Look into being smarter about generating NFA states for large Unicode
  character classes. These can create a lot of additional work for both the
  determinizer and the minimizer, and I suspect this is the key thing we'll
  want to improve if we want to make DFA compile times faster. I *believe*
  it's possible to potentially build minimal or nearly minimal NFAs for the
  special case of Unicode character classes by leveraging Daciuk's algorithms
  for building minimal automata in linear time for sets of strings. See
  https://blog.burntsushi.net/transducers/#construction for more details. The
  key adaptation I think we need to make is to modify the algorithm to operate
  on byte ranges instead of enumerating every codepoint in the set. Otherwise,
  it might not be worth doing.
* Add support for regex sets. It should be possible to do this by "simply"
  introducing more match states. I think we can also report the positions at
  each match, similar to how Aho-Corasick works. I think the long pole in the
  tent here is probably the API design work and arranging it so that we don't
  introduce extra overhead into the non-regex-set case without duplicating a
  lot of code. It seems doable.
* Stretch goal: support capturing groups by implementing "tagged" DFAs
  (transducers). Laurikari's paper is the usual reference here, but Trofimovich
  has a much more thorough treatment:
  https://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf
  I've only read the paper once. I suspect it will require at least a few more
  read-throughs before I understand it.
  See also: https://re2c.org
* Possibly less ambitious goal: can we select a portion of Trofimovich's work
  to make small fixed length look-around work? It would be really nice to
  support `^`, `$` and `\b`, especially the Unicode variant of `\b` and a
  CRLF aware `$`.
* Experiment with generating Rust code. There is an early experiment in
  src/codegen.rs that is thoroughly bit-rotted. At the time, I was
  experimenting with whether or not codegen would significantly decrease the
  size of a DFA, since if you squint hard enough, it's kind of like a sparse
  representation. However, it didn't shrink as much as I thought it would, so
  I gave up. The other problem is that Rust doesn't support gotos, so I don't
  even know whether the "match on each state in a loop" approach will be fast
  enough. Either way, it's probably a good option to have. For one thing, it
  would be endian independent, whereas the serialization format of the DFAs in
  this crate is endian dependent (so you need two versions of every DFA, but
  you only need to compile one of them for any given arch).
* Experiment with unrolling the match loops and fill out the benchmarks.
* Add some kind of streaming API. I believe users of the library can already
  implement something for this outside of the crate, but it would be good to
  provide an official API. The key thing here is figuring out the API. I
  suspect we might want to support several variants.
* Make a decision on whether or not there is room for literal optimizations
  in this crate. My original intent was to not let this crate sink down into
  that very very very deep rabbit hole. But instead, we might want to provide
  some way for literal optimizations to hook into the match routines. The right
  path forward here is to probably build something outside of the crate and
  then see about integrating it. After all, users can implement their own
  match routines just as efficiently as what the crate provides.
* A key downside of DFAs is that they can take up a lot of memory and can be
  quite costly to build. Their worst case compilation time is O(2^n), where
  n is the number of NFA states. A paper by Yang and Prasanna (2011) actually
  seems to provide a way to characterize state blow up such that it is
  detectable. If we could know whether a regex will exhibit state explosion or
  not, then we could make an intelligent decision about whether to
  ahead-of-time compile a DFA.