Your friendly guide to hacking and navigating the regex library.

This guide assumes familiarity with Rust and Cargo, and at least a perusal of
the user-facing documentation for this crate.

If you're looking for background on the implementation in this library, then
you can do no better than Russ Cox's article series on implementing regular
expressions using finite automata: https://swtch.com/~rsc/regexp/


## Architecture overview

As you probably already know, this library executes regular expressions using
finite automata. In particular, a design goal is to make searching linear
with respect to both the regular expression and the text being searched.
Meeting that design goal on its own is not so hard and can be done with an
implementation of the Pike VM (similar to Thompson's construction, but with
support for capturing groups), as described in:
https://swtch.com/~rsc/regexp/regexp2.html. This library contains such an
implementation in src/pikevm.rs.
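
To make that lock-step execution concrete, here is a minimal, self-contained
Pike VM sketch for a toy instruction set (no capture groups, and recursion
stands in for the real engine's explicit epsilon-closure handling); the actual
implementation lives in src/pikevm.rs:

```
#[derive(Clone, Copy)]
enum Inst {
    Char(char, usize),   // match this char, then go to the given pc
    Split(usize, usize), // epsilon transition: fork into two threads
    Match,
}

/// Add a thread at `pc`, eagerly following epsilon transitions so that
/// `list` only ever holds threads waiting on an input character.
fn add_thread(prog: &[Inst], list: &mut Vec<bool>, pc: usize) {
    if list[pc] {
        return; // already queued: this dedup is what keeps runtime linear
    }
    list[pc] = true;
    if let Inst::Split(x, y) = prog[pc] {
        add_thread(prog, list, x);
        add_thread(prog, list, y);
    }
}

fn is_match(prog: &[Inst], input: &str) -> bool {
    let mut clist = vec![false; prog.len()];
    add_thread(prog, &mut clist, 0);
    for ch in input.chars() {
        // The VM is in *all* states in `clist` at once; advance them in
        // lock step on `ch`, collecting successors into `nlist`.
        let mut nlist = vec![false; prog.len()];
        for pc in 0..prog.len() {
            if !clist[pc] {
                continue;
            }
            match prog[pc] {
                Inst::Char(c, goto) if c == ch => add_thread(prog, &mut nlist, goto),
                Inst::Match => return true,
                _ => {}
            }
        }
        clist = nlist;
    }
    clist
        .iter()
        .enumerate()
        .any(|(pc, &on)| on && matches!(prog[pc], Inst::Match))
}

fn main() {
    // The compiled form of `a|b` (cf. the Compilation section below):
    let prog = [
        Inst::Split(1, 2),
        Inst::Char('a', 3),
        Inst::Char('b', 3),
        Inst::Match,
    ];
    assert!(is_match(&prog, "b"));
    assert!(!is_match(&prog, "c"));
}
```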

Making it fast is harder. One of the key problems with the Pike VM is that it
can be in more than one state at any point in time, and must shuffle capture
positions between them. The Pike VM also spends a lot of time following the
same epsilon transitions over and over again. We can employ one trick to
speed up the Pike VM: extract one or more literal prefixes from the regular
expression and execute specialized code to quickly find matches of those
prefixes in the search text. The Pike VM can then be avoided for most of the
search, and instead only executed when a prefix is found. The code to find
prefixes is in the regex-syntax crate (in this repository). The code to search
for literals is in src/literals.rs. When more than one literal prefix is found,
we fall back to an Aho-Corasick DFA using the aho-corasick crate. For one
literal, we use a variant of the Boyer-Moore algorithm. Both Aho-Corasick and
Boyer-Moore use `memchr` when appropriate. The Boyer-Moore variant in this
library also uses elementary frequency analysis to choose the right byte to run
`memchr` with.
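
The shape of that optimization looks roughly like the following sketch. It
assumes the `memchr` crate and a hypothetical `pike_vm_search` helper standing
in for the full engine; the real hand-off logic lives in src/literals.rs and
src/exec.rs:

```
use memchr::memchr;

/// Find a match by skipping ahead to candidate prefix positions with
/// `memchr`, only invoking the (slow) full engine at those candidates.
fn find_with_prefix(prefix: &[u8], haystack: &[u8]) -> Option<(usize, usize)> {
    let first = *prefix.first()?;
    let mut at = 0;
    while at < haystack.len() {
        // Vectorized scan for the prefix's first byte.
        let i = at + memchr(first, &haystack[at..])?;
        if haystack[i..].starts_with(prefix) {
            if let Some((s, e)) = pike_vm_search(&haystack[i..]) {
                return Some((i + s, i + e));
            }
        }
        at = i + 1;
    }
    None
}

// Hypothetical stand-in for the real engine; see src/pikevm.rs.
fn pike_vm_search(_haystack: &[u8]) -> Option<(usize, usize)> {
    unimplemented!()
}
```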

Of course, detecting prefix literals can only take us so far. Not all regular
expressions have literal prefixes. To remedy this, we try another approach
to executing the Pike VM: backtracking, whose implementation can be found in
src/backtrack.rs. One reason why backtracking can be faster is that it avoids
excessive shuffling of capture groups. Of course, backtracking is susceptible
to exponential runtimes, so we keep track of every state we've visited to make
sure we never visit it again. This guarantees linear time execution, but we
pay for it with the memory required to track visited states. Because of the
memory requirement, we only use this engine on small search strings *and* small
regular expressions.
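
The visited-state bookkeeping is essentially a bitmap keyed by (instruction,
input position). A sketch of that idea, with hypothetical names (the real
bookkeeping is in src/backtrack.rs):

```
/// One bit per (instruction, input position) pair. The memory cost is
/// `ninsts * (text_len + 1)` bits, which is why this engine is only used
/// for small regexes on small inputs.
struct Visited {
    bits: Vec<u64>,
    ninsts: usize,
}

impl Visited {
    fn new(ninsts: usize, text_len: usize) -> Visited {
        let slots = ninsts * (text_len + 1);
        Visited { bits: vec![0; (slots + 63) / 64], ninsts }
    }

    /// Returns true the first time (pc, at) is seen and false on a revisit,
    /// which tells the backtracker to prune that branch. This bounds the
    /// total work to one visit per slot, i.e., linear time.
    fn insert(&mut self, pc: usize, at: usize) -> bool {
        let k = at * self.ninsts + pc;
        let (word, bit) = (k / 64, k % 64);
        let seen = self.bits[word] & (1 << bit) != 0;
        self.bits[word] |= 1 << bit;
        !seen
    }
}
```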

Lastly, the real workhorse of this library is the "lazy" DFA in src/dfa.rs.
It is distinct from the Pike VM in that the DFA is explicitly represented in
memory and is only ever in one state at a time. It is said to be "lazy" because
the DFA is computed as text is searched, where each byte in the search text
results in at most one new DFA state. It is made fast by caching states. DFAs
are susceptible to exponential state blow up (where the worst case is computing
a new state for every input byte, regardless of what's in the state cache). To
avoid using a lot of memory, the lazy DFA uses a bounded cache. Once the cache
is full, it is wiped and state computation starts over again. If the cache is
wiped too frequently, then the DFA gives up and searching falls back to one of
the aforementioned algorithms.
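
The core loop has roughly this shape. Everything here is hypothetical and
heavily simplified (the real code in src/dfa.rs packs states far more
compactly), but it shows the cache-hit fast path, lazy state construction,
and the wipe-when-full policy:

```
use std::collections::HashMap;

type StateId = usize;
type NfaStateSet = Vec<usize>; // a sorted set of NFA states

struct LazyDfa {
    cache: HashMap<NfaStateSet, StateId>,
    trans: Vec<[Option<StateId>; 256]>,
    states: Vec<NfaStateSet>,
    max_states: usize,
}

impl LazyDfa {
    /// Fetch or lazily compute the transition out of `cur` on `byte`.
    /// Returns None when the cache was wiped, signaling the caller to
    /// restart or, if wipes are too frequent, fall back to another engine.
    fn next(&mut self, cur: StateId, byte: u8) -> Option<StateId> {
        if let Some(next) = self.trans[cur][byte as usize] {
            return Some(next); // cache hit: the common, fast path
        }
        // Cache miss: run the NFA one step to build the new DFA state.
        let set = nfa_step(&self.states[cur], byte);
        if self.states.len() >= self.max_states {
            self.cache.clear();
            self.trans.clear();
            self.states.clear();
            return None;
        }
        let next = match self.cache.get(&set) {
            Some(&id) => id,
            None => {
                let id = self.states.len();
                self.cache.insert(set.clone(), id);
                self.states.push(set);
                self.trans.push([None; 256]);
                id
            }
        };
        self.trans[cur][byte as usize] = Some(next);
        Some(next)
    }
}

// Hypothetical stand-in: advance every NFA state in `set` on `byte`.
fn nfa_step(_set: &NfaStateSet, _byte: u8) -> NfaStateSet {
    unimplemented!()
}
```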

All of the above matching engines expose precisely the same matching semantics.
This is indeed tested. (See the section below about testing.)

The following sub-sections describe the rest of the library and how each of the
matching engines is actually used.

### Parsing

Regular expressions are parsed using the regex-syntax crate, which is
maintained in this repository. The regex-syntax crate defines an abstract
syntax and provides very detailed error messages when a parse error is
encountered. Parsing is done in a separate crate so that others may benefit
from its existence, and because it is relatively divorced from the rest of the
regex library.

The regex-syntax crate also provides sophisticated support for extracting
prefix and suffix literals from regular expressions.

### Compilation

The compiler is in src/compile.rs. The input to the compiler is some abstract
syntax for a regular expression and the output is a sequence of opcodes that
matching engines use to execute a search. (One can think of matching engines as
mini virtual machines.) The sequence of opcodes is a particular encoding of a
non-deterministic finite automaton. In particular, the opcodes explicitly rely
on epsilon transitions.

Consider a simple regular expression like `a|b`. Its compiled form looks like
this:

    000 Save(0)
    001 Split(2, 3)
    002 'a' (goto: 4)
    003 'b'
    004 Save(1)
    005 Match

The first column is the instruction pointer and the second column is the
instruction. Save instructions indicate that the current position in the input
should be stored in a captured location. Split instructions represent a binary
branch in the program (i.e., epsilon transitions). The instructions `'a'` and
`'b'` indicate that the literal bytes `'a'` or `'b'` should match.

In older versions of this library, the compilation looked like this:

    000 Save(0)
    001 Split(2, 3)
    002 'a'
    003 Jump(5)
    004 'b'
    005 Save(1)
    006 Match

In particular, empty instructions that merely served to move execution from one
point in the program to another were removed. Instead, every instruction has a
`goto` pointer embedded into it. This resulted in a small performance boost for
the Pike VM, because it was one fewer epsilon transition that it had to follow.
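
In code, that embedding looks something like the following (hypothetical
names; the real instruction set is defined in src/prog.rs):

```
type InstPtr = usize;

enum Inst {
    /// Store the current input position in capture slot `slot`.
    Save { slot: usize, goto: InstPtr },
    /// An epsilon transition: try `goto1` first, then `goto2`.
    Split { goto1: InstPtr, goto2: InstPtr },
    /// Match a single character, then continue at `goto`.
    Char { c: char, goto: InstPtr },
    /// The regex matched.
    Match,
}

/// The compiled form of `a|b` from above, with every `goto` explicit, so
/// no separate Jump instruction is needed.
fn alt_a_b() -> Vec<Inst> {
    vec![
        Inst::Save { slot: 0, goto: 1 },
        Inst::Split { goto1: 2, goto2: 3 },
        Inst::Char { c: 'a', goto: 4 },
        Inst::Char { c: 'b', goto: 4 },
        Inst::Save { slot: 1, goto: 5 },
        Inst::Match,
    ]
}
```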

There exist more instructions and they are defined and documented in
src/prog.rs.

Compilation has several knobs and a few unfortunately complicated invariants.
Namely, the output of compilation can be one of two types of programs: a
program that executes on Unicode scalar values or a program that executes
on raw bytes. In the former case, the matching engine is responsible for
performing UTF-8 decoding and executing instructions using Unicode codepoints.
In the latter case, the program handles UTF-8 decoding implicitly, so that the
matching engine can execute on raw bytes. All matching engines can execute
either Unicode or byte based programs except for the lazy DFA, which requires
byte based programs. In general, both representations were kept because (1) the
lazy DFA requires byte based programs so that states can be encoded in a memory
efficient manner and (2) the Pike VM benefits greatly from inlining Unicode
character classes into fewer instructions, as it results in fewer epsilon
transitions.

N.B. UTF-8 decoding is built into the compiled program by making use of the
utf8-ranges crate. The compiler in this library factors out common suffixes to
reduce the size of huge character classes (e.g., `\pL`).
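
You can see what those byte based ranges look like by querying the
utf8-ranges crate directly. A small example (output shown from memory, so the
exact ranges may differ):

```
extern crate utf8_ranges;

use utf8_ranges::Utf8Sequences;

fn main() {
    // Every codepoint in the Basic Multilingual Plane, expressed as
    // alternations of byte ranges that a byte based program can execute:
    for seq in Utf8Sequences::new('\u{0}', '\u{FFFF}') {
        println!("{:?}", seq);
    }
    // Prints something like:
    //   [0-7F]
    //   [C2-DF][80-BF]
    //   [E0][A0-BF][80-BF]
    //   ...
}
```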

A regrettable consequence of this split in instruction sets is that we
generally need to compile two programs: one for NFA execution and one for the
lazy DFA.

In fact, it is worse than that: the lazy DFA is not capable of finding the
starting location of a match in a single scan, and must instead execute a
backwards search after finding the end location. To execute a backwards search,
we must have compiled the regular expression *in reverse*.

This means that every compilation of a regular expression generally results in
three distinct programs. It would be possible to lazily compile the Unicode
program, since it is never needed if (1) the regular expression uses no word
boundary assertions and (2) the caller never asks for sub-capture locations.
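
Put together, a full-match search with the lazy DFA is a two-pass affair,
roughly like this sketch (the helper functions are hypothetical stand-ins for
the forward and reverse DFA runs orchestrated in src/exec.rs):

```
/// Find the span of a match using two DFA passes.
fn find(haystack: &[u8]) -> Option<(usize, usize)> {
    // Pass 1: a forward scan can only report where the match *ends*.
    let end = forward_dfa_find_end(haystack)?;
    // Pass 2: scan backwards from that end with a program compiled from
    // the reversed regex to recover where the match *starts*.
    let start = reverse_dfa_find_start(&haystack[..end]);
    Some((start, end))
}

fn forward_dfa_find_end(_haystack: &[u8]) -> Option<usize> {
    unimplemented!()
}

fn reverse_dfa_find_start(_haystack: &[u8]) -> usize {
    unimplemented!()
}
```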

### Execution

At the time of writing, there are four matching engines in this library:

1. The Pike VM (supports captures).
2. Bounded backtracking (supports captures).
3. Literal substring or multi-substring search.
4. Lazy DFA (no support for Unicode word boundary assertions).

Only the first two matching engines are capable of executing every regular
expression program. They also happen to be the slowest, which means we need
some logic that (1) knows various facts about the regular expression and (2)
knows what the caller wants. Using this information, we can determine which
engine (or engines) to use.

The logic for choosing which engine to execute is in src/exec.rs and is
documented on the Exec type. Exec values contain regular expression Programs
(defined in src/prog.rs), which contain all the necessary tidbits for actually
executing a regular expression on search text.

For the most part, the execution logic is straightforward and follows the
limitations of each engine described above pretty faithfully. The hairiest
part of src/exec.rs by far is the execution of the lazy DFA, since it requires
a forwards and backwards search, and then falls back to either the Pike VM or
backtracking if the caller requested capture locations.
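
In caricature, the decision looks something like this (all names here are
hypothetical; the real logic on the Exec type weighs many more details):

```
const BACKTRACK_LIMIT: usize = 256 * 1024; // hypothetical visited-set budget

struct ExecInfo {
    is_pure_literal: bool,
    has_unicode_word_boundary: bool,
    num_insts: usize,
}

enum MatchKind { PositionOnly, WithCaptures }

enum Engine { Literal, LazyDfa, LazyDfaThenNfa, BoundedBacktracker, PikeVm }

fn choose_engine(re: &ExecInfo, text: &[u8], kind: MatchKind) -> Engine {
    if re.is_pure_literal {
        Engine::Literal
    } else if !re.has_unicode_word_boundary {
        match kind {
            // The DFA alone can report match positions...
            MatchKind::PositionOnly => Engine::LazyDfa,
            // ...but capture locations need an NFA engine afterwards.
            MatchKind::WithCaptures => Engine::LazyDfaThenNfa,
        }
    } else if re.num_insts * (text.len() + 1) <= BACKTRACK_LIMIT {
        Engine::BoundedBacktracker
    } else {
        Engine::PikeVm
    }
}
```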

The Exec type also contains mutable scratch space for each type of matching
engine. This scratch space is used during search (for example, for the lazy
DFA, it contains compiled states that are reused on subsequent searches).

### Programs

A regular expression program is essentially a sequence of opcodes produced by
the compiler plus various facts about the regular expression (such as whether
it is anchored, its capture names, etc.).
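
Reusing the `Inst` sketch from the compilation section above, a program's
rough shape is something like this (field names hypothetical; see src/prog.rs
for the real definition):

```
struct Program {
    /// The opcodes produced by the compiler.
    insts: Vec<Inst>,
    /// Capture group names, indexed by capture slot.
    capture_names: Vec<Option<String>>,
    /// Whether the regex can only match at the beginning of the input.
    is_anchored_start: bool,
    /// Literal prefixes extracted for the literal engine.
    prefixes: Vec<Vec<u8>>,
}
```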

### The regex! macro

The `regex!` macro no longer exists. It was developed in a bygone era as a
compiler plugin during the infancy of the regex crate. Back then, the only
matching engine in the crate was the Pike VM. The `regex!` macro was, itself,
also a Pike VM. The only advantages it offered over the dynamic Pike VM that
was built at runtime were the following:

1. Syntax checking was done at compile time. Your Rust program wouldn't
   compile if your regex didn't compile.
2. Reduction of overhead that was proportional to the size of the regex.
   For the most part, this overhead consisted of heap allocation, which
   was nearly eliminated in the compiler plugin.

The main takeaway here is that the compiler plugin was a marginally faster
version of a slow regex engine. As the regex crate evolved, it grew other regex
engines (DFA, bounded backtracker) and sophisticated literal optimizations.
The `regex!` macro didn't keep pace, and it therefore became (dramatically)
slower than the dynamic engines. The only reason left to use it was the compile
time guarantee that your regex is correct. Fortunately, Clippy (the Rust lint
tool) has a lint that checks the validity of your regular expressions, which
mostly replaces that use case.

Additionally, the regex compiler plugin stopped receiving maintenance. Nobody
complained. At that point, it seemed prudent to just remove it.

Will a compiler plugin be brought back? The future is murky, but there is
definitely an opportunity there to build something that is faster than the
dynamic engines in some cases. But it will be challenging! As of now, there
are no plans to work on this.


## Testing

A key aspect of any mature regex library is its test suite. A subset of the
tests in this library come from Glenn Fowler's AT&T test suite (its online
presence seems gone at the time of writing). The source of the test suite is
located in src/testdata. The script scripts/regex-match-tests.py takes the
test suite in src/testdata and generates tests/matches.rs.

There are also many other manually crafted tests and regression tests in
tests/tests.rs. Some of these tests were taken from RE2.

The biggest source of complexity in the tests is related to answering this
question: how can we reuse the tests to check all of our matching engines? One
approach would have been to encode every test into some kind of format (like
the AT&T test suite) and generate test code for each matching engine. The
approach we use in this library is to create a Cargo.toml entry point for each
matching engine we want to test. The entry points are:

* `tests/test_default.rs` - tests `Regex::new`
* `tests/test_default_bytes.rs` - tests `bytes::Regex::new`
* `tests/test_nfa.rs` - tests `Regex::new`, forced to use the NFA
  algorithm on every regex.
* `tests/test_nfa_bytes.rs` - tests `Regex::new`, forced to use the NFA
  algorithm on every regex and use *arbitrary* byte based programs.
* `tests/test_nfa_utf8bytes.rs` - tests `Regex::new`, forced to use the NFA
  algorithm on every regex and use *UTF-8* byte based programs.
* `tests/test_backtrack.rs` - tests `Regex::new`, forced to use
  backtracking on every regex.
* `tests/test_backtrack_bytes.rs` - tests `Regex::new`, forced to use
  backtracking on every regex and use *arbitrary* byte based programs.
* `tests/test_backtrack_utf8bytes.rs` - tests `Regex::new`, forced to use
  backtracking on every regex and use *UTF-8* byte based programs.
* `tests/test_crates_regex.rs` - tests to make sure that all of the
  backends behave in the same way against a number of quickcheck
  generated random inputs. These tests need to be enabled through
  the `RUST_REGEX_RANDOM_TEST` environment variable (see below).

The lazy DFA and pure literal engines are absent from this list because
they cannot be used on every regular expression. Instead, we rely on
`tests/test_dynamic.rs` to test the lazy DFA and literal engines when possible.

Since the tests are repeated several times, and because `cargo test` runs all
entry points, it can take a while to compile everything. To reduce compile
times slightly, try using `cargo test --test default`, which will only use the
`tests/test_default.rs` entry point.

The random testing takes quite a while, so it is not enabled by default.
In order to run the random testing you can set the
`RUST_REGEX_RANDOM_TEST` environment variable to anything before
invoking `cargo test`. Note that this variable is inspected at compile
time, so if the tests don't seem to be running, you may need to run
`cargo clean`.
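
One plausible way such compile-time gating works (a sketch, not necessarily
the crate's exact mechanism): `option_env!` bakes the variable's presence into
the compiled test binary, which is exactly why a stale build needs
`cargo clean`:

```
#[test]
fn backends_agree_on_random_inputs() {
    if option_env!("RUST_REGEX_RANDOM_TEST").is_none() {
        // Skipped unless the variable was set when the tests were
        // *compiled*, not when they are run.
        return;
    }
    // ... quickcheck-driven comparison of all the backends ...
}
```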

## Benchmarking

The benchmarking in this crate is made up of many micro-benchmarks. Currently,
there are two primary sets of benchmarks: the benchmarks that were adopted
at this library's inception (in `bench/src/misc.rs`) and a newer set of
benchmarks meant to test various optimizations. Specifically, the latter set
contains some analysis and is in `bench/src/sherlock.rs`. Also, the latter
set is executed on the same lengthy input, whereas the former benchmarks
are executed on strings of varying length.

There is also a smattering of benchmarks for parsing and compilation.

Benchmarks are in a separate crate so that their dependencies can be managed
separately from the main regex crate.

Benchmarking follows a similarly wonky setup to the tests. There are multiple
entry points:

* `bench_rust.rs` - benchmarks `Regex::new`
* `bench_rust_bytes.rs` - benchmarks `bytes::Regex::new`
* `bench_pcre.rs` - benchmarks PCRE
* `bench_onig.rs` - benchmarks Oniguruma

The PCRE and Oniguruma benchmarks exist as a comparison point against mature
regular expression libraries. In general, this regex library compares favorably
(there are even a few benchmarks that PCRE simply runs too slowly on or
outright can't execute at all). I would love to add benchmarks for other
regular expression libraries (especially RE2).

If you're hacking on one of the matching engines and just want to see
benchmarks, then all you need to run is:

    $ (cd bench && ./run rust)

If you want to compare your results with older benchmarks, then try:

    $ (cd bench && ./run rust | tee old)
    $ ... make it faster
    $ (cd bench && ./run rust | tee new)
    $ cargo benchcmp old new --improvements

The `cargo-benchcmp` utility is available here:
https://github.com/BurntSushi/cargo-benchcmp

The `./bench/run` utility can run benchmarks for PCRE and Oniguruma too. See
`./bench/bench --help`.

## Dev Docs

When digging your teeth into the codebase for the first time, the
crate documentation can be a great resource. By default `rustdoc`
will strip out all documentation of private crate members in an
effort to help consumers of the crate focus on the *interface*
without having to concern themselves with the *implementation*.
Normally this is a great thing, but if you want to start hacking
on regex internals it is not what you want. Many of the private members
of this crate are well documented with rustdoc style comments, and
it would be a shame to miss out on the opportunity they present.
You can generate the private docs with:

```
$ rustdoc --crate-name docs src/lib.rs -o target/doc -L target/debug/deps --no-defaults --passes collapse-docs --passes unindent-comments
```

Then just point your browser at `target/doc/regex/index.html`.

See https://github.com/rust-lang/rust/issues/15347 for more info
about generating developer docs for internal use.