| Your friendly guide to hacking and navigating the regex library. |
| |
| This guide assumes familiarity with Rust and Cargo, and at least a perusal of |
| the user facing documentation for this crate. |
| |
| If you're looking for background on the implementation in this library, then |
| you can do no better than Russ Cox's article series on implementing regular |
| expressions using finite automata: https://swtch.com/~rsc/regexp/ |
| |
| |
| ## Architecture overview |
| |
| As you probably already know, this library executes regular expressions using |
| finite automata. In particular, a design goal is to make searching linear |
| with respect to both the regular expression and the text being searched. |
| Meeting that design goal on its own is not so hard and can be done with an |
| implementation of the Pike VM (similar to Thompson's construction, but supports |
| capturing groups), as described in: https://swtch.com/~rsc/regexp/regexp2.html |
| --- This library contains such an implementation in src/pikevm.rs. |
| |
| Making it fast is harder. One of the key problems with the Pike VM is that it |
| can be in more than one state at any point in time, and must shuffle capture |
| positions between them. The Pike VM also spends a lot of time following the |
| same epsilon transitions over and over again. We can employ one trick to |
| speed up the Pike VM: extract one or more literal prefixes from the regular |
| expression and execute specialized code to quickly find matches of those |
| prefixes in the search text. The Pike VM can then be avoided for most the |
| search, and instead only executed when a prefix is found. The code to find |
| prefixes is in the regex-syntax crate (in this repository). The code to search |
| for literals is in src/literals.rs. When more than one literal prefix is found, |
| we fall back to an Aho-Corasick DFA using the aho-corasick crate. For one |
| literal, we use a variant of the Boyer-Moore algorithm. Both Aho-Corasick and |
| Boyer-Moore use `memchr` when appropriate. The Boyer-Moore variant in this |
| library also uses elementary frequency analysis to choose the right byte to run |
| `memchr` with. |
| |
| Of course, detecting prefix literals can only take us so far. Not all regular |
| expressions have literal prefixes. To remedy this, we try another approach |
| to executing the Pike VM: backtracking, whose implementation can be found in |
| src/backtrack.rs. One reason why backtracking can be faster is that it avoids |
| excessive shuffling of capture groups. Of course, backtracking is susceptible |
| to exponential runtimes, so we keep track of every state we've visited to make |
| sure we never visit it again. This guarantees linear time execution, but we |
| pay for it with the memory required to track visited states. Because of the |
| memory requirement, we only use this engine on small search strings *and* small |
| regular expressions. |
| |
| Lastly, the real workhorse of this library is the "lazy" DFA in src/dfa.rs. |
| It is distinct from the Pike VM in that the DFA is explicitly represented in |
| memory and is only ever in one state at a time. It is said to be "lazy" because |
| the DFA is computed as text is searched, where each byte in the search text |
| results in at most one new DFA state. It is made fast by caching states. DFAs |
| are susceptible to exponential state blow up (where the worst case is computing |
| a new state for every input byte, regardless of what's in the state cache). To |
| avoid using a lot of memory, the lazy DFA uses a bounded cache. Once the cache |
| is full, it is wiped and state computation starts over again. If the cache is |
| wiped too frequently, then the DFA gives up and searching falls back to one of |
| the aforementioned algorithms. |
| |
| All of the above matching engines expose precisely the same matching semantics. |
| This is indeed tested. (See the section below about testing.) |
| |
| The following sub-sections describe the rest of the library and how each of the |
| matching engines are actually used. |
| |
| ### Parsing |
| |
| Regular expressions are parsed using the regex-syntax crate, which is |
| maintained in this repository. The regex-syntax crate defines an abstract |
| syntax and provides very detailed error messages when a parse error is |
| encountered. Parsing is done in a separate crate so that others may benefit |
| from its existence, and because it is relatively divorced from the rest of the |
| regex library. |
| |
| The regex-syntax crate also provides sophisticated support for extracting |
| prefix and suffix literals from regular expressions. |
| |
| ### Compilation |
| |
| The compiler is in src/compile.rs. The input to the compiler is some abstract |
| syntax for a regular expression and the output is a sequence of opcodes that |
| matching engines use to execute a search. (One can think of matching engines as |
| mini virtual machines.) The sequence of opcodes is a particular encoding of a |
| non-deterministic finite automaton. In particular, the opcodes explicitly rely |
| on epsilon transitions. |
| |
| Consider a simple regular expression like `a|b`. Its compiled form looks like |
| this: |
| |
| 000 Save(0) |
| 001 Split(2, 3) |
| 002 'a' (goto: 4) |
| 003 'b' |
| 004 Save(1) |
| 005 Match |
| |
| The first column is the instruction pointer and the second column is the |
| instruction. Save instructions indicate that the current position in the input |
| should be stored in a captured location. Split instructions represent a binary |
| branch in the program (i.e., epsilon transitions). The instructions `'a'` and |
| `'b'` indicate that the literal bytes `'a'` or `'b'` should match. |
| |
| In older versions of this library, the compilation looked like this: |
| |
| 000 Save(0) |
| 001 Split(2, 3) |
| 002 'a' |
| 003 Jump(5) |
| 004 'b' |
| 005 Save(1) |
| 006 Match |
| |
| In particular, empty instructions that merely served to move execution from one |
| point in the program to another were removed. Instead, every instruction has a |
| `goto` pointer embedded into it. This resulted in a small performance boost for |
| the Pike VM, because it was one fewer epsilon transition that it had to follow. |
| |
| There exist more instructions and they are defined and documented in |
| src/prog.rs. |
| |
| Compilation has several knobs and a few unfortunately complicated invariants. |
| Namely, the output of compilation can be one of two types of programs: a |
| program that executes on Unicode scalar values or a program that executes |
| on raw bytes. In the former case, the matching engine is responsible for |
| performing UTF-8 decoding and executing instructions using Unicode codepoints. |
| In the latter case, the program handles UTF-8 decoding implicitly, so that the |
| matching engine can execute on raw bytes. All matching engines can execute |
| either Unicode or byte based programs except for the lazy DFA, which requires |
| byte based programs. In general, both representations were kept because (1) the |
| lazy DFA requires byte based programs so that states can be encoded in a memory |
| efficient manner and (2) the Pike VM benefits greatly from inlining Unicode |
| character classes into fewer instructions as it results in fewer epsilon |
| transitions. |
| |
| N.B. UTF-8 decoding is built into the compiled program by making use of the |
| utf8-ranges crate. The compiler in this library factors out common suffixes to |
| reduce the size of huge character classes (e.g., `\pL`). |
| |
| A regrettable consequence of this split in instruction sets is we generally |
| need to compile two programs; one for NFA execution and one for the lazy DFA. |
| |
| In fact, it is worse than that: the lazy DFA is not capable of finding the |
| starting location of a match in a single scan, and must instead execute a |
| backwards search after finding the end location. To execute a backwards search, |
| we must have compiled the regular expression *in reverse*. |
| |
| This means that every compilation of a regular expression generally results in |
| three distinct programs. It would be possible to lazily compile the Unicode |
| program, since it is never needed if (1) the regular expression uses no word |
| boundary assertions and (2) the caller never asks for sub-capture locations. |
| |
| ### Execution |
| |
| At the time of writing, there are four matching engines in this library: |
| |
| 1. The Pike VM (supports captures). |
| 2. Bounded backtracking (supports captures). |
| 3. Literal substring or multi-substring search. |
| 4. Lazy DFA (no support for Unicode word boundary assertions). |
| |
| Only the first two matching engines are capable of executing every regular |
| expression program. They also happen to be the slowest, which means we need |
| some logic that (1) knows various facts about the regular expression and (2) |
| knows what the caller wants. Using this information, we can determine which |
| engine (or engines) to use. |
| |
| The logic for choosing which engine to execute is in src/exec.rs and is |
| documented on the Exec type. Exec values contain regular expression Programs |
| (defined in src/prog.rs), which contain all the necessary tidbits for actually |
| executing a regular expression on search text. |
| |
| For the most part, the execution logic is straight-forward and follows the |
| limitations of each engine described above pretty faithfully. The hairiest |
| part of src/exec.rs by far is the execution of the lazy DFA, since it requires |
| a forwards and backwards search, and then falls back to either the Pike VM or |
| backtracking if the caller requested capture locations. |
| |
| The Exec type also contains mutable scratch space for each type of matching |
| engine. This scratch space is used during search (for example, for the lazy |
| DFA, it contains compiled states that are reused on subsequent searches). |
| |
| ### Programs |
| |
| A regular expression program is essentially a sequence of opcodes produced by |
| the compiler plus various facts about the regular expression (such as whether |
| it is anchored, its capture names, etc.). |
| |
| ### The regex! macro (or why `regex::internal` exists) |
| |
| The `regex!` macro is defined in the `regex_macros` crate as a compiler plugin, |
| which is maintained in this repository. The `regex!` macro compiles a regular |
| expression at compile time into specialized Rust code. |
| |
| The `regex!` macro was written when this library was first conceived and |
| unfortunately hasn't changed much since then. In particular, it encodes the |
| entire Pike VM into stack allocated space (no heap allocation is done). When |
| `regex!` was first written, this provided a substantial speed boost over |
| so-called "dynamic" regexes compiled at runtime, and in particular had much |
| lower overhead per match. This was because the only matching engine at the |
| time was the Pike VM. The addition of other matching engines has inverted |
| the relationship; the `regex!` macro is almost never faster than the dynamic |
| variant. (In fact, it is typically substantially slower.) |
| |
| In order to build the `regex!` macro this way, it must have access to some |
| internals of the regex library, which is in a distinct crate. (Compiler plugins |
| must be part of a distinct crate.) Namely, it must be able to compile a regular |
| expression and access its opcodes. The necessary internals are exported as part |
| of the top-level `internal` module in the regex library, but is hidden from |
| public documentation. In order to present a uniform API between programs build |
| by the `regex!` macro and their dynamic analoges, the `Regex` type is an enum |
| whose variants are hidden from public documentation. |
| |
| In the future, the `regex!` macro should probably work more like Ragel, but |
| it's not clear how hard this is. In particular, the `regex!` macro should be |
| able to support all the features of dynamic regexes, which may be hard to do |
| with a Ragel-style implementation approach. (Which somewhat suggests that the |
| `regex!` macro may also need to grow conditional execution logic like the |
| dynamic variants, which seems rather grotesque.) |
| |
| |
| ## Testing |
| |
| A key aspect of any mature regex library is its test suite. A subset of the |
| tests in this library come from Glenn Fowler's AT&T test suite (its online |
| presence seems gone at the time of writing). The source of the test suite is |
| located in src/testdata. The scripts/regex-match-tests.py takes the test suite |
| in src/testdata and generates tests/matches.rs. |
| |
| There are also many other manually crafted tests and regression tests in |
| tests/tests.rs. Some of these tests were taken from RE2. |
| |
| The biggest source of complexity in the tests is related to answering this |
| question: how can we reuse the tests to check all of our matching engines? One |
| approach would have been to encode every test into some kind of format (like |
| the AT&T test suite) and code generate tests for each matching engine. The |
| approach we use in this library is to create a Cargo.toml entry point for each |
| matching engine we want to test. The entry points are: |
| |
| * `tests/test_plugin.rs` - tests the `regex!` macro |
| * `tests/test_default.rs` - tests `Regex::new` |
| * `tests/test_default_bytes.rs` - tests `bytes::Regex::new` |
| * `tests/test_nfa.rs` - tests `Regex::new`, forced to use the NFA |
| algorithm on every regex. |
| * `tests/test_nfa_bytes.rs` - tests `Regex::new`, forced to use the NFA |
| algorithm on every regex and use *arbitrary* byte based programs. |
| * `tests/test_nfa_utf8bytes.rs` - tests `Regex::new`, forced to use the NFA |
| algorithm on every regex and use *UTF-8* byte based programs. |
| * `tests/test_backtrack.rs` - tests `Regex::new`, forced to use |
| backtracking on every regex. |
| * `tests/test_backtrack_bytes.rs` - tests `Regex::new`, forced to use |
| backtracking on every regex and use *arbitrary* byte based programs. |
| * `tests/test_backtrack_utf8bytes.rs` - tests `Regex::new`, forced to use |
| backtracking on every regex and use *UTF-8* byte based programs. |
| |
| The lazy DFA and pure literal engines are absent from this list because |
| they cannot be used on every regular expression. Instead, we rely on |
| `tests/test_dynamic.rs` to test the lazy DFA and literal engines when possible. |
| |
| Since the tests are repeated several times, and because `cargo test` runs all |
| entry points, it can take a while to compile everything. To reduce compile |
| times slightly, try using `cargo test --test default`, which will only use the |
| `tests/test_default.rs` entry point. |
| |
| N.B. To run tests for the `regex!` macro, use: |
| |
| cargo test --manifest-path regex_macros/Cargo.toml |
| |
| |
| ## Benchmarking |
| |
| The benchmarking in this crate is made up of many micro-benchmarks. Currently, |
| there are two primary sets of benchmarks: the benchmarks that were adopted |
| at this library's inception (in `benches/src/misc.rs`) and a newer set of |
| benchmarks meant to test various optimizations. Specifically, the latter set |
| contain some analysis and are in `benches/src/sherlock.rs`. Also, the latter |
| set are all executed on the same lengthy input whereas the former benchmarks |
| are executed on strings of varying length. |
| |
| There is also a smattering of benchmarks for parsing and compilation. |
| |
| Benchmarks are in a separate crate so that its dependencies can be managed |
| separately from the main regex crate. |
| |
| Benchmarking follows a similarly wonky setup as tests. There are multiple entry |
| points: |
| |
| * `bench_rust_plugin.rs` - benchmarks the `regex!` macro |
| * `bench_rust.rs` - benchmarks `Regex::new` |
| * `bench_rust_bytes.rs` benchmarks `bytes::Regex::new` |
| * `bench_pcre.rs` - benchmarks PCRE |
| * `bench_onig.rs` - benchmarks Oniguruma |
| |
| The PCRE and Oniguruma benchmarks exist as a comparison point to a mature |
| regular expression library. In general, this regex library compares favorably |
| (there are even a few benchmarks that PCRE simply runs too slowly on or |
| outright can't execute at all). I would love to add other regular expression |
| library benchmarks (especially RE2). |
| |
| If you're hacking on one of the matching engines and just want to see |
| benchmarks, then all you need to run is: |
| |
| $ ./run-bench rust |
| |
| If you want to compare your results with older benchmarks, then try: |
| |
| $ ./run-bench rust | tee old |
| $ ... make it faster |
| $ ./run-bench rust | tee new |
| $ cargo-benchcmp old new --improvements |
| |
| The `cargo-benchcmp` utility is available here: |
| https://github.com/BurntSushi/cargo-benchcmp |
| |
| The `run-bench` utility can run benchmarks for PCRE and Oniguruma too. See |
| `./run-bench --help`. |