Your friendly guide to hacking and navigating the regex library.

This guide assumes familiarity with Rust and Cargo, and at least a perusal of
the user-facing documentation for this crate.

If you're looking for background on the implementation in this library, then
you can do no better than Russ Cox's article series on implementing regular
expressions using finite automata: https://swtch.com/~rsc/regexp/


## Architecture overview

As you probably already know, this library executes regular expressions using
finite automata. In particular, a design goal is to make searching linear
with respect to both the regular expression and the text being searched.
Meeting that design goal on its own is not so hard and can be done with an
implementation of the Pike VM (similar to Thompson's construction, but with
support for capturing groups), as described in:
https://swtch.com/~rsc/regexp/regexp2.html. This library contains such an
implementation in src/pikevm.rs.
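
To make that lock-step execution concrete, here is a minimal, self-contained
Pike VM sketch for a toy instruction set (no capture groups, and recursion
stands in for the real engine's explicit epsilon-closure handling); the actual
implementation lives in src/pikevm.rs:

```
#[derive(Clone, Copy)]
enum Inst {
    Char(char, usize),   // match this char, then go to the given pc
    Split(usize, usize), // epsilon transition: fork into two threads
    Match,
}

/// Add a thread at `pc`, eagerly following epsilon transitions so that
/// `list` only ever holds threads waiting on an input character.
fn add_thread(prog: &[Inst], list: &mut Vec<bool>, pc: usize) {
    if list[pc] {
        return; // already queued: this dedup is what keeps runtime linear
    }
    list[pc] = true;
    if let Inst::Split(x, y) = prog[pc] {
        add_thread(prog, list, x);
        add_thread(prog, list, y);
    }
}

fn is_match(prog: &[Inst], input: &str) -> bool {
    let mut clist = vec![false; prog.len()];
    add_thread(prog, &mut clist, 0);
    for ch in input.chars() {
        // The VM is in *all* states in `clist` at once; advance them in
        // lock step on `ch`, collecting successors into `nlist`.
        let mut nlist = vec![false; prog.len()];
        for pc in 0..prog.len() {
            if !clist[pc] {
                continue;
            }
            match prog[pc] {
                Inst::Char(c, goto) if c == ch => add_thread(prog, &mut nlist, goto),
                Inst::Match => return true,
                _ => {}
            }
        }
        clist = nlist;
    }
    clist
        .iter()
        .enumerate()
        .any(|(pc, &on)| on && matches!(prog[pc], Inst::Match))
}

fn main() {
    // The compiled form of `a|b` (cf. the Compilation section below):
    let prog = [
        Inst::Split(1, 2),
        Inst::Char('a', 3),
        Inst::Char('b', 3),
        Inst::Match,
    ];
    assert!(is_match(&prog, "b"));
    assert!(!is_match(&prog, "c"));
}
```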

Making it fast is harder. One of the key problems with the Pike VM is that it
can be in more than one state at any point in time, and must shuffle capture
positions between them. The Pike VM also spends a lot of time following the
same epsilon transitions over and over again. We can employ one trick to
speed up the Pike VM: extract one or more literal prefixes from the regular
expression and execute specialized code to quickly find matches of those
prefixes in the search text. The Pike VM can then be avoided for most of the
search, and instead only executed when a prefix is found. The code to find
prefixes is in the regex-syntax crate (in this repository). The code to search
for literals is in src/literals.rs. When more than one literal prefix is found,
we fall back to an Aho-Corasick DFA using the aho-corasick crate. For one
literal, we use a variant of the Boyer-Moore algorithm. Both Aho-Corasick and
Boyer-Moore use `memchr` when appropriate. The Boyer-Moore variant in this
library also uses elementary frequency analysis to choose the right byte to run
`memchr` with.
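
The shape of that optimization looks roughly like the following sketch. It
assumes the `memchr` crate and a hypothetical `pike_vm_search` helper standing
in for the full engine; the real hand-off logic lives in src/literals.rs and
src/exec.rs:

```
use memchr::memchr;

/// Find a match by skipping ahead to candidate prefix positions with
/// `memchr`, only invoking the (slow) full engine at those candidates.
fn find_with_prefix(prefix: &[u8], haystack: &[u8]) -> Option<(usize, usize)> {
    let first = *prefix.first()?;
    let mut at = 0;
    while at < haystack.len() {
        // Vectorized scan for the prefix's first byte.
        let i = at + memchr(first, &haystack[at..])?;
        if haystack[i..].starts_with(prefix) {
            if let Some((s, e)) = pike_vm_search(&haystack[i..]) {
                return Some((i + s, i + e));
            }
        }
        at = i + 1;
    }
    None
}

// Hypothetical stand-in for the real engine; see src/pikevm.rs.
fn pike_vm_search(_haystack: &[u8]) -> Option<(usize, usize)> {
    unimplemented!()
}
```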

Of course, detecting prefix literals can only take us so far. Not all regular
expressions have literal prefixes. To remedy this, we try another approach
to executing the Pike VM: backtracking, whose implementation can be found in
src/backtrack.rs. One reason why backtracking can be faster is that it avoids
excessive shuffling of capture groups. Of course, backtracking is susceptible
to exponential runtimes, so we keep track of every state we've visited to make
sure we never visit it again. This guarantees linear time execution, but we
pay for it with the memory required to track visited states. Because of the
memory requirement, we only use this engine on small search strings *and* small
regular expressions.
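
The visited-state bookkeeping is essentially a bitmap keyed by (instruction,
input position). A sketch of that idea, with hypothetical names (the real
bookkeeping is in src/backtrack.rs):

```
/// One bit per (instruction, input position) pair. The memory cost is
/// `ninsts * (text_len + 1)` bits, which is why this engine is only used
/// for small regexes on small inputs.
struct Visited {
    bits: Vec<u64>,
    ninsts: usize,
}

impl Visited {
    fn new(ninsts: usize, text_len: usize) -> Visited {
        let slots = ninsts * (text_len + 1);
        Visited { bits: vec![0; (slots + 63) / 64], ninsts }
    }

    /// Returns true the first time (pc, at) is seen and false on a revisit,
    /// which tells the backtracker to prune that branch. This bounds the
    /// total work to one visit per slot, i.e., linear time.
    fn insert(&mut self, pc: usize, at: usize) -> bool {
        let k = at * self.ninsts + pc;
        let (word, bit) = (k / 64, k % 64);
        let seen = self.bits[word] & (1 << bit) != 0;
        self.bits[word] |= 1 << bit;
        !seen
    }
}
```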

Lastly, the real workhorse of this library is the "lazy" DFA in src/dfa.rs.
It is distinct from the Pike VM in that the DFA is explicitly represented in
memory and is only ever in one state at a time. It is said to be "lazy" because
the DFA is computed as text is searched, where each byte in the search text
results in at most one new DFA state. It is made fast by caching states. DFAs
are susceptible to exponential state blow up (where the worst case is computing
a new state for every input byte, regardless of what's in the state cache). To
avoid using a lot of memory, the lazy DFA uses a bounded cache. Once the cache
is full, it is wiped and state computation starts over again. If the cache is
wiped too frequently, then the DFA gives up and searching falls back to one of
the aforementioned algorithms.
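
The core loop has roughly this shape. Everything here is hypothetical and
heavily simplified (the real code in src/dfa.rs packs states far more
compactly), but it shows the cache-hit fast path, lazy state construction,
and the wipe-when-full policy:

```
use std::collections::HashMap;

type StateId = usize;
type NfaStateSet = Vec<usize>; // a sorted set of NFA states

struct LazyDfa {
    cache: HashMap<NfaStateSet, StateId>,
    trans: Vec<[Option<StateId>; 256]>,
    states: Vec<NfaStateSet>,
    max_states: usize,
}

impl LazyDfa {
    /// Fetch or lazily compute the transition out of `cur` on `byte`.
    /// Returns None when the cache was wiped, signaling the caller to
    /// restart or, if wipes are too frequent, fall back to another engine.
    fn next(&mut self, cur: StateId, byte: u8) -> Option<StateId> {
        if let Some(next) = self.trans[cur][byte as usize] {
            return Some(next); // cache hit: the common, fast path
        }
        // Cache miss: run the NFA one step to build the new DFA state.
        let set = nfa_step(&self.states[cur], byte);
        if self.states.len() >= self.max_states {
            self.cache.clear();
            self.trans.clear();
            self.states.clear();
            return None;
        }
        let next = match self.cache.get(&set) {
            Some(&id) => id,
            None => {
                let id = self.states.len();
                self.cache.insert(set.clone(), id);
                self.states.push(set);
                self.trans.push([None; 256]);
                id
            }
        };
        self.trans[cur][byte as usize] = Some(next);
        Some(next)
    }
}

// Hypothetical stand-in: advance every NFA state in `set` on `byte`.
fn nfa_step(_set: &NfaStateSet, _byte: u8) -> NfaStateSet {
    unimplemented!()
}
```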

All of the above matching engines expose precisely the same matching semantics.
This is indeed tested. (See the section below about testing.)

The following sub-sections describe the rest of the library and how each of the
matching engines is actually used.

### Parsing

Regular expressions are parsed using the regex-syntax crate, which is
maintained in this repository. The regex-syntax crate defines an abstract
syntax and provides very detailed error messages when a parse error is
encountered. Parsing is done in a separate crate so that others may benefit
from its existence, and because it is relatively divorced from the rest of the
regex library.

The regex-syntax crate also provides sophisticated support for extracting
prefix and suffix literals from regular expressions.

### Compilation

The compiler is in src/compile.rs. The input to the compiler is some abstract
syntax for a regular expression and the output is a sequence of opcodes that
matching engines use to execute a search. (One can think of matching engines as
mini virtual machines.) The sequence of opcodes is a particular encoding of a
non-deterministic finite automaton. In particular, the opcodes explicitly rely
on epsilon transitions.

Consider a simple regular expression like `a|b`. Its compiled form looks like
this:

    000 Save(0)
    001 Split(2, 3)
    002 'a' (goto: 4)
    003 'b'
    004 Save(1)
    005 Match

The first column is the instruction pointer and the second column is the
instruction. Save instructions indicate that the current position in the input
should be stored in a captured location. Split instructions represent a binary
branch in the program (i.e., epsilon transitions). The instructions `'a'` and
`'b'` indicate that the literal bytes `'a'` or `'b'` should match.

In older versions of this library, the compilation looked like this:

    000 Save(0)
    001 Split(2, 3)
    002 'a'
    003 Jump(5)
    004 'b'
    005 Save(1)
    006 Match

In particular, empty instructions that merely served to move execution from one
point in the program to another were removed. Instead, every instruction has a
`goto` pointer embedded into it. This resulted in a small performance boost for
the Pike VM, because it was one fewer epsilon transition that it had to follow.
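
In code, that embedding looks something like the following (hypothetical
names; the real instruction set is defined in src/prog.rs):

```
type InstPtr = usize;

enum Inst {
    /// Store the current input position in capture slot `slot`.
    Save { slot: usize, goto: InstPtr },
    /// An epsilon transition: try `goto1` first, then `goto2`.
    Split { goto1: InstPtr, goto2: InstPtr },
    /// Match a single character, then continue at `goto`.
    Char { c: char, goto: InstPtr },
    /// The regex matched.
    Match,
}

/// The compiled form of `a|b` from above, with every `goto` explicit, so
/// no separate Jump instruction is needed.
fn alt_a_b() -> Vec<Inst> {
    vec![
        Inst::Save { slot: 0, goto: 1 },
        Inst::Split { goto1: 2, goto2: 3 },
        Inst::Char { c: 'a', goto: 4 },
        Inst::Char { c: 'b', goto: 4 },
        Inst::Save { slot: 1, goto: 5 },
        Inst::Match,
    ]
}
```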

There exist more instructions and they are defined and documented in
src/prog.rs.

Compilation has several knobs and a few unfortunately complicated invariants.
Namely, the output of compilation can be one of two types of programs: a
program that executes on Unicode scalar values or a program that executes
on raw bytes. In the former case, the matching engine is responsible for
performing UTF-8 decoding and executing instructions using Unicode codepoints.
In the latter case, the program handles UTF-8 decoding implicitly, so that the
matching engine can execute on raw bytes. All matching engines can execute
either Unicode or byte based programs except for the lazy DFA, which requires
byte based programs. In general, both representations were kept because (1) the
lazy DFA requires byte based programs so that states can be encoded in a memory
efficient manner and (2) the Pike VM benefits greatly from inlining Unicode
character classes into fewer instructions, as it results in fewer epsilon
transitions.

N.B. UTF-8 decoding is built into the compiled program by making use of the
utf8-ranges crate. The compiler in this library factors out common suffixes to
reduce the size of huge character classes (e.g., `\pL`).
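
You can see what those byte based ranges look like by querying the
utf8-ranges crate directly. A small example (output shown from memory, so the
exact ranges may differ):

```
extern crate utf8_ranges;

use utf8_ranges::Utf8Sequences;

fn main() {
    // Every codepoint in the Basic Multilingual Plane, expressed as
    // alternations of byte ranges that a byte based program can execute:
    for seq in Utf8Sequences::new('\u{0}', '\u{FFFF}') {
        println!("{:?}", seq);
    }
    // Prints something like:
    //   [0-7F]
    //   [C2-DF][80-BF]
    //   [E0][A0-BF][80-BF]
    //   ...
}
```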

A regrettable consequence of this split in instruction sets is that we
generally need to compile two programs: one for NFA execution and one for the
lazy DFA.

In fact, it is worse than that: the lazy DFA is not capable of finding the
starting location of a match in a single scan, and must instead execute a
backwards search after finding the end location. To execute a backwards search,
we must have compiled the regular expression *in reverse*.

This means that every compilation of a regular expression generally results in
three distinct programs. It would be possible to lazily compile the Unicode
program, since it is never needed if (1) the regular expression uses no word
boundary assertions and (2) the caller never asks for sub-capture locations.
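
Put together, a full-match search with the lazy DFA is a two-pass affair,
roughly like this sketch (the helper functions are hypothetical stand-ins for
the forward and reverse DFA runs orchestrated in src/exec.rs):

```
/// Find the span of a match using two DFA passes.
fn find(haystack: &[u8]) -> Option<(usize, usize)> {
    // Pass 1: a forward scan can only report where the match *ends*.
    let end = forward_dfa_find_end(haystack)?;
    // Pass 2: scan backwards from that end with a program compiled from
    // the reversed regex to recover where the match *starts*.
    let start = reverse_dfa_find_start(&haystack[..end]);
    Some((start, end))
}

fn forward_dfa_find_end(_haystack: &[u8]) -> Option<usize> {
    unimplemented!()
}

fn reverse_dfa_find_start(_haystack: &[u8]) -> usize {
    unimplemented!()
}
```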

### Execution

At the time of writing, there are four matching engines in this library:

1. The Pike VM (supports captures).
2. Bounded backtracking (supports captures).
3. Literal substring or multi-substring search.
4. Lazy DFA (no support for Unicode word boundary assertions).

Only the first two matching engines are capable of executing every regular
expression program. They also happen to be the slowest, which means we need
some logic that (1) knows various facts about the regular expression and (2)
knows what the caller wants. Using this information, we can determine which
engine (or engines) to use.

The logic for choosing which engine to execute is in src/exec.rs and is
documented on the Exec type. Exec values contain regular expression Programs
(defined in src/prog.rs), which contain all the necessary tidbits for actually
executing a regular expression on search text.

For the most part, the execution logic is straightforward and follows the
limitations of each engine described above pretty faithfully. The hairiest
part of src/exec.rs by far is the execution of the lazy DFA, since it requires
a forwards and backwards search, and then falls back to either the Pike VM or
backtracking if the caller requested capture locations.
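
In caricature, the decision looks something like this (all names here are
hypothetical; the real logic on the Exec type weighs many more details):

```
const BACKTRACK_LIMIT: usize = 256 * 1024; // hypothetical visited-set budget

struct ExecInfo {
    is_pure_literal: bool,
    has_unicode_word_boundary: bool,
    num_insts: usize,
}

enum MatchKind { PositionOnly, WithCaptures }

enum Engine { Literal, LazyDfa, LazyDfaThenNfa, BoundedBacktracker, PikeVm }

fn choose_engine(re: &ExecInfo, text: &[u8], kind: MatchKind) -> Engine {
    if re.is_pure_literal {
        Engine::Literal
    } else if !re.has_unicode_word_boundary {
        match kind {
            // The DFA alone can report match positions...
            MatchKind::PositionOnly => Engine::LazyDfa,
            // ...but capture locations need an NFA engine afterwards.
            MatchKind::WithCaptures => Engine::LazyDfaThenNfa,
        }
    } else if re.num_insts * (text.len() + 1) <= BACKTRACK_LIMIT {
        Engine::BoundedBacktracker
    } else {
        Engine::PikeVm
    }
}
```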

The Exec type also contains mutable scratch space for each type of matching
engine. This scratch space is used during search (for example, for the lazy
DFA, it contains compiled states that are reused on subsequent searches).

### Programs

A regular expression program is essentially a sequence of opcodes produced by
the compiler plus various facts about the regular expression (such as whether
it is anchored, its capture names, etc.).
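
Reusing the `Inst` sketch from the compilation section above, a program's
rough shape is something like this (field names hypothetical; see src/prog.rs
for the real definition):

```
struct Program {
    /// The opcodes produced by the compiler.
    insts: Vec<Inst>,
    /// Capture group names, indexed by capture slot.
    capture_names: Vec<Option<String>>,
    /// Whether the regex can only match at the beginning of the input.
    is_anchored_start: bool,
    /// Literal prefixes extracted for the literal engine.
    prefixes: Vec<Vec<u8>>,
}
```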

### The regex! macro

The `regex!` macro no longer exists. It was developed in a bygone era as a
compiler plugin during the infancy of the regex crate. Back then, the only
matching engine in the crate was the Pike VM. The `regex!` macro was, itself,
also a Pike VM. The only advantages it offered over the dynamic Pike VM that
was built at runtime were the following:

1. Syntax checking was done at compile time. Your Rust program wouldn't
   compile if your regex didn't compile.
2. Reduction of overhead that was proportional to the size of the regex.
   For the most part, this overhead consisted of heap allocation, which
   was nearly eliminated in the compiler plugin.

The main takeaway here is that the compiler plugin was a marginally faster
version of a slow regex engine. As the regex crate evolved, it grew other regex
engines (DFA, bounded backtracker) and sophisticated literal optimizations.
The `regex!` macro didn't keep pace, and it therefore became (dramatically)
slower than the dynamic engines. The only reason left to use it was the compile
time guarantee that your regex is correct. Fortunately, Clippy (the Rust lint
tool) has a lint that checks the validity of your regular expressions, which
mostly replaces that use case.

Additionally, the regex compiler plugin stopped receiving maintenance. Nobody
complained. At that point, it seemed prudent to just remove it.

Will a compiler plugin be brought back? The future is murky, but there is
definitely an opportunity there to build something that is faster than the
dynamic engines in some cases. But it will be challenging! As of now, there
are no plans to work on this.


## Testing

A key aspect of any mature regex library is its test suite. A subset of the
tests in this library come from Glenn Fowler's AT&T test suite (its online
presence seems gone at the time of writing). The source of the test suite is
located in src/testdata. The script scripts/regex-match-tests.py takes the
test suite in src/testdata and generates tests/matches.rs.

There are also many other manually crafted tests and regression tests in
tests/tests.rs. Some of these tests were taken from RE2.

The biggest source of complexity in the tests is related to answering this
question: how can we reuse the tests to check all of our matching engines? One
approach would have been to encode every test into some kind of format (like
the AT&T test suite) and generate test code for each matching engine. The
approach we use in this library is to create a Cargo.toml entry point for each
matching engine we want to test. The entry points are:

* `tests/test_default.rs` - tests `Regex::new`
* `tests/test_default_bytes.rs` - tests `bytes::Regex::new`
* `tests/test_nfa.rs` - tests `Regex::new`, forced to use the NFA
  algorithm on every regex.
* `tests/test_nfa_bytes.rs` - tests `Regex::new`, forced to use the NFA
  algorithm on every regex and use *arbitrary* byte based programs.
* `tests/test_nfa_utf8bytes.rs` - tests `Regex::new`, forced to use the NFA
  algorithm on every regex and use *UTF-8* byte based programs.
* `tests/test_backtrack.rs` - tests `Regex::new`, forced to use
  backtracking on every regex.
* `tests/test_backtrack_bytes.rs` - tests `Regex::new`, forced to use
  backtracking on every regex and use *arbitrary* byte based programs.
* `tests/test_backtrack_utf8bytes.rs` - tests `Regex::new`, forced to use
  backtracking on every regex and use *UTF-8* byte based programs.
* `tests/test_crates_regex.rs` - tests to make sure that all of the
  backends behave in the same way against a number of quickcheck
  generated random inputs. These tests need to be enabled through
  the `RUST_REGEX_RANDOM_TEST` environment variable (see below).

The lazy DFA and pure literal engines are absent from this list because
they cannot be used on every regular expression. Instead, we rely on
`tests/test_dynamic.rs` to test the lazy DFA and literal engines when possible.

Since the tests are repeated several times, and because `cargo test` runs all
entry points, it can take a while to compile everything. To reduce compile
times slightly, try using `cargo test --test default`, which will only use the
`tests/test_default.rs` entry point.

The random testing takes quite a while, so it is not enabled by default.
In order to run the random testing you can set the
`RUST_REGEX_RANDOM_TEST` environment variable to anything before
invoking `cargo test`. Note that this variable is inspected at compile
time, so if the tests don't seem to be running, you may need to run
`cargo clean`.
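
One plausible way such compile-time gating works (a sketch, not necessarily
the crate's exact mechanism): `option_env!` bakes the variable's presence into
the compiled test binary, which is exactly why a stale build needs
`cargo clean`:

```
#[test]
fn backends_agree_on_random_inputs() {
    if option_env!("RUST_REGEX_RANDOM_TEST").is_none() {
        // Skipped unless the variable was set when the tests were
        // *compiled*, not when they are run.
        return;
    }
    // ... quickcheck-driven comparison of all the backends ...
}
```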

## Benchmarking

The benchmarking in this crate is made up of many micro-benchmarks. Currently,
there are two primary sets of benchmarks: the benchmarks that were adopted
at this library's inception (in `bench/src/misc.rs`) and a newer set of
benchmarks meant to test various optimizations. Specifically, the latter set
contains some analysis and is in `bench/src/sherlock.rs`. Also, the latter
set is executed on the same lengthy input, whereas the former benchmarks
are executed on strings of varying length.

There is also a smattering of benchmarks for parsing and compilation.

Benchmarks are in a separate crate so that their dependencies can be managed
separately from the main regex crate.

Benchmarking follows a similarly wonky setup to the tests. There are multiple
entry points:

* `bench_rust.rs` - benchmarks `Regex::new`
* `bench_rust_bytes.rs` - benchmarks `bytes::Regex::new`
* `bench_pcre.rs` - benchmarks PCRE
* `bench_onig.rs` - benchmarks Oniguruma

The PCRE and Oniguruma benchmarks exist as a comparison point against mature
regular expression libraries. In general, this regex library compares favorably
(there are even a few benchmarks that PCRE simply runs too slowly on or
outright can't execute at all). I would love to add benchmarks for other
regular expression libraries (especially RE2).

If you're hacking on one of the matching engines and just want to see
benchmarks, then all you need to run is:

    $ (cd bench && ./run rust)

If you want to compare your results with older benchmarks, then try:

    $ (cd bench && ./run rust | tee old)
    $ ... make it faster
    $ (cd bench && ./run rust | tee new)
    $ cargo benchcmp old new --improvements

The `cargo-benchcmp` utility is available here:
https://github.com/BurntSushi/cargo-benchcmp

The `./bench/run` utility can run benchmarks for PCRE and Oniguruma too. See
`./bench/bench --help`.

## Dev Docs

When digging your teeth into the codebase for the first time, the
crate documentation can be a great resource. By default `rustdoc`
will strip out all documentation of private crate members in an
effort to help consumers of the crate focus on the *interface*
without having to concern themselves with the *implementation*.
Normally this is a great thing, but if you want to start hacking
on regex internals it is not what you want. Many of the private members
of this crate are well documented with rustdoc style comments, and
it would be a shame to miss out on the opportunity they present.
You can generate the private docs with:

```
$ rustdoc --crate-name docs src/lib.rs -o target/doc -L target/debug/deps --no-defaults --passes collapse-docs --passes unindent-comments
```

Then just point your browser at `target/doc/regex/index.html`.

See https://github.com/rust-lang/rust/issues/15347 for more info
about generating developer docs for internal use.