regex-automata
==============
A low level regular expression library that uses deterministic finite automata.
It supports a rich syntax with Unicode support, has extensive options for
configuring the best space vs time trade off for your use case and provides
support for cheap deserialization of automata for use in `no_std` environments.

[Build status](https://github.com/BurntSushi/regex-automata/actions)
[crates.io](https://crates.io/crates/regex-automata)

Dual-licensed under MIT or the [UNLICENSE](https://unlicense.org/).


### Documentation

https://docs.rs/regex-automata


### Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
regex-automata = "0.1"
```

and this to your crate root (if you're using Rust 2015):

```rust
extern crate regex_automata;
```

### Example: basic regex searching

This example shows how to compile a regex using the default configuration
and then use it to find matches in a byte string:

```rust
use regex_automata::Regex;

let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
let text = b"2018-12-24 2016-10-08";
let matches: Vec<(usize, usize)> = re.find_iter(text).collect();
assert_eq!(matches, vec![(0, 10), (11, 21)]);
```

For more examples and information about the various knobs that can be turned,
please see the [docs](https://docs.rs/regex-automata).


### Support for `no_std`

This crate comes with a `std` feature that is enabled by default. When the
`std` feature is enabled, the API of this crate will include the facilities
necessary for compiling, serializing, deserializing and searching with regular
expressions. When the `std` feature is disabled, the API of this crate will
shrink such that it only includes the facilities necessary for deserializing
and searching with regular expressions.

The intended workflow for `no_std` environments is thus as follows:

* Write a program with the `std` feature that compiles and serializes a
  regular expression. Serialization should only happen after first converting
  the DFAs to use a fixed size state identifier instead of the default `usize`.
  You may also need to serialize both little and big endian versions of each
  DFA. (So that's 4 DFAs in total for each regex.)
* In your `no_std` environment, follow the examples above for deserializing
  your previously serialized DFAs into regexes. You can then search with them
  as you would any regex.

Deserialization can happen anywhere. For example, with bytes embedded into a
binary or with a file memory mapped at runtime.

Note that the
[`ucd-generate`](https://github.com/BurntSushi/ucd-generate)
tool will do the first step for you with its `dfa` or `regex` sub-commands.


### Cargo features

* `std` - **Enabled** by default. This enables the ability to compile finite
  automata. This requires the `regex-syntax` dependency. Without this feature
  enabled, finite automata can only be used for searching (using the approach
  described above).
* `transducer` - **Disabled** by default. This provides implementations of the
  `Automaton` trait found in the `fst` crate. This permits using finite
  automata generated by this crate to search finite state transducers. This
  requires the `fst` dependency.
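For the deserialize-and-search-only configuration described above, the dependency declaration would look like this (the version number is taken from the usage section; this is the standard Cargo mechanism for disabling a default feature):

```toml
[dependencies]
regex-automata = { version = "0.1", default-features = false }
```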


### Differences with the regex crate

The main goal of the [`regex`](https://docs.rs/regex) crate is to serve as a
general purpose regular expression engine. It aims to automatically balance low
compile times, fast search times and low memory usage, while also providing
a convenient API for users. In contrast, this crate provides a lower level
regular expression interface that is a bit less convenient while providing more
explicit control over memory usage and search times.

Here are some specific negative differences:

* **Compilation can take an exponential amount of time and space** in the size
  of the regex pattern. While most patterns do not exhibit worst case
  exponential time, such patterns do exist. For example, `[01]*1[01]{N}` will
  build a DFA with `2^(N+1)` states. For this reason, untrusted patterns should
  not be compiled with this library. (In the future, the API may expose an
  option to return an error if the DFA gets too big.)
* This crate does not support sub-match extraction, which can be achieved with
  the regex crate's "captures" API. This may be added in the future, but is
  unlikely.
* While the regex crate doesn't necessarily sport fast compilation times, the
  regexes in this crate are almost universally slow to compile, especially when
  they contain large Unicode character classes. For example, on my system,
  compiling `\w{3}` with byte classes enabled takes just over 1 second and
  almost 5MB of memory! (Compiling a sparse regex takes about the same time
  but only uses about 500KB of memory.) Conversely, compiling the same regex
  without Unicode support, e.g., `(?-u)\w{3}`, takes under 1 millisecond and
  less than 5KB of memory. For this reason, you should only use Unicode
  character classes if you absolutely need them!
* This crate does not support regex sets.
* This crate does not support zero-width assertions such as `^`, `$`, `\b` or
  `\B`.
* As a lower level crate, this library does not do literal optimizations. In
  exchange, you get predictable performance regardless of input. The
  philosophy here is that literal optimizations should be applied at a higher
  level, although there is no easy support for this in the ecosystem yet.
* There is no `&str` API like in the regex crate. In this crate, all APIs
  operate on `&[u8]`. By default, match indices are guaranteed to fall on
  UTF-8 boundaries, unless `RegexBuilder::allow_invalid_utf8` is enabled.

With some of the downsides out of the way, here are some positive differences:

* Both dense and sparse DFAs can be serialized to raw bytes, and then cheaply
  deserialized. Deserialization always takes constant time since searching can
  be performed directly on the raw serialized bytes of a DFA.
* This crate was specifically designed so that the searching phase of a DFA has
  minimal runtime requirements, and can therefore be used in `no_std`
  environments. While `no_std` environments cannot compile regexes, they can
  deserialize pre-compiled regexes.
* Since this crate builds DFAs ahead of time, it will generally out-perform
  the `regex` crate on equivalent tasks. The performance difference is likely
  not large. However, because of a complex set of optimizations in the regex
  crate (like literal optimizations), an accurate performance comparison may be
  difficult to do.
* Sparse DFAs provide a way to build a DFA ahead of time that sacrifices search
  performance a small amount, but uses much less storage space. Potentially
  even less than what the regex crate uses.
* This crate exposes DFAs directly, such as `DenseDFA` and `SparseDFA`,
  which enables one to do less work in some cases. For example, if you only
  need the end of a match and not the start of a match, then you can use a DFA
  directly without building a `Regex`, which always requires a second DFA to
  find the start of a match.
* Aside from choosing between dense and sparse DFAs, there are several options
  for configuring the space usage vs search time trade off. These range from
  choosing a smaller state identifier representation and premultiplying state
  identifiers to splitting a DFA's alphabet into equivalence classes. Finally,
  DFA minimization is also provided, but can increase compilation times
  dramatically.


### Future work

* Look into being smarter about generating NFA states for large Unicode
  character classes. These can create a lot of additional work for both the
  determinizer and the minimizer, and I suspect this is the key thing we'll
  want to improve if we want to make DFA compile times faster. I *believe*
  it's possible to potentially build minimal or nearly minimal NFAs for the
  special case of Unicode character classes by leveraging Daciuk's algorithms
  for building minimal automata in linear time for sets of strings. See
  https://blog.burntsushi.net/transducers/#construction for more details. The
  key adaptation I think we need to make is to modify the algorithm to operate
  on byte ranges instead of enumerating every codepoint in the set. Otherwise,
  it might not be worth doing.
* Add support for regex sets. It should be possible to do this by "simply"
  introducing more match states. I think we can also report the positions at
  each match, similar to how Aho-Corasick works. I think the long pole in the
  tent here is probably the API design work and arranging it so that we don't
  introduce extra overhead into the non-regex-set case without duplicating a
  lot of code. It seems doable.
* Stretch goal: support capturing groups by implementing "tagged" DFAs
  (transducers). Laurikari's paper is the usual reference here, but Trofimovich
  has a much more thorough treatment:
  https://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf
  I've only read the paper once. I suspect it will require at least a few more
  read-throughs before I understand it.
  See also: https://re2c.org
* Possibly less ambitious goal: can we select a portion of Trofimovich's work
  to make small fixed length look-around work? It would be really nice to
  support `^`, `$` and `\b`, especially the Unicode variant of `\b` and a
  CRLF aware `$`.
* Experiment with generating Rust code. There is an early experiment in
  src/codegen.rs that is thoroughly bit-rotted. At the time, I was
  experimenting with whether or not codegen would significantly decrease the
  size of a DFA, since if you squint hard enough, it's kind of like a sparse
  representation. However, it didn't shrink as much as I thought it would, so
  I gave up. The other problem is that Rust doesn't support gotos, so I don't
  even know whether the "match on each state in a loop" approach will be fast
  enough. Either way, it's probably a good option to have. For one thing, it
  would be endian independent, whereas the serialization format of the DFAs in
  this crate is endian dependent (so you need two versions of every DFA, but
  you only need to compile one of them for any given arch).
* Experiment with unrolling the match loops and fill out the benchmarks.
* Add some kind of streaming API. I believe users of the library can already
  implement something for this outside of the crate, but it would be good to
  provide an official API. The key thing here is figuring out the API. I
  suspect we might want to support several variants.
* Make a decision on whether or not there is room for literal optimizations
  in this crate. My original intent was to not let this crate sink down into
  that very very very deep rabbit hole. But instead, we might want to provide
  some way for literal optimizations to hook into the match routines. The right
  path forward here is to probably build something outside of the crate and
  then see about integrating it. After all, users can implement their own
  match routines just as efficiently as what the crate provides.
* A key downside of DFAs is that they can take up a lot of memory and can be
  quite costly to build. Their worst case compilation time is O(2^n), where
  n is the number of NFA states. A paper by Yang and Prasanna (2011) actually
  seems to provide a way to characterize state blow up such that it is
  detectable. If we could know whether a regex will exhibit state explosion or
  not, then we could make an intelligent decision about whether to
  ahead-of-time compile a DFA.