regex-syntax
============
This crate provides a robust regular expression parser.

### Documentation

https://docs.rs/regex-syntax


### Overview

There are two primary types exported by this crate: `Ast` and `Hir`. The former
is a faithful abstract syntax of a regular expression, and can be converted
back to concrete syntax while mostly preserving the original form. The latter
type is a high-level intermediate representation of a regular expression that
is amenable to analysis and compilation into bytecode or automata. An `Hir`
achieves this by drastically simplifying the syntactic structure of the regular
expression. While an `Hir` can be converted back to an equivalent concrete
syntax, the result is unlikely to resemble the original concrete syntax that
produced the `Hir`.


### Example

This example shows how to parse a pattern string into its HIR:

```rust
use regex_syntax::Parser;
use regex_syntax::hir::{self, Hir};

// Parse the pattern and translate it to its HIR in one step.
let hir = Parser::new().parse("a|b").unwrap();
// The pattern `a|b` becomes an alternation of two literals.
assert_eq!(hir, Hir::alternation(vec![
    Hir::literal(hir::Literal::Unicode('a')),
    Hir::literal(hir::Literal::Unicode('b')),
]));
```


### Safety

This crate has no `unsafe` code and sets `forbid(unsafe_code)`. While it's
possible this crate could use `unsafe` code in the future, the standard
for doing so is extremely high. In general, most code in this crate is not
performance critical, since its running time tends to be dwarfed by the time
it takes to compile a regular expression into an automaton. Therefore, there
is little need for extreme optimization and, by extension, for `unsafe`.

The standard for using `unsafe` in this crate is extremely high because this
crate is intended to be reasonably safe to use with user-supplied regular
expressions. Therefore, while there may be bugs in the regex parser itself,
they should _never_ result in memory unsafety unless there is a bug in either
the compiler or the standard library, since `regex-syntax` has zero
dependencies.


### Crate features

By default, this crate bundles a fairly large amount of Unicode data tables
(a source size of ~750KB). Because of their large size, one can disable some
or all of these data tables. If a regular expression attempts to use Unicode
data that is not available, then an error will occur when translating the `Ast`
to the `Hir`.

The full set of features one can disable is listed
[in the "Crate features" section of the documentation](https://docs.rs/regex-syntax/*/#crate-features).


### Testing

Simply running `cargo test` will give you very good coverage. However, because
of the large number of features exposed by this crate, a `test` script is
included in this directory which will test several feature combinations. This
is the same script that is run in CI.


### Motivation

The primary purpose of this crate is to provide the parser used by `regex`.
Specifically, this crate is treated as an implementation detail of `regex`,
and is primarily developed for the needs of `regex`.

Since this crate is an implementation detail of `regex`, it may experience
breaking change releases at a different cadence from `regex`. This is only
possible because this crate is _not_ a public dependency of `regex`.

Another consequence of this decoupling is that there is no direct way to
compile a `regex::Regex` from a `regex_syntax::hir::Hir`. Instead, one must
first convert the `Hir` to a string (via its `std::fmt::Display`
implementation) and then compile that via `Regex::new`. While this does repeat
some work, compilation typically takes much longer than parsing.

Stated differently, the coupling between `regex` and `regex-syntax` exists only
at the level of the concrete syntax.