| pattern_limit should not be defined inside nfa::thompson, but rather at the |
| top-level. |
| |
| ----- |
| |
| Main problem right now is exemplified by the set60 and set70 failing tests. In |
| particular, when finding the starting position while matching multiple regexes |
| simultaneously, the reverse search is messed up. The reverse search doesn't |
| depend on which regex matched in the forward direction, which means it won't |
| always find the correcting starting location. Unfortunately, the only way to |
| fix this, as far as I can tell, is to add a group of start states for every |
| regex in the DFA. Then once we do the reverse search, we need to choose the |
| correct start state based on which regex matched in the forward direction. |
| |
| This is a nasty change. |
| |
| So it looks like this only applies when doing an overlapping search in reverse |
| to find the start of a match. That means we should make this configurable |
| but enable it by default for the reverse automata. It should be configurable |
| so that folks can construct a regex that doesn't have the ability to do |
| overlapping searches correctly. If an overlapping search is attempted with |
| a reverse automaton that lacks starting states for each pattern, then the |
| implementation should panic. |
| |
| BUT! It is also convenient to provide this option in general for folks that |
| want a DFA that can match any pattern while also being able to match a specific |
| pattern. |
| |
| Straw man: |
| |
| * Update dense::Config to have a `starts_for_each_pattern` option. It should |
| be disabled by default. |
| * In `RegexBuilder::build_many_with_size` tweak the reverse DFA configuration |
| to have the aforementioned option enabled. |
| * It would be interesting to add new APIs to `Regex` that support matching |
| specific patterns, but I think this is a complication. If we did want to do |
| this, then we should just add it to the `_at` variants and leave the rest of |
| the API untouched. |
| * Add a `pattern_id: Option<PatternID>` parameter to each of the five |
| `*_at` methods on the `dfa::Automaton` trait. A value of `None` retains the |
| existing behavior. A `Some` value means that the starting state for that |
| specific pattern must be chosen, which in turn implies an anchored search. |
| (This means `starts_for_each_pattern` has utility for single-pattern DFAs |
| since it makes it possible to build a DFA that can do both unanchored and |
| anchored searches.) |
| * Thread this new parameter down into the various functions in `dfa::search` |
| all the way down into `init_fwd` and `init_rev`. These functions will then |
| pass it to `dfa.start_state_{forward,reverse}`. |
| * This is where things get gruesome since we now need to completely re-work how |
| start states are represented in dense and sparse DFAs _and_ it needs to be |
| configurable. It looks like the `Start` type from `dfa::automaton` can |
| basically remain unchanged, since it still represents one of the four |
| possible starting states that will need to be applied for every pattern. |
| * For `dfa::dense`, change `StartList` to `StartTable`. Currently, its only |
| header is the state ID count, which is always 4. We'll want to change this |
| to the stride and add a new header value that encodes the number of patterns. |
| When the number of patterns is zero, then existing behavior is preserved and |
| represents the case where `starts_for_each_pattern` is disabled (or in the |
| case of an empty DFA). When non-zero, a table of starting state IDs is |
| encoded with each row corresponding to the 4 starting states for each |
| pattern. Before this table (even if it's empty), the 4 starting states for |
| the entire DFA are encoded. |
| * For `dfa::sparse`, do the same as above. They are essentially the same right |
| now anyway, with the only difference being that sparse DFAs use `&[u8]` |
| instead of `&[S]` (because sparse DFAs don't have any alignment |
| requirements). |
| * Modify `DFA::empty` to accept a `starts_for_each_pattern` bool that, when |
| true, creates a start table with the header, the start states for the entire |
| DFA and a row of start states for each pattern. When false, no rows are |
| added. |
| * Expose whether there are starting states for each pattern via a predicate |
| on the DFA. |
| * Modify the determinizer's `add_starts` method to basically do what it does, |
| but also do it for each pattern when the DFA is configured for it. It should |
| continue to reuse states as appropriate or not generate new states if they |
| aren't needed. This will want to use the `NFA::start_pattern` method, which |
| provides the starting NFA state ID for the given pattern. |
| * Fix the dense->sparse conversion. At this point, this piece should be fairly |
| straight-forward since the sparse representation of starting states is |
| basically identical to the dense representation. |
| |
| At this point, I think the bug should resolve itself. |
| |
| ^^^^ DONE! IT WORKS! |
| |
| ----- |
| |
| |
| Add top-level SyntaxConfig (or some such) that has all of the regex-syntax |
| options forwarded, but with automata oriented docs. Then use this for all of |
| the engines instead of having to repeat every option for every builder. |
| |
| ----- |
| |
| These produce different results. PCRE2 looks correct. Basically, we should be |
| using the context around the `at` position correctly, which we aren't doing |
| right now. Seems tricky to get right, particularly when confirming the match |
| with a reverse DFA. |
| |
| Maybe our 'at' functions need to take a full range... Sigh. This is indeed what |
| RE2 does. GAH. |
| |
| fn main() { |
| let re = regex::Regex::new(r"(?-u)\b\sbar").unwrap(); |
| let s = "foo bar baz"; |
| println!("{:?}", re.find_at(s, 3).map(|m| m.as_str())); |
| |
| let re = pcre2::bytes::Regex::new(r"\b\sbar").unwrap(); |
| let s = "foo bar baz"; |
| println!("{:?}", re.find_at(s.as_bytes(), 3).unwrap()); |
| } |
| |
| ^^^^ This is fixed now, but we still need to find a way to add test coverage |
| for "context" searches. It'd be nice to do this automatically, but we'll |
| probably just added a new 'context = [start, end]' option. |
| |
| ----- |
| |
| |
| * Create regex-test crate, based on glob-test. Try to anticipate the needs for |
| the full regex test suite. |
| * See if we can clean up tests. |
| * Provide a way to mark a test as expensive. |
| * Provide a way to test is_match_at and find_at. |
| * Test shortest_match_at too? Huge pain. Add tests for it. |
| * Port ALL tests from the regex crate. Will probably need a way to mark a |
| test as skipped. |
| * Document tests better. |
| * Find a way to remove byteorder dependency. |
| * Reorganize crate API: |
| * Have errors contain `Box<Error+Send+Sync>` instead of `String`. |
| * Make errors non-exhaustive. |
| * Audit `StateID` trait for safety. |
| * Brainstorm hard about `DFA` trait and the fact that DenseDFA and SparseDFA |
| have inefficient implementations of some methods. Maybe use multiple |
| traits? Answer: get rid of premultiply/classes knobs and just enable |
| them by default. Should remove a huge amount of code. |
| * Check whether `unsafe` is really needed to eliminate bounds checks. Use |
| micro-benchmarks and bigger CLI workloads using `regex-automata-debug`. |
| * Re-write module docs for `dfa` as they are no longer top-level. Keep most. |
| * Retain any pertinent top-level crate docs, but don't rewrite yet. |
| * Clean up builders if we can. e.g., Determinizer, minimizer, it's all a mess |
| right now. |
| * Clean up and add 'always_match' and 'never_match' constructors for every |
| regex engine. |
| * See about supporting ^, $, \A, \z, \b and \B in DFAs. Do the non-Unicode |
| version of \b unfortunately. Carefully scrutinize how the regex crate's |
| lazy DFA does it and try to make it comprehensible. Done! Except for the |
| part about making it comprehensible. |
| * Rethink prefilters? |
| * Add `regex-automata-generate` CLI tool. This should just be a copy of |
| the `ucd-generate dfa` and `ucd-generate regex` commands. |
| |
| Then build new public `nfa` sub-module. |
| * For Unicode \b, generate \w DFA (forwards and reverse) and embed it into |
| source for fast checking. That way, we don't need to ever do explicit UTF-8 |
| decoding anywhere. Yay. |
| |
| Then `lazy` sub-module. |
| |
| Then `onepass`. |
| |
| Then `jit`. |
| |
| ... and beyond? CRAZY. But it can be done! Build STRONG base layers. |