vendor/regex-automata-0.2.0/PLANS.md - toolchain/rustc - Git at Google

 pattern_limit should not be defined inside nfa::thompson, but rather at the
 top-level.

 -----

 Main problem right now is exemplified by the set60 and set70 failing tests. In
 particular, when finding the starting position while matching multiple regexes
 simultaneously, the reverse search is messed up. The reverse search doesn't
 depend on which regex matched in the forward direction, which means it won't
 always find the correcting starting location. Unfortunately, the only way to
 fix this, as far as I can tell, is to add a group of start states for every
 regex in the DFA. Then once we do the reverse search, we need to choose the
 correct start state based on which regex matched in the forward direction.

 This is a nasty change.

 So it looks like this only applies when doing an overlapping search in reverse
 to find the start of a match. That means we should make this configurable
 but enable it by default for the reverse automata. It should be configurable
 so that folks can construct a regex that doesn't have the ability to do
 overlapping searches correctly. If an overlapping search is attempted with
 a reverse automaton that lacks starting states for each pattern, then the
 implementation should panic.

 BUT! It is also convenient to provide this option in general for folks that
 want a DFA that can match any pattern while also being able to match a specific
 pattern.

 Straw man:

 * Update dense::Config to have a `starts_for_each_pattern` option. It should
   be disabled by default.
 * In `RegexBuilder::build_many_with_size` tweak the reverse DFA configuration
   to have the aforementioned option enabled.
 * It would be interesting to add new APIs to `Regex` that support matching
   specific patterns, but I think this is a complication. If we did want to do
   this, then we should just add it to the `_at` variants and leave the rest of
   the API untouched.
 * Add a `pattern_id: Option<PatternID>` parameter to each of the five
   `*_at` methods on the `dfa::Automaton` trait. A value of `None` retains the
   existing behavior. A `Some` value means that the starting state for that
   specific pattern must be chosen, which in turn implies an anchored search.
   (This means `starts_for_each_pattern` has utility for single-pattern DFAs
   since it makes it possible to build a DFA that can do both unanchored and
   anchored searches.)
 * Thread this new parameter down into the various functions in `dfa::search`
   all the way down into `init_fwd` and `init_rev`. These functions will then
   pass it to `dfa.start_state_{forward,reverse}`.
 * This is where things get gruesome since we now need to completely re-work how
   start states are represented in dense and sparse DFAs _and_ it needs to be
   configurable. It looks like the `Start` type from `dfa::automaton` can
   basically remain unchanged, since it still represents one of the four
   possible starting states that will need to be applied for every pattern.
 * For `dfa::dense`, change `StartList` to `StartTable`. Currently, its only
   header is the state ID count, which is always 4. We'll want to change this
   to the stride and add a new header value that encodes the number of patterns.
   When the number of patterns is zero, then existing behavior is preserved and
   represents the case where `starts_for_each_pattern` is disabled (or in the
   case of an empty DFA). When non-zero, a table of starting state IDs is
   encoded with each row corresponding to the 4 starting states for each
   pattern. Before this table (even if it's empty), the 4 starting states for
   the entire DFA are encoded.
 * For `dfa::sparse`, do the same as above. They are essentially the same right
   now anyway, with the only difference being that sparse DFAs use `&[u8]`
   instead of `&[S]` (because sparse DFAs don't have any alignment
   requirements).
 * Modify `DFA::empty` to accept a `starts_for_each_pattern` bool that, when
   true, creates a start table with the header, the start states for the entire
   DFA and a row of start states for each pattern. When false, no rows are
   added.
 * Expose whether there are starting states for each pattern via a predicate
   on the DFA.
 * Modify the determinizer's `add_starts` method to basically do what it does,
   but also do it for each pattern when the DFA is configured for it. It should
   continue to reuse states as appropriate or not generate new states if they
   aren't needed. This will want to use the `NFA::start_pattern` method, which
   provides the starting NFA state ID for the given pattern.
 * Fix the dense->sparse conversion. At this point, this piece should be fairly
   straight-forward since the sparse representation of starting states is
   basically identical to the dense representation.

 At this point, I think the bug should resolve itself.

 ^^^^ DONE! IT WORKS!

 -----


 Add top-level SyntaxConfig (or some such) that has all of the regex-syntax
 options forwarded, but with automata oriented docs. Then use this for all of
 the engines instead of having to repeat every option for every builder.

 -----

 These produce different results. PCRE2 looks correct. Basically, we should be
 using the context around the `at` position correctly, which we aren't doing
 right now. Seems tricky to get right, particularly when confirming the match
 with a reverse DFA.

 Maybe our 'at' functions need to take a full range... Sigh. This is indeed what
 RE2 does. GAH.

 fn main() {
     let re = regex::Regex::new(r"(?-u)\b\sbar").unwrap();
     let s = "foo bar baz";
     println!("{:?}", re.find_at(s, 3).map(|m| m.as_str()));

     let re = pcre2::bytes::Regex::new(r"\b\sbar").unwrap();
     let s = "foo bar baz";
     println!("{:?}", re.find_at(s.as_bytes(), 3).unwrap());
 }

 ^^^^ This is fixed now, but we still need to find a way to add test coverage
 for "context" searches. It'd be nice to do this automatically, but we'll
 probably just added a new 'context = [start, end]' option.

 -----


 * Create regex-test crate, based on glob-test. Try to anticipate the needs for
   the full regex test suite.
   * See if we can clean up tests.
     * Provide a way to mark a test as expensive.
     * Provide a way to test is_match_at and find_at.
     * Test shortest_match_at too? Huge pain. Add tests for it.
     * Port ALL tests from the regex crate. Will probably need a way to mark a
       test as skipped.
     * Document tests better.
 * Find a way to remove byteorder dependency.
 * Reorganize crate API:
   * Have errors contain `Box<Error+Send+Sync>` instead of `String`.
   * Make errors non-exhaustive.
   * Audit `StateID` trait for safety.
   * Brainstorm hard about `DFA` trait and the fact that DenseDFA and SparseDFA
     have inefficient implementations of some methods. Maybe use multiple
     traits? Answer: get rid of premultiply/classes knobs and just enable
     them by default. Should remove a huge amount of code.
   * Check whether `unsafe` is really needed to eliminate bounds checks. Use
     micro-benchmarks and bigger CLI workloads using `regex-automata-debug`.
   * Re-write module docs for `dfa` as they are no longer top-level. Keep most.
   * Retain any pertinent top-level crate docs, but don't rewrite yet.
   * Clean up builders if we can. e.g., Determinizer, minimizer, it's all a mess
     right now.
   * Clean up and add 'always_match' and 'never_match' constructors for every
     regex engine.
   * See about supporting ^, $, \A, \z, \b and \B in DFAs. Do the non-Unicode
     version of \b unfortunately. Carefully scrutinize how the regex crate's
     lazy DFA does it and try to make it comprehensible. Done! Except for the
     part about making it comprehensible.
 * Rethink prefilters?
 * Add `regex-automata-generate` CLI tool. This should just be a copy of
   the `ucd-generate dfa` and `ucd-generate regex` commands.

 Then build new public `nfa` sub-module.
   * For Unicode \b, generate \w DFA (forwards and reverse) and embed it into
     source for fast checking. That way, we don't need to ever do explicit UTF-8
     decoding anywhere. Yay.

 Then `lazy` sub-module.

 Then `onepass`.

 Then `jit`.

 ... and beyond? CRAZY. But it can be done! Build STRONG base layers.
	pattern_limit should not be defined inside nfa::thompson, but rather at the
	top-level.

	-----

	Main problem right now is exemplified by the set60 and set70 failing tests. In
	particular, when finding the starting position while matching multiple regexes
	simultaneously, the reverse search is messed up. The reverse search doesn't
	depend on which regex matched in the forward direction, which means it won't
	always find the correcting starting location. Unfortunately, the only way to
	fix this, as far as I can tell, is to add a group of start states for every
	regex in the DFA. Then once we do the reverse search, we need to choose the
	correct start state based on which regex matched in the forward direction.

	This is a nasty change.

	So it looks like this only applies when doing an overlapping search in reverse
	to find the start of a match. That means we should make this configurable
	but enable it by default for the reverse automata. It should be configurable
	so that folks can construct a regex that doesn't have the ability to do
	overlapping searches correctly. If an overlapping search is attempted with
	a reverse automaton that lacks starting states for each pattern, then the
	implementation should panic.

	BUT! It is also convenient to provide this option in general for folks that
	want a DFA that can match any pattern while also being able to match a specific
	pattern.

	Straw man:

	* Update dense::Config to have a `starts_for_each_pattern` option. It should
	be disabled by default.
	* In `RegexBuilder::build_many_with_size` tweak the reverse DFA configuration
	to have the aforementioned option enabled.
	* It would be interesting to add new APIs to `Regex` that support matching
	specific patterns, but I think this is a complication. If we did want to do
	this, then we should just add it to the `_at` variants and leave the rest of
	the API untouched.
	* Add a `pattern_id: Option<PatternID>` parameter to each of the five
	`*_at` methods on the `dfa::Automaton` trait. A value of `None` retains the
	existing behavior. A `Some` value means that the starting state for that
	specific pattern must be chosen, which in turn implies an anchored search.
	(This means `starts_for_each_pattern` has utility for single-pattern DFAs
	since it makes it possible to build a DFA that can do both unanchored and
	anchored searches.)
	* Thread this new parameter down into the various functions in `dfa::search`
	all the way down into `init_fwd` and `init_rev`. These functions will then
	pass it to `dfa.start_state_{forward,reverse}`.
	* This is where things get gruesome since we now need to completely re-work how
	start states are represented in dense and sparse DFAs _and_ it needs to be
	configurable. It looks like the `Start` type from `dfa::automaton` can
	basically remain unchanged, since it still represents one of the four
	possible starting states that will need to be applied for every pattern.
	* For `dfa::dense`, change `StartList` to `StartTable`. Currently, its only
	header is the state ID count, which is always 4. We'll want to change this
	to the stride and add a new header value that encodes the number of patterns.
	When the number of patterns is zero, then existing behavior is preserved and
	represents the case where `starts_for_each_pattern` is disabled (or in the
	case of an empty DFA). When non-zero, a table of starting state IDs is
	encoded with each row corresponding to the 4 starting states for each
	pattern. Before this table (even if it's empty), the 4 starting states for
	the entire DFA are encoded.
	* For `dfa::sparse`, do the same as above. They are essentially the same right
	now anyway, with the only difference being that sparse DFAs use `&[u8]`
	instead of `&[S]` (because sparse DFAs don't have any alignment
	requirements).
	* Modify `DFA::empty` to accept a `starts_for_each_pattern` bool that, when
	true, creates a start table with the header, the start states for the entire
	DFA and a row of start states for each pattern. When false, no rows are
	added.
	* Expose whether there are starting states for each pattern via a predicate
	on the DFA.
	* Modify the determinizer's `add_starts` method to basically do what it does,
	but also do it for each pattern when the DFA is configured for it. It should
	continue to reuse states as appropriate or not generate new states if they
	aren't needed. This will want to use the `NFA::start_pattern` method, which
	provides the starting NFA state ID for the given pattern.
	* Fix the dense->sparse conversion. At this point, this piece should be fairly
	straight-forward since the sparse representation of starting states is
	basically identical to the dense representation.

	At this point, I think the bug should resolve itself.

	^^^^ DONE! IT WORKS!

	-----


	Add top-level SyntaxConfig (or some such) that has all of the regex-syntax
	options forwarded, but with automata oriented docs. Then use this for all of
	the engines instead of having to repeat every option for every builder.

	-----

	These produce different results. PCRE2 looks correct. Basically, we should be
	using the context around the `at` position correctly, which we aren't doing
	right now. Seems tricky to get right, particularly when confirming the match
	with a reverse DFA.

	Maybe our 'at' functions need to take a full range... Sigh. This is indeed what
	RE2 does. GAH.

	fn main() {
	let re = regex::Regex::new(r"(?-u)\b\sbar").unwrap();
	let s = "foo bar baz";
	println!("{:?}", re.find_at(s, 3).map(\|m\| m.as_str()));

	let re = pcre2::bytes::Regex::new(r"\b\sbar").unwrap();
	let s = "foo bar baz";
	println!("{:?}", re.find_at(s.as_bytes(), 3).unwrap());
	}

	^^^^ This is fixed now, but we still need to find a way to add test coverage
	for "context" searches. It'd be nice to do this automatically, but we'll
	probably just added a new 'context = [start, end]' option.

	-----


	* Create regex-test crate, based on glob-test. Try to anticipate the needs for
	the full regex test suite.
	* See if we can clean up tests.
	* Provide a way to mark a test as expensive.
	* Provide a way to test is_match_at and find_at.
	* Test shortest_match_at too? Huge pain. Add tests for it.
	* Port ALL tests from the regex crate. Will probably need a way to mark a
	test as skipped.
	* Document tests better.
	* Find a way to remove byteorder dependency.
	* Reorganize crate API:
	* Have errors contain `Box<Error+Send+Sync>` instead of `String`.
	* Make errors non-exhaustive.
	* Audit `StateID` trait for safety.
	* Brainstorm hard about `DFA` trait and the fact that DenseDFA and SparseDFA
	have inefficient implementations of some methods. Maybe use multiple
	traits? Answer: get rid of premultiply/classes knobs and just enable
	them by default. Should remove a huge amount of code.
	* Check whether `unsafe` is really needed to eliminate bounds checks. Use
	micro-benchmarks and bigger CLI workloads using `regex-automata-debug`.
	* Re-write module docs for `dfa` as they are no longer top-level. Keep most.
	* Retain any pertinent top-level crate docs, but don't rewrite yet.
	* Clean up builders if we can. e.g., Determinizer, minimizer, it's all a mess
	right now.
	* Clean up and add 'always_match' and 'never_match' constructors for every
	regex engine.
	* See about supporting ^, $, \A, \z, \b and \B in DFAs. Do the non-Unicode
	version of \b unfortunately. Carefully scrutinize how the regex crate's
	lazy DFA does it and try to make it comprehensible. Done! Except for the
	part about making it comprehensible.
	* Rethink prefilters?
	* Add `regex-automata-generate` CLI tool. This should just be a copy of
	the `ucd-generate dfa` and `ucd-generate regex` commands.

	Then build new public `nfa` sub-module.
	* For Unicode \b, generate \w DFA (forwards and reverse) and embed it into
	source for fast checking. That way, we don't need to ever do explicit UTF-8
	decoding anywhere. Yay.

	Then `lazy` sub-module.

	Then `onepass`.

	Then `jit`.

	... and beyond? CRAZY. But it can be done! Build STRONG base layers.