 Your friendly guide to understanding the performance characteristics of this
 crate.

 This guide assumes some familiarity with the public API of this crate, which
 can be found here: http://doc.rust-lang.org/regex/regex/index.html

 ## Theory vs. Practice

 One of the design goals of this crate is to provide worst case linear time
 behavior with respect to the text searched using finite state automata. This
 means that, *in theory*, the performance of this crate is much better than most
 regex implementations, which typically use backtracking and therefore have
 worst case exponential time.

 For example, try opening a Python interpreter and typing this:

     >>> import re
     >>> re.search('(a*)*c', 'a' * 30).span()

 I'll wait.

 At some point, you'll figure out that it won't terminate any time soon. ^C it.

 The promise of this crate is that *this pathological behavior can't happen*.

 With that said, just because we have protected ourselves against worst case
 exponential behavior doesn't mean we are immune from large constant factors
 or places where the current regex engine isn't quite optimal. This guide will
 detail those cases and provide guidance on how to avoid them, among other
 bits of general advice.
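To make the difference concrete, here is a minimal sketch that counts the work done on `(a*)*c` by two hand-written matchers. Neither is this crate's code: `backtrack` is a hypothetical naive backtracker (it requires each inner `a*` to consume at least one `a`, so that it terminates), and `automaton` is a hand-built one-pass automaton for the same pattern. Both are anchored at the start of the haystack.

```rust
/// Naive backtracking for `(a*)*c`, anchored at `pos`.
/// Every call counts as one step.
fn backtrack_at(hay: &[u8], pos: usize, steps: &mut u64) -> bool {
    *steps += 1;
    // Length of the run of `a` bytes starting at `pos`.
    let run = hay[pos..].iter().take_while(|&&b| b == b'a').count();
    // Greedy: let the inner `a*` take the longest run first, then shorter ones.
    for k in (1..=run).rev() {
        if backtrack_at(hay, pos + k, steps) {
            return true;
        }
    }
    // No more iterations of the outer group: the next byte must be `c`.
    hay.get(pos) == Some(&b'c')
}

/// Returns (matched, steps) for the backtracking sketch.
fn backtrack(hay: &[u8]) -> (bool, u64) {
    let mut steps = 0;
    let matched = backtrack_at(hay, 0, &mut steps);
    (matched, steps)
}

/// One-pass automaton for the same pattern: stay in the looping state on
/// `a`, accept on `c`, reject on anything else. One step per byte.
fn automaton(hay: &[u8]) -> (bool, u64) {
    let mut steps = 0;
    for &b in hay {
        steps += 1;
        match b {
            b'a' => continue,
            b'c' => return (true, steps),
            _ => return (false, steps),
        }
    }
    (false, steps)
}
```

On a haystack of `n` copies of `a`, the backtracking sketch takes `2^n` steps before giving up, while the automaton takes `n`.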

 ## Thou Shalt Not Compile Regular Expressions In A Loop

 **Advice**: Use `lazy_static` to amortize the cost of `Regex` compilation.

 Don't do it unless you really don't mind paying for it. Compiling a regular
 expression in this crate is quite expensive. It is conceivable that it may get
 faster some day, but I wouldn't hold out hope for, say, an order of magnitude
 improvement. In particular, compilation can take anywhere from a few dozen
 microseconds to a few dozen milliseconds. Yes, milliseconds. Unicode character
 classes, in particular, have the largest impact on compilation performance. At
 the time of writing, for example, `\pL{100}` takes around 44ms to compile. This
 is because `\pL` corresponds to every letter in Unicode and compilation must
 turn it into a proper automaton that decodes a subset of UTF-8 which
 corresponds to those letters. Compilation also spends some cycles shrinking the
 size of the automaton.

 This means that in order to realize efficient regex matching, one must
 *amortize the cost of compilation*. Trivially, if a call to `is_match` is
 inside a loop, then make sure your call to `Regex::new` is *outside* that loop.

 In many programming languages, regular expressions can be conveniently defined
 and compiled in a global scope, and code can reach out and use them as if
 they were global static variables. In Rust, there is really no concept of
 life-before-main, and therefore, one cannot utter this:

     static MY_REGEX: Regex = Regex::new("...").unwrap();

 Unfortunately, this would seem to imply that one must pass `Regex` objects
 around to everywhere they are used, which can be especially painful depending
 on how your program is structured. Thankfully, the
 [`lazy_static`](https://crates.io/crates/lazy_static)
 crate provides an answer that works well:

     #[macro_use] extern crate lazy_static;
     extern crate regex;

     use regex::Regex;

     fn some_helper_function(text: &str) -> bool {
         lazy_static! {
             static ref MY_REGEX: Regex = Regex::new("...").unwrap();
         }
         MY_REGEX.is_match(text)
     }

 In other words, the `lazy_static!` macro enables us to define a `Regex` *as if*
 it were a global static value. What is actually happening under the covers is
 that the code inside the macro (i.e., `Regex::new(...)`) is run on *first use*
 of `MY_REGEX` via a `Deref` impl. The implementation is admittedly magical, but
 it's self-contained and everything works exactly as you expect. In particular,
 `MY_REGEX` can be used from multiple threads without wrapping it in an `Arc` or
 a `Mutex`. On that note...

 ## Using a regex from multiple threads

 **Advice**: The performance impact from using a `Regex` from multiple threads
 is likely negligible. If necessary, clone the `Regex` so that each thread gets
 its own copy. Cloning a regex does not incur any memory overhead beyond what
 would be used by sharing a `Regex` across multiple threads simultaneously.
 *Its only cost is ergonomics.*

 It is supported and encouraged to define your regexes using `lazy_static!` as
 if they were global static values, and then use them to search text from
 multiple threads simultaneously.

 One might imagine that this is possible because a `Regex` represents a
 *compiled* program, so that any allocation or mutation is already done, and is
 therefore read-only. Unfortunately, this is not true. Each type of search
 strategy in this crate requires some kind of mutable scratch space to use
 *during search*. For example, when executing a DFA, its states are computed
 lazily and reused on subsequent searches. Those states go into that mutable
 scratch space.

 The mutable scratch space is an implementation detail, and in general, its
 mutation should not be observable from users of this crate. Therefore, it uses
 interior mutability. This implies that `Regex` can either only be used from one
 thread, or it must do some sort of synchronization. Either choice is
 reasonable, but this crate chooses the latter, in particular because it is
 ergonomic and makes use with `lazy_static!` straightforward.

 Synchronization implies *some* amount of overhead. When a `Regex` is used from
 a single thread, this overhead is negligible. When a `Regex` is used from
 multiple threads simultaneously, it is possible for the overhead of
 synchronization from contention to impact performance. Specifically,
 contention may happen if you call any of these methods repeatedly from
 multiple threads simultaneously:

 * shortest_match
 * is_match
 * find
 * captures

 In particular, every invocation of one of these methods must synchronize with
 other threads to retrieve its mutable scratch space before searching can start.
 If, however, you are using one of these methods:

 * find_iter
 * captures_iter

 then you may not suffer from contention since the cost of synchronization is
 amortized on *construction of the iterator*. That is, the mutable scratch space
 is obtained when the iterator is created and retained throughout its lifetime.
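The difference between the two groups of methods can be modeled with a small sketch. Everything here is hypothetical (`PooledMatcher`, `checkout`, `checkin`, and the single-byte "pattern" are stand-ins, not this crate's internals); the only point is to show where the synchronized access happens: once per one-shot call, versus once per iterator.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical model of a compiled pattern (here: a single byte) that keeps
/// reusable scratch buffers in a synchronized pool.
pub struct PooledMatcher {
    needle: u8,
    pool: Mutex<Vec<Vec<usize>>>,
    checkouts: AtomicUsize, // how many times scratch space was fetched
}

impl PooledMatcher {
    pub fn new(needle: u8) -> Self {
        PooledMatcher { needle, pool: Mutex::new(Vec::new()), checkouts: AtomicUsize::new(0) }
    }

    /// Synchronized: take a scratch buffer out of the pool (or make a new one).
    fn checkout(&self) -> Vec<usize> {
        self.checkouts.fetch_add(1, Ordering::Relaxed);
        self.pool.lock().unwrap().pop().unwrap_or_default()
    }

    /// Synchronized: hand the scratch buffer back for reuse.
    fn checkin(&self, scratch: Vec<usize>) {
        self.pool.lock().unwrap().push(scratch);
    }

    /// One-shot search: every call pays for a checkout.
    pub fn find(&self, hay: &[u8]) -> Option<usize> {
        let mut scratch = self.checkout();
        let hit = search(self.needle, hay, 0, &mut scratch);
        self.checkin(scratch);
        hit
    }

    /// Iterator: one checkout on construction, retained until drop.
    pub fn find_iter<'a>(&'a self, hay: &'a [u8]) -> Matches<'a> {
        Matches { owner: self, hay, pos: 0, scratch: self.checkout() }
    }

    pub fn checkouts(&self) -> usize {
        self.checkouts.load(Ordering::Relaxed)
    }
}

/// Stand-in for a search routine that mutates scratch space while it runs.
fn search(needle: u8, hay: &[u8], start: usize, scratch: &mut Vec<usize>) -> Option<usize> {
    scratch.clear();
    let found = hay[start..].iter().position(|&b| b == needle).map(|i| start + i);
    scratch.extend(found);
    found
}

pub struct Matches<'a> {
    owner: &'a PooledMatcher,
    hay: &'a [u8],
    pos: usize,
    scratch: Vec<usize>,
}

impl<'a> Iterator for Matches<'a> {
    type Item = usize;
    fn next(&mut self) -> Option<usize> {
        let hit = search(self.owner.needle, self.hay, self.pos, &mut self.scratch)?;
        self.pos = hit + 1;
        Some(hit)
    }
}

impl<'a> Drop for Matches<'a> {
    fn drop(&mut self) {
        let scratch = std::mem::take(&mut self.scratch);
        self.owner.checkin(scratch);
    }
}
```

In this model, three calls to `find` perform three checkouts, while iterating over all three matches with `find_iter` performs one.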

 ## Only ask for what you need

 **Advice**: Prefer in this order: `is_match`, `find`, `captures`.

 There are three primary search methods on a `Regex`:

 * is_match
 * find
 * captures

 In general, these are ordered from fastest to slowest.

 `is_match` is fastest because it doesn't actually need to find the start or the
 end of the leftmost-first match. It can quit immediately after it knows there
 is a match. For example, given the regex `a+` and the haystack `aaaaa`, the
 search will quit after examining the first byte.

 In contrast, `find` must return both the start and end location of the
 leftmost-first match. It can use the DFA matcher for this, but must run it
 forwards once to find the end of the match *and then run it backwards* to find
 the start of the match. The two scans and the cost of finding the real end of
 the leftmost-first match make this more expensive than `is_match`.

 `captures` is the most expensive of them all because it must do what `find`
 does, and then run either the bounded backtracker or the Pike VM to fill in the
 capture group locations. Both of these are simulations of an NFA, which must
 spend a lot of time shuffling states around. The DFA limits the performance hit
 somewhat by restricting the amount of text that must be searched via an NFA
 simulation.

 One other method not mentioned is `shortest_match`. This method has precisely
 the same performance characteristics as `is_match`, except that it returns the
 end location at which it discovered a match. For example, given the regex `a+`
 and the haystack `aaaaa`, `shortest_match` may return `1` as opposed to `5`,
 the latter being the correct end location of the leftmost-first match.
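The `a+` example can be sketched with three hand-written matchers for that one regex. These are illustrative stand-ins with made-up names, not the crate's methods, and they do not reproduce the forward-then-backward DFA scan described above; they only show how much of the haystack each kind of question forces the search to examine.

```rust
/// Like `is_match` for `a+`: stop at the first byte that proves a match.
/// Returns the answer and the number of bytes examined.
fn is_match_a_plus(hay: &[u8]) -> (bool, usize) {
    for (i, &b) in hay.iter().enumerate() {
        if b == b'a' {
            return (true, i + 1);
        }
    }
    (false, hay.len())
}

/// Like `shortest_match` for `a+`: same scan, but report the offset at
/// which the match was discovered.
fn shortest_match_a_plus(hay: &[u8]) -> Option<usize> {
    hay.iter().position(|&b| b == b'a').map(|i| i + 1)
}

/// Like `find` for `a+`: report the start and end of the leftmost-first
/// match, which means scanning to the end of the run of `a` bytes.
fn find_a_plus(hay: &[u8]) -> Option<(usize, usize)> {
    let start = hay.iter().position(|&b| b == b'a')?;
    let len = hay[start..].iter().take_while(|&&b| b == b'a').count();
    Some((start, start + len))
}
```

On the haystack `aaaaa`, the first function examines one byte, the second reports `1`, and the third reports the span `(0, 5)`.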

 ## Literals in your regex may make it faster

 **Advice**: Literals can reduce the work that the regex engine needs to do. Use
 them if you can, especially as prefixes.

 In particular, if your regex starts with a prefix literal, the prefix is
 quickly searched before entering the (much slower) regex engine. For example,
 given the regex `foo\w+`, the literal `foo` will be searched for using
 Boyer-Moore. If there's no match, then no regex engine is ever used. Only when
 there's a match is the regex engine invoked at the location of the match, which
 effectively permits the regex engine to skip large portions of a haystack.
 If a regex is composed entirely of literals (possibly more than one), then
 it's possible that the regex engine can be avoided entirely even when there's a
 match.

 When one literal is found, Boyer-Moore is used. When multiple literals are
 found, then an optimized version of Aho-Corasick is used.

 This crate extends this optimization quite a bit. Here are
 a few examples of regexes that get literal prefixes detected:

 * `(foo|bar)` detects `foo` and `bar`
 * `(a|b)c` detects `ac` and `bc`
 * `[ab]foo[yz]` detects `afooy`, `afooz`, `bfooy` and `bfooz`
 * `a?b` detects `a` and `b`
 * `a*b` detects `a` and `b`
 * `(ab){3,6}` detects `ababab`

 Literals in anchored regexes can also be used for detecting non-matches very
 quickly. For example, `^foo\w+` and `\w+foo$` may be able to detect a non-match
 just by examining the first (or last) three bytes of the haystack.
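The prefix idea for `foo\w+` can be sketched with the standard library alone. This is an assumption-laden illustration, not the crate's implementation: `str::find` stands in for the fast literal search (the crate's own searcher is described above as Boyer-Moore), and `\w` is approximated as "alphanumeric or underscore". The function name is made up.

```rust
/// Sketch of a prefix-literal prefilter for the regex `foo\w+`.
/// Returns the byte span of the leftmost match, if any.
fn find_foo_word(hay: &str) -> Option<(usize, usize)> {
    let mut at = 0;
    // Fast path: jump straight to each occurrence of the literal `foo`.
    while let Some(offset) = hay[at..].find("foo") {
        let start = at + offset;
        let rest = &hay[start + 3..];
        // Slow path, run only at candidate positions: match `\w+`.
        let word_len: usize = rest
            .chars()
            .take_while(|c| c.is_alphanumeric() || *c == '_')
            .map(|c| c.len_utf8())
            .sum();
        if word_len > 0 {
            return Some((start, start + 3 + word_len));
        }
        // This candidate failed; resume the literal search just past it.
        at = start + 1;
    }
    None
}
```

If the literal never occurs, the loop body never runs and the slower matching step is skipped for the whole haystack.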

 ## Unicode word boundaries may prevent the DFA from being used

 **Advice**: In most cases, `\b` should work well. If not, use `(?-u:\b)`
 instead of `\b` if you care about consistent performance more than correctness.

 This is a sad state of affairs in the current implementation. At the moment,
 the DFA will try to interpret Unicode word boundaries as if they were ASCII
 word boundaries.
 If the DFA comes across any non-ASCII byte, it will quit and fall back to an
 alternative matching engine that can handle Unicode word boundaries correctly.
 The alternate matching engine is generally quite a bit slower (perhaps by an
 order of magnitude). If necessary, this can be ameliorated in two ways.

 The first way is to add some number of literal prefixes to your regular
 expression. Even though the DFA may not be used, specialized routines will
 still kick in to find prefix literals quickly, which limits how much work the
 NFA simulation will need to do.

 The second way is to give up on Unicode and use an ASCII word boundary instead.
 One can use an ASCII word boundary by disabling Unicode support. That is,
 instead of using `\b`, use `(?-u:\b)`. Namely, given the regex `\b.+\b`, it
 can be transformed into a regex that uses the DFA with `(?-u:\b).+(?-u:\b)`. It
 is important to limit the scope of disabling the `u` flag, since it might lead
 to a syntax error if the regex could match arbitrary bytes. For example, if one
 wrote `(?-u)\b.+\b`, then a syntax error would be returned because `.` matches
 any *byte* when the Unicode flag is disabled.

 The second way isn't appreciably different than just using a Unicode word
 boundary in the first place, since the DFA will speculatively interpret it as
 an ASCII word boundary anyway. The key difference is that if an ASCII word
 boundary is used explicitly, then the DFA won't quit in the presence of
 non-ASCII UTF-8 bytes. This results in giving up correctness in exchange for
 more consistent performance.

 N.B. When using `bytes::Regex`, Unicode support is disabled by default, so one
 can simply write `\b` to get an ASCII word boundary.
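To see what "giving up correctness" means here, the following sketch computes word boundary offsets under both definitions using only the standard library. It is a rough model rather than the crate's logic: the helper names are made up, and Unicode `\w` is approximated by `char::is_alphanumeric` plus underscore.

```rust
/// ASCII notion of a word character: `[0-9A-Za-z_]`.
fn is_ascii_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Approximation of the Unicode notion of a word character.
fn is_unicode_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Byte offsets in `hay` where a word boundary occurs, given a definition
/// of "word character".
fn word_boundaries(hay: &str, is_word: fn(char) -> bool) -> Vec<usize> {
    let mut out = Vec::new();
    let mut prev = false; // the position before the haystack is not a word char
    for (i, c) in hay.char_indices() {
        let cur = is_word(c);
        if cur != prev {
            out.push(i);
        }
        prev = cur;
    }
    if prev {
        out.push(hay.len()); // boundary at the end of a trailing word
    }
    out
}
```

For the haystack `"n\u{e9} ok"` (the word `né` followed by `ok`), the Unicode definition places boundaries at offsets 0, 3, 4 and 6, while the ASCII definition reports 0, 1, 4 and 6: it sees a boundary in the middle of `né`. On pure ASCII input the two agree.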

 ## Excessive counting can lead to exponential state blow up in the DFA

 **Advice**: Don't write regexes that cause DFA state blow up if you care about
 match performance.

 Wait, didn't I say that this crate guards against exponential worst cases?
 Well, it turns out that the process of converting an NFA to a DFA can lead to
 an exponential blow up in the number of states. This crate specifically guards
 against exponential blow up by doing two things:

 1. The DFA is computed lazily. That is, a state in the DFA only exists in
    memory if it is visited. In particular, the lazy DFA guarantees that *at
    most* one state is created for every byte of input. This, on its own,
    guarantees linear time complexity.
 2. Of course, creating a new state for *every* byte of input means that search
    will go incredibly slow because of very large constant factors. On top of
    that, creating a state for every byte in a large haystack could result in
    exorbitant memory usage. To ameliorate this, the DFA bounds the number of
    states it can store. Once it reaches its limit, it flushes its cache. This
    prevents reuse of states that it already computed. If the cache is flushed
    too frequently, then the DFA will give up and execution will fall back to
    one of the NFA simulations.

 In effect, this crate will detect exponential state blow up and fall back to
 a search routine with fixed memory requirements. This does, however, mean that
 searching will be much slower than one might expect. Regexes that rely on
 counting in particular are strong aggravators of this behavior. One example is
 matching `[01]*1[01]{20}$` against a random sequence of `0`s and `1`s.
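The blow up for this family of regexes can be measured with a small sketch of the textbook subset construction, applied eagerly to a hand-built NFA for `[01]*1[01]{n}`. This is not the crate's lazy DFA, and the function name and NFA numbering are assumptions made for the illustration. It counts every DFA state reachable from the start state.

```rust
use std::collections::{HashSet, VecDeque};

/// Counts the DFA states produced by subset construction for `[01]*1[01]{n}`.
///
/// Assumed NFA: state 0 loops on either bit and also moves to state 1 on a
/// `1`; state i (1..=n) moves to i + 1 on either bit; state n + 1 accepts.
/// A set of NFA states is stored as a bitmask.
fn dfa_state_count(n: u32) -> usize {
    assert!(n <= 20, "kept small so the demo finishes quickly");
    let step = |set: u64, bit: u8| -> u64 {
        // Drop state 0 and the accepting state, then advance the counters.
        let counting = set & !1 & !(1u64 << (n + 1));
        let mut next = (counting << 1) | 1; // state 0 is always live
        if bit == 1 {
            next |= 1 << 1; // a `1` may be the start of the counted suffix
        }
        next
    };

    let start = 1u64; // only NFA state 0
    let mut seen = HashSet::new();
    let mut queue = VecDeque::new();
    seen.insert(start);
    queue.push_back(start);
    // Breadth-first exploration of every reachable set of NFA states.
    while let Some(set) = queue.pop_front() {
        for bit in 0..2u8 {
            let next = step(set, bit);
            if seen.insert(next) {
                queue.push_back(next);
            }
        }
    }
    seen.len()
}
```

Each DFA state has to remember the last `n + 1` bits seen, so the count is `2^(n + 1)`: 16 states for `n = 3`, 2048 for `n = 10`, and roughly two million for the `{20}` in the example above.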

 In the future, it may be possible to increase the bound that the DFA uses,
 which would allow the caller to choose how much memory they're willing to
 spend.

 ## Resist the temptation to "optimize" regexes

 **Advice**: This ain't a backtracking engine.

 An entire book was written on how to optimize Perl-style regular expressions.
 Most of those techniques are not applicable for this library. For example,
 there is no problem with using non-greedy matching or having lots of
 alternations in your regex.