android/vendor/aho-corasick-1.1.3/DESIGN.md - toolchain/cargo-deny - Git at Google

 This document describes the internal design of this crate, which is an object
 lesson in what happens when you take a fairly simple old algorithm like
 Aho-Corasick and make it fast and production ready.

 The target audience of this document is Rust programmers that have some
 familiarity with string searching, however, one does not need to know the
 Aho-Corasick algorithm in order to read this (it is explained below). One
 should, however, know what a trie is. (If you don't, go read its Wikipedia
 article.)

 The center-piece of this crate is an implementation of Aho-Corasick. On its
 own, Aho-Corasick isn't that complicated. The complex pieces come from the
 different variants of Aho-Corasick implemented in this crate. Specifically,
 they are:

 * Aho-Corasick as a noncontiguous NFA. States have their transitions
   represented sparsely, and each state puts its transitions in its own separate
   allocation. Hence the same "noncontiguous."
 * Aho-Corasick as a contiguous NFA. This NFA uses a single allocation to
   represent the transitions of all states. That is, transitions are laid out
   contiguously in memory. Moreover, states near the starting state are
   represented densely, such that finding the next state ID takes a constant
   number of instructions.
 * Aho-Corasick as a DFA. In this case, all states are represented densely in
   a transition table that uses one allocation.
 * Supporting "standard" match semantics, along with its overlapping variant,
   in addition to leftmost-first and leftmost-longest semantics. The "standard"
   semantics are typically what you see in a textbook description of
   Aho-Corasick. However, Aho-Corasick is also useful as an optimization in
   regex engines, which often use leftmost-first or leftmost-longest semantics.
   Thus, it is useful to implement those semantics here. The "standard" and
   "leftmost" search algorithms are subtly different, and also require slightly
   different construction algorithms.
 * Support for ASCII case insensitive matching.
 * Support for accelerating searches when the patterns all start with a small
   number of fixed bytes. Or alternatively, when the  patterns all contain a
   small number of rare bytes. (Searching for these bytes uses SIMD vectorized
   code courtesy of `memchr`.)
 * Transparent support for alternative SIMD vectorized search routines for
   smaller number of literals, such as the Teddy algorithm. We called these
   "packed" search routines because they use SIMD. They can often be an order of
   magnitude faster than just Aho-Corasick, but don't scale as well.
 * Support for searching streams. This can reuse most of the underlying code,
   but does require careful buffering support.
 * Support for anchored searches, which permit efficient "is prefix" checks for
   a large number of patterns.

 When you combine all of this together along with trying to make everything as
 fast as possible, what you end up with is enitrely too much code with too much
 `unsafe`. Alas, I was not smart enough to figure out how to reduce it. Instead,
 we will explain it.


 # Basics

 The fundamental problem this crate is trying to solve is to determine the
 occurrences of possibly many patterns in a haystack. The naive way to solve
 this is to look for a match for each pattern at each position in the haystack:

     for i in 0..haystack.len():
       for p in patterns.iter():
         if haystack[i..].starts_with(p.bytes()):
           return Match(p.id(), i, i + p.bytes().len())

 Those four lines are effectively all this crate does. The problem with those
 four lines is that they are very slow, especially when you're searching for a
 large number of patterns.

 While there are many different algorithms available to solve this, a popular
 one is Aho-Corasick. It's a common solution because it's not too hard to
 implement, scales quite well even when searching for thousands of patterns and
 is generally pretty fast. Aho-Corasick does well here because, regardless of
 the number of patterns you're searching for, it always visits each byte in the
 haystack exactly once. This means, generally speaking, adding more patterns to
 an Aho-Corasick automaton does not make it slower. (Strictly speaking, however,
 this is not true, since a larger automaton will make less effective use of the
 CPU's cache.)

 Aho-Corasick can be succinctly described as a trie with state transitions
 between some of the nodes that efficiently instruct the search algorithm to
 try matching alternative keys in the trie. The trick is that these state
 transitions are arranged such that each byte of input needs to be inspected
 only once. These state transitions are typically called "failure transitions,"
 because they instruct the searcher (the thing traversing the automaton while
 reading from the haystack) what to do when a byte in the haystack does not
 correspond to a valid transition in the current state of the trie.

 More formally, a failure transition points to a state in the automaton that may
 lead to a match whose prefix is a proper suffix of the path traversed through
 the trie so far. (If no such proper suffix exists, then the failure transition
 points back to the start state of the trie, effectively restarting the search.)
 This is perhaps simpler to explain pictorally. For example, let's say we built
 an Aho-Corasick automaton with the following patterns: 'abcd' and 'cef'. The
 trie looks like this:

          a - S1 - b - S2 - c - S3 - d - S4*
         /
     S0 - c - S5 - e - S6 - f - S7*

 where states marked with a `*` are match states (meaning, the search algorithm
 should stop and report a match to the caller).

 So given this trie, it should be somewhat straight-forward to see how it can
 be used to determine whether any particular haystack *starts* with either
 `abcd` or `cef`. It's easy to express this in code:

     fn has_prefix(trie: &Trie, haystack: &[u8]) -> bool {
       let mut state_id = trie.start();
       // If the empty pattern is in trie, then state_id is a match state.
       if trie.is_match(state_id) {
         return true;
       }
       for (i, &b) in haystack.iter().enumerate() {
         state_id = match trie.next_state(state_id, b) {
           Some(id) => id,
           // If there was no transition for this state and byte, then we know
           // the haystack does not start with one of the patterns in our trie.
           None => return false,
         };
         if trie.is_match(state_id) {
           return true;
         }
       }
       false
     }

 And that's pretty much it. All we do is move through the trie starting with the
 bytes at the beginning of the haystack. If we find ourselves in a position
 where we can't move, or if we've looked through the entire haystack without
 seeing a match state, then we know the haystack does not start with any of the
 patterns in the trie.

 The meat of the Aho-Corasick algorithm is in how we add failure transitions to
 our trie to keep searching efficient. Specifically, it permits us to not only
 check whether a haystack *starts* with any one of a number of patterns, but
 rather, whether the haystack contains any of a number of patterns *anywhere* in
 the haystack.

 As mentioned before, failure transitions connect a proper suffix of the path
 traversed through the trie before, with a path that leads to a match that has a
 prefix corresponding to that proper suffix. So in our case, for patterns `abcd`
 and `cef`, with a haystack `abcef`, we want to transition to state `S5` (from
 the diagram above) from `S3` upon seeing that the byte following `c` is not
 `d`. Namely, the proper suffix in this example is `c`, which is a prefix of
 `cef`. So the modified diagram looks like this:


          a - S1 - b - S2 - c - S3 - d - S4*
         /                      /
        /       ----------------
       /       /
     S0 - c - S5 - e - S6 - f - S7*

 One thing that isn't shown in this diagram is that *all* states have a failure
 transition, but only `S3` has a *non-trivial* failure transition. That is, all
 other states have a failure transition back to the start state. So if our
 haystack was `abzabcd`, then the searcher would transition back to `S0` after
 seeing `z`, which effectively restarts the search. (Because there is no pattern
 in our trie that has a prefix of `bz` or `z`.)

 The code for traversing this *automaton* or *finite state machine* (it is no
 longer just a trie) is not that much different from the `has_prefix` code
 above:

     fn contains(fsm: &FiniteStateMachine, haystack: &[u8]) -> bool {
       let mut state_id = fsm.start();
       // If the empty pattern is in fsm, then state_id is a match state.
       if fsm.is_match(state_id) {
         return true;
       }
       for (i, &b) in haystack.iter().enumerate() {
         // While the diagram above doesn't show this, we may wind up needing
         // to follow multiple failure transitions before we land on a state
         // in which we can advance. Therefore, when searching for the next
         // state, we need to loop until we don't see a failure transition.
         //
         // This loop terminates because the start state has no empty
         // transitions. Every transition from the start state either points to
         // another state, or loops back to the start state.
         loop {
           match fsm.next_state(state_id, b) {
             Some(id) => {
               state_id = id;
               break;
             }
             // Unlike our code above, if there was no transition for this
             // state, then we don't quit. Instead, we look for this state's
             // failure transition and follow that instead.
             None => {
               state_id = fsm.next_fail_state(state_id);
             }
           };
         }
         if fsm.is_match(state_id) {
           return true;
         }
       }
       false
     }

 Other than the complication around traversing failure transitions, this code
 is still roughly "traverse the automaton with bytes from the haystack, and quit
 when a match is seen."

 And that concludes our section on the basics. While we didn't go deep into how
 the automaton is built (see `src/nfa/noncontiguous.rs`, which has detailed
 comments about that), the basic structure of Aho-Corasick should be reasonably
 clear.


 # NFAs and DFAs

 There are generally two types of finite automata: non-deterministic finite
 automata (NFA) and deterministic finite automata (DFA). The difference between
 them is, principally, that an NFA can be in multiple states at once. This is
 typically accomplished by things called _epsilon_ transitions, where one could
 move to a new state without consuming any bytes from the input. (The other
 mechanism by which NFAs can be in more than one state is where the same byte in
 a particular state transitions to multiple distinct states.) In contrast, a DFA
 can only ever be in one state at a time. A DFA has no epsilon transitions, and
 for any given state, a byte transitions to at most one other state.

 By this formulation, the Aho-Corasick automaton described in the previous
 section is an NFA. This is because failure transitions are, effectively,
 epsilon transitions. That is, whenever the automaton is in state `S`, it is
 actually in the set of states that are reachable by recursively following
 failure transitions from `S` until you reach the start state. (This means
 that, for example, the start state is always active since the start state is
 reachable via failure transitions from any state in the automaton.)

 NFAs have a lot of nice properties. They tend to be easier to construct, and
 also tend to use less memory. However, their primary downside is that they are
 typically slower to execute a search with. For example, the code above showing
 how to search with an Aho-Corasick automaton needs to potentially iterate
 through many failure transitions for every byte of input. While this is a
 fairly small amount of overhead, this can add up, especially if the automaton
 has a lot of overlapping patterns with a lot of failure transitions.

 A DFA's search code, by contrast, looks like this:

     fn contains(dfa: &DFA, haystack: &[u8]) -> bool {
       let mut state_id = dfa.start();
       // If the empty pattern is in dfa, then state_id is a match state.
       if dfa.is_match(state_id) {
         return true;
       }
       for (i, &b) in haystack.iter().enumerate() {
         // An Aho-Corasick DFA *never* has a missing state that requires
         // failure transitions to be followed. One byte of input advances the
         // automaton by one state. Always.
         state_id = dfa.next_state(state_id, b);
         if dfa.is_match(state_id) {
           return true;
         }
       }
       false
     }

 The search logic here is much simpler than for the NFA, and this tends to
 translate into significant performance benefits as well, since there's a lot
 less work being done for each byte in the haystack. How is this accomplished?
 It's done by pre-following all failure transitions for all states for all bytes
 in the alphabet, and then building a single state transition table. Building
 this DFA can be much more costly than building the NFA, and use much more
 memory, but the better performance can be worth it.

 Users of this crate can actually choose between using one of two possible NFAs
 (noncontiguous or contiguous) or a DFA. By default, a contiguous NFA is used,
 in most circumstances, but if the number of patterns is small enough a DFA will
 be used. A contiguous NFA is chosen because it uses orders of magnitude less
 memory than a DFA, takes only a little longer to build than a noncontiguous
 NFA and usually gets pretty close to the search speed of a DFA. (Callers can
 override this automatic selection via the `AhoCorasickBuilder::start_kind`
 configuration.)


 # More DFA tricks

 As described in the previous section, one of the downsides of using a DFA
 is that it uses more memory and can take longer to build. One small way of
 mitigating these concerns is to map the alphabet used by the automaton into
 a smaller space. Typically, the alphabet of a DFA has 256 elements in it:
 one element for each possible value that fits into a byte. However, in many
 cases, one does not need the full alphabet. For example, if all patterns in an
 Aho-Corasick automaton are ASCII letters, then this only uses up 52 distinct
 bytes. As far as the automaton is concerned, the rest of the 204 bytes are
 indistinguishable from one another: they will never disrciminate between a
 match or a non-match. Therefore, in cases like that, the alphabet can be shrunk
 to just 53 elements. One for each ASCII letter, and then another to serve as a
 placeholder for every other unused byte.

 In practice, this library doesn't quite compute the optimal set of equivalence
 classes, but it's close enough in most cases. The key idea is that this then
 allows the transition table for the DFA to be potentially much smaller. The
 downside of doing this, however, is that since the transition table is defined
 in terms of this smaller alphabet space, every byte in the haystack must be
 re-mapped to this smaller space. This requires an additional 256-byte table.
 In practice, this can lead to a small search time hit, but it can be difficult
 to measure. Moreover, it can sometimes lead to faster search times for bigger
 automata, since it could be difference between more parts of the automaton
 staying in the CPU cache or not.

 One other trick for DFAs employed by this crate is the notion of premultiplying
 state identifiers. Specifically, the normal way to compute the next transition
 in a DFA is via the following (assuming that the transition table is laid out
 sequentially in memory, in row-major order, where the rows are states):

     next_state_id = dfa.transitions[current_state_id * 256 + current_byte]

 However, since the value `256` is a fixed constant, we can actually premultiply
 the state identifiers in the table when we build the table initially. Then, the
 next transition computation simply becomes:

     next_state_id = dfa.transitions[current_state_id + current_byte]

 This doesn't seem like much, but when this is being executed for every byte of
 input that you're searching, saving that extra multiplication instruction can
 add up.

 The same optimization works even when equivalence classes are enabled, as
 described above. The only difference is that the premultiplication is by the
 total number of equivalence classes instead of 256.

 There isn't much downside to premultiplying state identifiers, other than it
 imposes a smaller limit on the total number of states in the DFA. Namely, with
 premultiplied state identifiers, you run out of room in your state identifier
 representation more rapidly than if the identifiers are just state indices.

 Both equivalence classes and premultiplication are always enabled. There is a
 `AhoCorasickBuilder::byte_classes` configuration, but disabling this just makes
 it so there are always 256 equivalence classes, i.e., every class corresponds
 to precisely one byte. When it's disabled, the equivalence class map itself is
 still used. The purpose of disabling it is when one is debugging the underlying
 automaton. It can be easier to comprehend when it uses actual byte values for
 its transitions instead of equivalence classes.


 # Match semantics

 One of the more interesting things about this implementation of Aho-Corasick
 that (as far as this author knows) separates it from other implementations, is
 that it natively supports leftmost-first and leftmost-longest match semantics.
 Briefly, match semantics refer to the decision procedure by which searching
 will disambiguate matches when there are multiple to choose from:

 * **standard** match semantics emits matches as soon as they are detected by
   the automaton. This is typically equivalent to the textbook non-overlapping
   formulation of Aho-Corasick.
 * **leftmost-first** match semantics means that 1) the next match is the match
   starting at the leftmost position and 2) among multiple matches starting at
   the same leftmost position, the match corresponding to the pattern provided
   first by the caller is reported.
 * **leftmost-longest** is like leftmost-first, except when there are multiple
   matches starting at the same leftmost position, the pattern corresponding to
   the longest match is returned.

 (The crate API documentation discusses these differences, with examples, in
 more depth on the `MatchKind` type.)

 The reason why supporting these match semantics is important is because it
 gives the user more control over the match procedure. For example,
 leftmost-first permits users to implement match priority by simply putting the
 higher priority patterns first. Leftmost-longest, on the other hand, permits
 finding the longest possible match, which might be useful when trying to find
 words matching a dictionary. Additionally, regex engines often want to use
 Aho-Corasick as an optimization when searching for an alternation of literals.
 In order to preserve correct match semantics, regex engines typically can't use
 the standard textbook definition directly, since regex engines will implement
 either leftmost-first (Perl-like) or leftmost-longest (POSIX) match semantics.

 Supporting leftmost semantics requires a couple key changes:

 * Constructing the Aho-Corasick automaton changes a bit in both how the trie is
   constructed and how failure transitions are found. Namely, only a subset
   of the failure transitions are added. Specifically, only the failure
   transitions that either do not occur after a match or do occur after a match
   but preserve that match are kept. (More details on this can be found in
   `src/nfa/noncontiguous.rs`.)
 * The search algorithm changes slightly. Since we are looking for the leftmost
   match, we cannot quit as soon as a match is detected. Instead, after a match
   is detected, we must keep searching until either the end of the input or
   until a dead state is seen. (Dead states are not used for standard match
   semantics. Dead states mean that searching should stop after a match has been
   found.)

 Most other implementations of Aho-Corasick do support leftmost match semantics,
 but they do it with more overhead at search time, or even worse, with a queue
 of matches and sophisticated hijinks to disambiguate the matches. While our
 construction algorithm becomes a bit more complicated, the correct match
 semantics fall out from the structure of the automaton itself.


 # Overlapping matches

 One of the nice properties of an Aho-Corasick automaton is that it can report
 all possible matches, even when they overlap with one another. In this mode,
 the match semantics don't matter, since all possible matches are reported.
 Overlapping searches work just like regular searches, except the state
 identifier at which the previous search left off is carried over to the next
 search, so that it can pick up where it left off. If there are additional
 matches at that state, then they are reported before resuming the search.

 Enabling leftmost-first or leftmost-longest match semantics causes the
 automaton to use a subset of all failure transitions, which means that
 overlapping searches cannot be used. Therefore, if leftmost match semantics are
 used, attempting to do an overlapping search will return an error (or panic
 when using the infallible APIs). Thus, to get overlapping searches, the caller
 must use the default standard match semantics. This behavior was chosen because
 there are only two alternatives, which were deemed worse:

 * Compile two automatons internally, one for standard semantics and one for
   the semantics requested by the caller (if not standard).
 * Create a new type, distinct from the `AhoCorasick` type, which has different
   capabilities based on the configuration options.

 The first is untenable because of the amount of memory used by the automaton.
 The second increases the complexity of the API too much by adding too many
 types that do similar things. It is conceptually much simpler to keep all
 searching isolated to a single type.


 # Stream searching

 Since Aho-Corasick is an automaton, it is possible to do partial searches on
 partial parts of the haystack, and then resume that search on subsequent pieces
 of the haystack. This is useful when the haystack you're trying to search is
 not stored contiguously in memory, or if one does not want to read the entire
 haystack into memory at once.

 Currently, only standard semantics are supported for stream searching. This is
 some of the more complicated code in this crate, and is something I would very
 much like to improve. In particular, it currently has the restriction that it
 must buffer at least enough of the haystack in memory in order to fit the
 longest possible match. The difficulty in getting stream searching right is
 that the implementation choices (such as the buffer size) often impact what the
 API looks like and what it's allowed to do.


 # Prefilters

 In some cases, Aho-Corasick is not the fastest way to find matches containing
 multiple patterns. Sometimes, the search can be accelerated using highly
 optimized SIMD routines. For example, consider searching the following
 patterns:

     Sherlock
     Moriarty
     Watson

 It is plausible that it would be much faster to quickly look for occurrences of
 the leading bytes, `S`, `M` or `W`, before trying to start searching via the
 automaton. Indeed, this is exactly what this crate will do.

 When there are more than three distinct starting bytes, then this crate will
 look for three distinct bytes occurring at any position in the patterns, while
 preferring bytes that are heuristically determined to be rare over others. For
 example:

     Abuzz
     Sanchez
     Vasquez
     Topaz
     Waltz

 Here, we have more than 3 distinct starting bytes, but all of the patterns
 contain `z`, which is typically a rare byte. In this case, the prefilter will
 scan for `z`, back up a bit, and then execute the Aho-Corasick automaton.

 If all of that fails, then a packed multiple substring algorithm will be
 attempted. Currently, the only algorithm available for this is Teddy, but more
 may be added in the future. Teddy is unlike the above prefilters in that it
 confirms its own matches, so when Teddy is active, it might not be necessary
 for Aho-Corasick to run at all. However, the current Teddy implementation
 only works in `x86_64` when SSSE3 or AVX2 are available or in `aarch64`
 (using NEON), and moreover, only works _well_ when there are a small number
 of patterns (say, less than 100). Teddy also requires the haystack to be of a
 certain length (more than 16-34 bytes). When the haystack is shorter than that,
 Rabin-Karp is used instead. (See `src/packed/rabinkarp.rs`.)

 There is a more thorough description of Teddy at
 [`src/packed/teddy/README.md`](src/packed/teddy/README.md).