| /*! |
| Provides [`Automaton`] trait for abstracting over Aho-Corasick automata. |
| |
| The `Automaton` trait provides a way to write generic code over any |
| Aho-Corasick automaton. It also provides access to lower level APIs that |
| permit walking the state transitions of an Aho-Corasick automaton manually. |
| */ |
| |
| use alloc::{string::String, vec::Vec}; |
| |
| use crate::util::{ |
| error::MatchError, |
| primitives::PatternID, |
| search::{Anchored, Input, Match, MatchKind, Span}, |
| }; |
| |
| pub use crate::util::{ |
| prefilter::{Candidate, Prefilter}, |
| primitives::{StateID, StateIDError}, |
| }; |
| |
| /// We seal the `Automaton` trait for now. It's a big trait, and it's |
| /// conceivable that I might want to add new required methods, and sealing the |
| /// trait permits doing that in a backwards compatible fashion. On other the |
| /// hand, if you have a solid use case for implementing the trait yourself, |
| /// please file an issue and we can discuss it. This was *mostly* done as a |
| /// conservative step. |
| pub(crate) mod private { |
| pub trait Sealed {} |
| } |
| impl private::Sealed for crate::nfa::noncontiguous::NFA {} |
| impl private::Sealed for crate::nfa::contiguous::NFA {} |
| impl private::Sealed for crate::dfa::DFA {} |
| |
| impl<'a, T: private::Sealed + ?Sized> private::Sealed for &'a T {} |
| |
| /// A trait that abstracts over Aho-Corasick automata. |
| /// |
| /// This trait primarily exists for niche use cases such as: |
| /// |
| /// * Using an NFA or DFA directly, bypassing the top-level |
| /// [`AhoCorasick`](crate::AhoCorasick) searcher. Currently, these include |
| /// [`noncontiguous::NFA`](crate::nfa::noncontiguous::NFA), |
| /// [`contiguous::NFA`](crate::nfa::contiguous::NFA) and |
| /// [`dfa::DFA`](crate::dfa::DFA). |
| /// * Implementing your own custom search routine by walking the automaton |
| /// yourself. This might be useful for implementing search on non-contiguous |
| /// strings or streams. |
| /// |
| /// For most use cases, it is not expected that users will need |
| /// to use or even know about this trait. Indeed, the top level |
| /// [`AhoCorasick`](crate::AhoCorasick) searcher does not expose any details |
| /// about this trait, nor does it implement it itself. |
| /// |
| /// Note that this trait defines a number of default methods, such as |
| /// [`Automaton::try_find`] and [`Automaton::try_find_iter`], which implement |
| /// higher level search routines in terms of the lower level automata API. |
| /// |
| /// # Sealed |
| /// |
| /// Currently, this trait is sealed. That means users of this crate can write |
| /// generic routines over this trait but cannot implement it themselves. This |
| /// restriction may be lifted in the future, but sealing the trait permits |
| /// adding new required methods in a backwards compatible fashion. |
| /// |
| /// # Special states |
| /// |
| /// This trait encodes a notion of "special" states in an automaton. Namely, |
| /// a state is treated as special if it is a dead, match or start state: |
| /// |
| /// * A dead state is a state that cannot be left once entered. All transitions |
| /// on a dead state lead back to itself. The dead state is meant to be treated |
| /// as a sentinel indicating that the search should stop and return a match if |
| /// one has been found, and nothing otherwise. |
| /// * A match state is a state that indicates one or more patterns have |
| /// matched. Depending on the [`MatchKind`] of the automaton, a search may |
| /// stop once a match is seen, or it may continue looking for matches until |
| /// it enters a dead state or sees the end of the haystack. |
| /// * A start state is a state that a search begins in. It is useful to know |
| /// when a search enters a start state because it may mean that a prefilter can |
| /// be used to skip ahead and quickly look for candidate matches. Unlike dead |
| /// and match states, it is never necessary to explicitly handle start states |
| /// for correctness. Indeed, in this crate, implementations of `Automaton` |
| /// will only treat start states as "special" when a prefilter is enabled and |
| /// active. Otherwise, treating it as special has no purpose and winds up |
| /// slowing down the overall search because it results in ping-ponging between |
| /// the main state transition and the "special" state logic. |
| /// |
| /// Since checking whether a state is special by doing three different |
| /// checks would be too expensive inside a fast search loop, the |
| /// [`Automaton::is_special`] method is provided for quickly checking whether |
| /// the state is special. The `Automaton::is_dead`, `Automaton::is_match` and |
| /// `Automaton::is_start` predicates can then be used to determine which kind |
| /// of special state it is. |
| /// |
| /// # Panics |
| /// |
| /// Most of the APIs on this trait should panic or give incorrect results |
| /// if invalid inputs are given to it. For example, `Automaton::next_state` |
| /// has unspecified behavior if the state ID given to it is not a valid |
| /// state ID for the underlying automaton. Valid state IDs can only be |
| /// retrieved in one of two ways: calling `Automaton::start_state` or calling |
| /// `Automaton::next_state` with a valid state ID. |
| /// |
| /// # Safety |
| /// |
| /// This trait is not safe to implement so that code may rely on the |
| /// correctness of implementations of this trait to avoid undefined behavior. |
| /// The primary correctness guarantees are: |
| /// |
| /// * `Automaton::start_state` always returns a valid state ID or an error or |
| /// panics. |
| /// * `Automaton::next_state`, when given a valid state ID, always returns |
| /// a valid state ID for all values of `anchored` and `byte`, or otherwise |
| /// panics. |
| /// |
| /// In general, the rest of the methods on `Automaton` need to uphold their |
| /// contracts as well. For example, `Automaton::is_dead` should only returns |
| /// true if the given state ID is actually a dead state. |
| /// |
| /// Note that currently this crate does not rely on the safety property defined |
| /// here to avoid undefined behavior. Instead, this was done to make it |
| /// _possible_ to do in the future. |
| /// |
| /// # Example |
| /// |
| /// This example shows how one might implement a basic but correct search |
| /// routine. We keep things simple by not using prefilters or worrying about |
| /// anchored searches, but do make sure our search is correct for all possible |
| /// [`MatchKind`] semantics. (The comments in the code below note the parts |
| /// that are needed to support certain `MatchKind` semantics.) |
| /// |
| /// ``` |
| /// use aho_corasick::{ |
| /// automaton::Automaton, |
| /// nfa::noncontiguous::NFA, |
| /// Anchored, Match, MatchError, MatchKind, |
| /// }; |
| /// |
| /// // Run an unanchored search for 'aut' in 'haystack'. Return the first match |
| /// // seen according to the automaton's match semantics. This returns an error |
| /// // if the given automaton does not support unanchored searches. |
| /// fn find<A: Automaton>( |
| /// aut: A, |
| /// haystack: &[u8], |
| /// ) -> Result<Option<Match>, MatchError> { |
| /// let mut sid = aut.start_state(Anchored::No)?; |
| /// let mut at = 0; |
| /// let mut mat = None; |
| /// let get_match = |sid, at| { |
| /// let pid = aut.match_pattern(sid, 0); |
| /// let len = aut.pattern_len(pid); |
| /// Match::new(pid, (at - len)..at) |
| /// }; |
| /// // Start states can be match states! |
| /// if aut.is_match(sid) { |
| /// mat = Some(get_match(sid, at)); |
| /// // Standard semantics require matches to be reported as soon as |
| /// // they're seen. Otherwise, we continue until we see a dead state |
| /// // or the end of the haystack. |
| /// if matches!(aut.match_kind(), MatchKind::Standard) { |
| /// return Ok(mat); |
| /// } |
| /// } |
| /// while at < haystack.len() { |
| /// sid = aut.next_state(Anchored::No, sid, haystack[at]); |
| /// if aut.is_special(sid) { |
| /// if aut.is_dead(sid) { |
| /// return Ok(mat); |
| /// } else if aut.is_match(sid) { |
| /// mat = Some(get_match(sid, at + 1)); |
| /// // As above, standard semantics require that we return |
| /// // immediately once a match is found. |
| /// if matches!(aut.match_kind(), MatchKind::Standard) { |
| /// return Ok(mat); |
| /// } |
| /// } |
| /// } |
| /// at += 1; |
| /// } |
| /// Ok(mat) |
| /// } |
| /// |
| /// // Show that it works for standard searches. |
| /// let nfa = NFA::new(&["samwise", "sam"]).unwrap(); |
| /// assert_eq!(Some(Match::must(1, 0..3)), find(&nfa, b"samwise")?); |
| /// |
| /// // But also works when using leftmost-first. Notice how the match result |
| /// // has changed! |
| /// let nfa = NFA::builder() |
| /// .match_kind(MatchKind::LeftmostFirst) |
| /// .build(&["samwise", "sam"]) |
| /// .unwrap(); |
| /// assert_eq!(Some(Match::must(0, 0..7)), find(&nfa, b"samwise")?); |
| /// |
| /// # Ok::<(), Box<dyn std::error::Error>>(()) |
| /// ``` |
| pub unsafe trait Automaton: private::Sealed { |
| /// Returns the starting state for the given anchor mode. |
| /// |
| /// Upon success, the state ID returned is guaranteed to be valid for |
| /// this automaton. |
| /// |
| /// # Errors |
| /// |
| /// This returns an error when the given search configuration is not |
| /// supported by the underlying automaton. For example, if the underlying |
| /// automaton only supports unanchored searches but the given configuration |
| /// was set to an anchored search, then this must return an error. |
| fn start_state(&self, anchored: Anchored) -> Result<StateID, MatchError>; |
| |
| /// Performs a state transition from `sid` for `byte` and returns the next |
| /// state. |
| /// |
| /// `anchored` should be [`Anchored::Yes`] when executing an anchored |
| /// search and [`Anchored::No`] otherwise. For some implementations of |
| /// `Automaton`, it is required to know whether the search is anchored |
| /// or not in order to avoid following failure transitions. Other |
| /// implementations may ignore `anchored` altogether and depend on |
| /// `Automaton::start_state` returning a state that walks a different path |
| /// through the automaton depending on whether the search is anchored or |
| /// not. |
| /// |
| /// # Panics |
| /// |
| /// This routine may panic or return incorrect results when the given state |
| /// ID is invalid. A state ID is valid if and only if: |
| /// |
| /// 1. It came from a call to `Automaton::start_state`, or |
| /// 2. It came from a previous call to `Automaton::next_state` with a |
| /// valid state ID. |
| /// |
| /// Implementations must treat all possible values of `byte` as valid. |
| /// |
| /// Implementations may panic on unsupported values of `anchored`, but are |
| /// not required to do so. |
| fn next_state( |
| &self, |
| anchored: Anchored, |
| sid: StateID, |
| byte: u8, |
| ) -> StateID; |
| |
| /// Returns true if the given ID represents a "special" state. A special |
| /// state is a dead, match or start state. |
| /// |
| /// Note that implementations may choose to return false when the given ID |
| /// corresponds to a start state. Namely, it always correct to treat start |
| /// states as non-special. Implementations must return true for states that |
| /// are dead or contain matches. |
| /// |
| /// This has unspecified behavior when given an invalid state ID. |
| fn is_special(&self, sid: StateID) -> bool; |
| |
| /// Returns true if the given ID represents a dead state. |
| /// |
| /// A dead state is a type of "sink" in a finite state machine. It |
| /// corresponds to a state whose transitions all loop back to itself. That |
| /// is, once entered, it can never be left. In practice, it serves as a |
| /// sentinel indicating that the search should terminate. |
| /// |
| /// This has unspecified behavior when given an invalid state ID. |
| fn is_dead(&self, sid: StateID) -> bool; |
| |
| /// Returns true if the given ID represents a match state. |
| /// |
| /// A match state is always associated with one or more pattern IDs that |
| /// matched at the position in the haystack when the match state was |
| /// entered. When a match state is entered, the match semantics dictate |
| /// whether it should be returned immediately (for `MatchKind::Standard`) |
| /// or if the search should continue (for `MatchKind::LeftmostFirst` and |
| /// `MatchKind::LeftmostLongest`) until a dead state is seen or the end of |
| /// the haystack has been reached. |
| /// |
| /// This has unspecified behavior when given an invalid state ID. |
| fn is_match(&self, sid: StateID) -> bool; |
| |
| /// Returns true if the given ID represents a start state. |
| /// |
| /// While it is never incorrect to ignore start states during a search |
| /// (except for the start of the search of course), knowing whether one has |
| /// entered a start state can be useful for certain classes of performance |
| /// optimizations. For example, if one is in a start state, it may be legal |
| /// to try to skip ahead and look for match candidates more quickly than |
| /// would otherwise be accomplished by walking the automaton. |
| /// |
| /// Implementations of `Automaton` in this crate "unspecialize" start |
| /// states when a prefilter is not active or enabled. In this case, it |
| /// is possible for `Automaton::is_special(sid)` to return false while |
| /// `Automaton::is_start(sid)` returns true. |
| /// |
| /// This has unspecified behavior when given an invalid state ID. |
| fn is_start(&self, sid: StateID) -> bool; |
| |
| /// Returns the match semantics that this automaton was built with. |
| fn match_kind(&self) -> MatchKind; |
| |
| /// Returns the total number of matches for the given state ID. |
| /// |
| /// This has unspecified behavior if the given ID does not refer to a match |
| /// state. |
| fn match_len(&self, sid: StateID) -> usize; |
| |
| /// Returns the pattern ID for the match state given by `sid` at the |
| /// `index` given. |
| /// |
| /// Typically, `index` is only ever greater than `0` when implementing an |
| /// overlapping search. Otherwise, it's likely that your search only cares |
| /// about reporting the first pattern ID in a match state. |
| /// |
| /// This has unspecified behavior if the given ID does not refer to a match |
| /// state, or if the index is greater than or equal to the total number of |
| /// matches in this match state. |
| fn match_pattern(&self, sid: StateID, index: usize) -> PatternID; |
| |
| /// Returns the total number of patterns compiled into this automaton. |
| fn patterns_len(&self) -> usize; |
| |
| /// Returns the length of the pattern for the given ID. |
| /// |
| /// This has unspecified behavior when given an invalid pattern |
| /// ID. A pattern ID is valid if and only if it is less than |
| /// `Automaton::patterns_len`. |
| fn pattern_len(&self, pid: PatternID) -> usize; |
| |
| /// Returns the length, in bytes, of the shortest pattern in this |
| /// automaton. |
| fn min_pattern_len(&self) -> usize; |
| |
| /// Returns the length, in bytes, of the longest pattern in this automaton. |
| fn max_pattern_len(&self) -> usize; |
| |
| /// Returns the heap memory usage, in bytes, used by this automaton. |
| fn memory_usage(&self) -> usize; |
| |
| /// Returns a prefilter, if available, that can be used to accelerate |
| /// searches for this automaton. |
| /// |
| /// The typical way this is used is when the start state is entered during |
| /// a search. When that happens, one can use a prefilter to skip ahead and |
| /// look for candidate matches without having to walk the automaton on the |
| /// bytes between candidates. |
| /// |
| /// Typically a prefilter is only available when there are a small (<100) |
| /// number of patterns built into the automaton. |
| fn prefilter(&self) -> Option<&Prefilter>; |
| |
| /// Executes a non-overlapping search with this automaton using the given |
| /// configuration. |
| /// |
| /// See |
| /// [`AhoCorasick::try_find`](crate::AhoCorasick::try_find) |
| /// for more documentation and examples. |
| fn try_find( |
| &self, |
| input: &Input<'_>, |
| ) -> Result<Option<Match>, MatchError> { |
| try_find_fwd(&self, input) |
| } |
| |
| /// Executes a overlapping search with this automaton using the given |
| /// configuration. |
| /// |
| /// See |
| /// [`AhoCorasick::try_find_overlapping`](crate::AhoCorasick::try_find_overlapping) |
| /// for more documentation and examples. |
| fn try_find_overlapping( |
| &self, |
| input: &Input<'_>, |
| state: &mut OverlappingState, |
| ) -> Result<(), MatchError> { |
| try_find_overlapping_fwd(&self, input, state) |
| } |
| |
| /// Returns an iterator of non-overlapping matches with this automaton |
| /// using the given configuration. |
| /// |
| /// See |
| /// [`AhoCorasick::try_find_iter`](crate::AhoCorasick::try_find_iter) |
| /// for more documentation and examples. |
| fn try_find_iter<'a, 'h>( |
| &'a self, |
| input: Input<'h>, |
| ) -> Result<FindIter<'a, 'h, Self>, MatchError> |
| where |
| Self: Sized, |
| { |
| FindIter::new(self, input) |
| } |
| |
| /// Returns an iterator of overlapping matches with this automaton |
| /// using the given configuration. |
| /// |
| /// See |
| /// [`AhoCorasick::try_find_overlapping_iter`](crate::AhoCorasick::try_find_overlapping_iter) |
| /// for more documentation and examples. |
| fn try_find_overlapping_iter<'a, 'h>( |
| &'a self, |
| input: Input<'h>, |
| ) -> Result<FindOverlappingIter<'a, 'h, Self>, MatchError> |
| where |
| Self: Sized, |
| { |
| if !self.match_kind().is_standard() { |
| return Err(MatchError::unsupported_overlapping( |
| self.match_kind(), |
| )); |
| } |
| // We might consider lifting this restriction. The reason why I added |
| // it was to ban the combination of "anchored search" and "overlapping |
| // iteration." The match semantics aren't totally clear in that case. |
| // Should we allow *any* matches that are adjacent to *any* previous |
| // match? Or only following the most recent one? Or only matches |
| // that start at the beginning of the search? We might also elect to |
| // just keep this restriction in place, as callers should be able to |
| // implement it themselves if they want to. |
| if input.get_anchored().is_anchored() { |
| return Err(MatchError::invalid_input_anchored()); |
| } |
| let _ = self.start_state(input.get_anchored())?; |
| let state = OverlappingState::start(); |
| Ok(FindOverlappingIter { aut: self, input, state }) |
| } |
| |
| /// Replaces all non-overlapping matches in `haystack` with |
| /// strings from `replace_with` depending on the pattern that |
| /// matched. The `replace_with` slice must have length equal to |
| /// `Automaton::patterns_len`. |
| /// |
| /// See |
| /// [`AhoCorasick::try_replace_all`](crate::AhoCorasick::try_replace_all) |
| /// for more documentation and examples. |
| fn try_replace_all<B>( |
| &self, |
| haystack: &str, |
| replace_with: &[B], |
| ) -> Result<String, MatchError> |
| where |
| Self: Sized, |
| B: AsRef<str>, |
| { |
| assert_eq!( |
| replace_with.len(), |
| self.patterns_len(), |
| "replace_all requires a replacement for every pattern \ |
| in the automaton" |
| ); |
| let mut dst = String::with_capacity(haystack.len()); |
| self.try_replace_all_with(haystack, &mut dst, |mat, _, dst| { |
| dst.push_str(replace_with[mat.pattern()].as_ref()); |
| true |
| })?; |
| Ok(dst) |
| } |
| |
| /// Replaces all non-overlapping matches in `haystack` with |
| /// strings from `replace_with` depending on the pattern that |
| /// matched. The `replace_with` slice must have length equal to |
| /// `Automaton::patterns_len`. |
| /// |
| /// See |
| /// [`AhoCorasick::try_replace_all_bytes`](crate::AhoCorasick::try_replace_all_bytes) |
| /// for more documentation and examples. |
| fn try_replace_all_bytes<B>( |
| &self, |
| haystack: &[u8], |
| replace_with: &[B], |
| ) -> Result<Vec<u8>, MatchError> |
| where |
| Self: Sized, |
| B: AsRef<[u8]>, |
| { |
| assert_eq!( |
| replace_with.len(), |
| self.patterns_len(), |
| "replace_all requires a replacement for every pattern \ |
| in the automaton" |
| ); |
| let mut dst = Vec::with_capacity(haystack.len()); |
| self.try_replace_all_with_bytes(haystack, &mut dst, |mat, _, dst| { |
| dst.extend(replace_with[mat.pattern()].as_ref()); |
| true |
| })?; |
| Ok(dst) |
| } |
| |
| /// Replaces all non-overlapping matches in `haystack` by calling the |
| /// `replace_with` closure given. |
| /// |
| /// See |
| /// [`AhoCorasick::try_replace_all_with`](crate::AhoCorasick::try_replace_all_with) |
| /// for more documentation and examples. |
| fn try_replace_all_with<F>( |
| &self, |
| haystack: &str, |
| dst: &mut String, |
| mut replace_with: F, |
| ) -> Result<(), MatchError> |
| where |
| Self: Sized, |
| F: FnMut(&Match, &str, &mut String) -> bool, |
| { |
| let mut last_match = 0; |
| for m in self.try_find_iter(Input::new(haystack))? { |
| // Since there are no restrictions on what kinds of patterns are |
| // in an Aho-Corasick automaton, we might get matches that split |
| // a codepoint, or even matches of a partial codepoint. When that |
| // happens, we just skip the match. |
| if !haystack.is_char_boundary(m.start()) |
| || !haystack.is_char_boundary(m.end()) |
| { |
| continue; |
| } |
| dst.push_str(&haystack[last_match..m.start()]); |
| last_match = m.end(); |
| if !replace_with(&m, &haystack[m.start()..m.end()], dst) { |
| break; |
| }; |
| } |
| dst.push_str(&haystack[last_match..]); |
| Ok(()) |
| } |
| |
| /// Replaces all non-overlapping matches in `haystack` by calling the |
| /// `replace_with` closure given. |
| /// |
| /// See |
| /// [`AhoCorasick::try_replace_all_with_bytes`](crate::AhoCorasick::try_replace_all_with_bytes) |
| /// for more documentation and examples. |
| fn try_replace_all_with_bytes<F>( |
| &self, |
| haystack: &[u8], |
| dst: &mut Vec<u8>, |
| mut replace_with: F, |
| ) -> Result<(), MatchError> |
| where |
| Self: Sized, |
| F: FnMut(&Match, &[u8], &mut Vec<u8>) -> bool, |
| { |
| let mut last_match = 0; |
| for m in self.try_find_iter(Input::new(haystack))? { |
| dst.extend(&haystack[last_match..m.start()]); |
| last_match = m.end(); |
| if !replace_with(&m, &haystack[m.start()..m.end()], dst) { |
| break; |
| }; |
| } |
| dst.extend(&haystack[last_match..]); |
| Ok(()) |
| } |
| |
| /// Returns an iterator of non-overlapping matches with this automaton |
| /// from the stream given. |
| /// |
| /// See |
| /// [`AhoCorasick::try_stream_find_iter`](crate::AhoCorasick::try_stream_find_iter) |
| /// for more documentation and examples. |
| #[cfg(feature = "std")] |
| fn try_stream_find_iter<'a, R: std::io::Read>( |
| &'a self, |
| rdr: R, |
| ) -> Result<StreamFindIter<'a, Self, R>, MatchError> |
| where |
| Self: Sized, |
| { |
| Ok(StreamFindIter { it: StreamChunkIter::new(self, rdr)? }) |
| } |
| |
| /// Replaces all non-overlapping matches in `rdr` with strings from |
| /// `replace_with` depending on the pattern that matched, and writes the |
| /// result to `wtr`. The `replace_with` slice must have length equal to |
| /// `Automaton::patterns_len`. |
| /// |
| /// See |
| /// [`AhoCorasick::try_stream_replace_all`](crate::AhoCorasick::try_stream_replace_all) |
| /// for more documentation and examples. |
| #[cfg(feature = "std")] |
| fn try_stream_replace_all<R, W, B>( |
| &self, |
| rdr: R, |
| wtr: W, |
| replace_with: &[B], |
| ) -> std::io::Result<()> |
| where |
| Self: Sized, |
| R: std::io::Read, |
| W: std::io::Write, |
| B: AsRef<[u8]>, |
| { |
| assert_eq!( |
| replace_with.len(), |
| self.patterns_len(), |
| "streaming replace_all requires a replacement for every pattern \ |
| in the automaton", |
| ); |
| self.try_stream_replace_all_with(rdr, wtr, |mat, _, wtr| { |
| wtr.write_all(replace_with[mat.pattern()].as_ref()) |
| }) |
| } |
| |
| /// Replaces all non-overlapping matches in `rdr` by calling the |
| /// `replace_with` closure given and writing the result to `wtr`. |
| /// |
| /// See |
| /// [`AhoCorasick::try_stream_replace_all_with`](crate::AhoCorasick::try_stream_replace_all_with) |
| /// for more documentation and examples. |
| #[cfg(feature = "std")] |
| fn try_stream_replace_all_with<R, W, F>( |
| &self, |
| rdr: R, |
| mut wtr: W, |
| mut replace_with: F, |
| ) -> std::io::Result<()> |
| where |
| Self: Sized, |
| R: std::io::Read, |
| W: std::io::Write, |
| F: FnMut(&Match, &[u8], &mut W) -> std::io::Result<()>, |
| { |
| let mut it = StreamChunkIter::new(self, rdr).map_err(|e| { |
| let kind = std::io::ErrorKind::Other; |
| std::io::Error::new(kind, e) |
| })?; |
| while let Some(result) = it.next() { |
| let chunk = result?; |
| match chunk { |
| StreamChunk::NonMatch { bytes, .. } => { |
| wtr.write_all(bytes)?; |
| } |
| StreamChunk::Match { bytes, mat } => { |
| replace_with(&mat, bytes, &mut wtr)?; |
| } |
| } |
| } |
| Ok(()) |
| } |
| } |
| |
| // SAFETY: This just defers to the underlying 'AcAutomaton' and thus inherits |
| // its safety properties. |
| unsafe impl<'a, A: Automaton + ?Sized> Automaton for &'a A { |
| #[inline(always)] |
| fn start_state(&self, anchored: Anchored) -> Result<StateID, MatchError> { |
| (**self).start_state(anchored) |
| } |
| |
| #[inline(always)] |
| fn next_state( |
| &self, |
| anchored: Anchored, |
| sid: StateID, |
| byte: u8, |
| ) -> StateID { |
| (**self).next_state(anchored, sid, byte) |
| } |
| |
| #[inline(always)] |
| fn is_special(&self, sid: StateID) -> bool { |
| (**self).is_special(sid) |
| } |
| |
| #[inline(always)] |
| fn is_dead(&self, sid: StateID) -> bool { |
| (**self).is_dead(sid) |
| } |
| |
| #[inline(always)] |
| fn is_match(&self, sid: StateID) -> bool { |
| (**self).is_match(sid) |
| } |
| |
| #[inline(always)] |
| fn is_start(&self, sid: StateID) -> bool { |
| (**self).is_start(sid) |
| } |
| |
| #[inline(always)] |
| fn match_kind(&self) -> MatchKind { |
| (**self).match_kind() |
| } |
| |
| #[inline(always)] |
| fn match_len(&self, sid: StateID) -> usize { |
| (**self).match_len(sid) |
| } |
| |
| #[inline(always)] |
| fn match_pattern(&self, sid: StateID, index: usize) -> PatternID { |
| (**self).match_pattern(sid, index) |
| } |
| |
| #[inline(always)] |
| fn patterns_len(&self) -> usize { |
| (**self).patterns_len() |
| } |
| |
| #[inline(always)] |
| fn pattern_len(&self, pid: PatternID) -> usize { |
| (**self).pattern_len(pid) |
| } |
| |
| #[inline(always)] |
| fn min_pattern_len(&self) -> usize { |
| (**self).min_pattern_len() |
| } |
| |
| #[inline(always)] |
| fn max_pattern_len(&self) -> usize { |
| (**self).max_pattern_len() |
| } |
| |
| #[inline(always)] |
| fn memory_usage(&self) -> usize { |
| (**self).memory_usage() |
| } |
| |
| #[inline(always)] |
| fn prefilter(&self) -> Option<&Prefilter> { |
| (**self).prefilter() |
| } |
| } |
| |
| /// Represents the current state of an overlapping search. |
| /// |
| /// This is used for overlapping searches since they need to know something |
| /// about the previous search. For example, when multiple patterns match at the |
| /// same position, this state tracks the last reported pattern so that the next |
| /// search knows whether to report another matching pattern or continue with |
| /// the search at the next position. Additionally, it also tracks which state |
| /// the last search call terminated in and the current offset of the search |
| /// in the haystack. |
| /// |
| /// This type provides limited introspection capabilities. The only thing a |
| /// caller can do is construct it and pass it around to permit search routines |
| /// to use it to track state, and to ask whether a match has been found. |
| /// |
| /// Callers should always provide a fresh state constructed via |
| /// [`OverlappingState::start`] when starting a new search. That same state |
| /// should be reused for subsequent searches on the same `Input`. The state |
| /// given will advance through the haystack itself. Callers can detect the end |
| /// of a search when neither an error nor a match is returned. |
| /// |
| /// # Example |
| /// |
| /// This example shows how to manually iterate over all overlapping matches. If |
| /// you need this, you might consider using |
| /// [`AhoCorasick::find_overlapping_iter`](crate::AhoCorasick::find_overlapping_iter) |
| /// instead, but this shows how to correctly use an `OverlappingState`. |
| /// |
| /// ``` |
| /// use aho_corasick::{ |
| /// automaton::OverlappingState, |
| /// AhoCorasick, Input, Match, |
| /// }; |
| /// |
| /// let patterns = &["append", "appendage", "app"]; |
| /// let haystack = "append the app to the appendage"; |
| /// |
| /// let ac = AhoCorasick::new(patterns).unwrap(); |
| /// let mut state = OverlappingState::start(); |
| /// let mut matches = vec![]; |
| /// |
| /// loop { |
| /// ac.find_overlapping(haystack, &mut state); |
| /// let mat = match state.get_match() { |
| /// None => break, |
| /// Some(mat) => mat, |
| /// }; |
| /// matches.push(mat); |
| /// } |
| /// let expected = vec![ |
| /// Match::must(2, 0..3), |
| /// Match::must(0, 0..6), |
| /// Match::must(2, 11..14), |
| /// Match::must(2, 22..25), |
| /// Match::must(0, 22..28), |
| /// Match::must(1, 22..31), |
| /// ]; |
| /// assert_eq!(expected, matches); |
| /// ``` |
| #[derive(Clone, Debug)] |
| pub struct OverlappingState { |
| /// The match reported by the most recent overlapping search to use this |
| /// state. |
| /// |
| /// If a search does not find any matches, then it is expected to clear |
| /// this value. |
| mat: Option<Match>, |
| /// The state ID of the state at which the search was in when the call |
| /// terminated. When this is a match state, `last_match` must be set to a |
| /// non-None value. |
| /// |
| /// A `None` value indicates the start state of the corresponding |
| /// automaton. We cannot use the actual ID, since any one automaton may |
| /// have many start states, and which one is in use depends on search-time |
| /// factors (such as whether the search is anchored or not). |
| id: Option<StateID>, |
| /// The position of the search. |
| /// |
| /// When `id` is None (i.e., we are starting a search), this is set to |
| /// the beginning of the search as given by the caller regardless of its |
| /// current value. Subsequent calls to an overlapping search pick up at |
| /// this offset. |
| at: usize, |
| /// The index into the matching patterns of the next match to report if the |
| /// current state is a match state. Note that this may be 1 greater than |
| /// the total number of matches to report for the current match state. (In |
| /// which case, no more matches should be reported at the current position |
| /// and the search should advance to the next position.) |
| next_match_index: Option<usize>, |
| } |
| |
| impl OverlappingState { |
| /// Create a new overlapping state that begins at the start state. |
| pub fn start() -> OverlappingState { |
| OverlappingState { mat: None, id: None, at: 0, next_match_index: None } |
| } |
| |
| /// Return the match result of the most recent search to execute with this |
| /// state. |
| /// |
| /// Every search will clear this result automatically, such that if no |
| /// match is found, this will always correctly report `None`. |
| pub fn get_match(&self) -> Option<Match> { |
| self.mat |
| } |
| } |
| |
| /// An iterator of non-overlapping matches in a particular haystack. |
| /// |
| /// This iterator yields matches according to the [`MatchKind`] used by this |
| /// automaton. |
| /// |
| /// This iterator is constructed via the [`Automaton::try_find_iter`] method. |
| /// |
| /// The type variable `A` refers to the implementation of the [`Automaton`] |
| /// trait used to execute the search. |
| /// |
| /// The lifetime `'a` refers to the lifetime of the [`Automaton`] |
| /// implementation. |
| /// |
| /// The lifetime `'h` refers to the lifetime of the haystack being searched. |
| #[derive(Debug)] |
| pub struct FindIter<'a, 'h, A> { |
| /// The automaton used to drive the search. |
| aut: &'a A, |
| /// The input parameters to give to each search call. |
| /// |
| /// The start position of the search is mutated during iteration. |
| input: Input<'h>, |
| /// Records the end offset of the most recent match. This is necessary to |
| /// handle a corner case for preventing empty matches from overlapping with |
| /// the ending bounds of a prior match. |
| last_match_end: Option<usize>, |
| } |
| |
| impl<'a, 'h, A: Automaton> FindIter<'a, 'h, A> { |
| /// Creates a new non-overlapping iterator. If the given automaton would |
| /// return an error on a search with the given input configuration, then |
| /// that error is returned here. |
| fn new( |
| aut: &'a A, |
| input: Input<'h>, |
| ) -> Result<FindIter<'a, 'h, A>, MatchError> { |
| // The only way this search can fail is if we cannot retrieve the start |
| // state. e.g., Asking for an anchored search when only unanchored |
| // searches are supported. |
| let _ = aut.start_state(input.get_anchored())?; |
| Ok(FindIter { aut, input, last_match_end: None }) |
| } |
| |
| /// Executes a search and returns a match if one is found. |
| /// |
| /// This does not advance the input forward. It just executes a search |
| /// based on the current configuration/offsets. |
| fn search(&self) -> Option<Match> { |
| // The unwrap is OK here because we check at iterator construction time |
| // that no subsequent search call (using the same configuration) will |
| // ever return an error. |
| self.aut |
| .try_find(&self.input) |
| .expect("already checked that no match error can occur") |
| } |
| |
| /// Handles the special case of an empty match by ensuring that 1) the |
| /// iterator always advances and 2) empty matches never overlap with other |
| /// matches. |
| /// |
| /// (1) is necessary because we principally make progress by setting the |
| /// starting location of the next search to the ending location of the last |
| /// match. But if a match is empty, then this results in a search that does |
| /// not advance and thus does not terminate. |
| /// |
| /// (2) is not strictly necessary, but makes intuitive sense and matches |
| /// the presiding behavior of most general purpose regex engines. |
| /// (Obviously this crate isn't a regex engine, but we choose to match |
| /// their semantics.) The "intuitive sense" here is that we want to report |
| /// NON-overlapping matches. So for example, given the patterns 'a' and |
| /// '' (an empty string) against the haystack 'a', without the special |
| /// handling, you'd get the matches [0, 1) and [1, 1), where the latter |
| /// overlaps with the end bounds of the former. |
| /// |
| /// Note that we mark this cold and forcefully prevent inlining because |
| /// handling empty matches like this is extremely rare and does require |
| /// quite a bit of code, comparatively. Keeping this code out of the main |
| /// iterator function keeps it smaller and more amenable to inlining |
| /// itself. |
| #[cold] |
| #[inline(never)] |
| fn handle_overlapping_empty_match( |
| &mut self, |
| mut m: Match, |
| ) -> Option<Match> { |
| assert!(m.is_empty()); |
| if Some(m.end()) == self.last_match_end { |
| self.input.set_start(self.input.start().checked_add(1).unwrap()); |
| m = self.search()?; |
| } |
| Some(m) |
| } |
| } |
| |
| impl<'a, 'h, A: Automaton> Iterator for FindIter<'a, 'h, A> { |
| type Item = Match; |
| |
| #[inline(always)] |
| fn next(&mut self) -> Option<Match> { |
| let mut m = self.search()?; |
| if m.is_empty() { |
| m = self.handle_overlapping_empty_match(m)?; |
| } |
| self.input.set_start(m.end()); |
| self.last_match_end = Some(m.end()); |
| Some(m) |
| } |
| } |
| |
| /// An iterator of overlapping matches in a particular haystack. |
| /// |
| /// This iterator will report all possible matches in a particular haystack, |
| /// even when the matches overlap. |
| /// |
| /// This iterator is constructed via the |
| /// [`Automaton::try_find_overlapping_iter`] method. |
| /// |
| /// The type variable `A` refers to the implementation of the [`Automaton`] |
| /// trait used to execute the search. |
| /// |
| /// The lifetime `'a` refers to the lifetime of the [`Automaton`] |
| /// implementation. |
| /// |
| /// The lifetime `'h` refers to the lifetime of the haystack being searched. |
| #[derive(Debug)] |
| pub struct FindOverlappingIter<'a, 'h, A> { |
| aut: &'a A, |
| input: Input<'h>, |
| state: OverlappingState, |
| } |
| |
| impl<'a, 'h, A: Automaton> Iterator for FindOverlappingIter<'a, 'h, A> { |
| type Item = Match; |
| |
| #[inline(always)] |
| fn next(&mut self) -> Option<Match> { |
| self.aut |
| .try_find_overlapping(&self.input, &mut self.state) |
| .expect("already checked that no match error can occur here"); |
| self.state.get_match() |
| } |
| } |
| |
| /// An iterator that reports matches in a stream. |
| /// |
| /// This iterator yields elements of type `io::Result<Match>`, where an error |
| /// is reported if there was a problem reading from the underlying stream. |
| /// The iterator terminates only when the underlying stream reaches `EOF`. |
| /// |
| /// This iterator is constructed via the [`Automaton::try_stream_find_iter`] |
| /// method. |
| /// |
| /// The type variable `A` refers to the implementation of the [`Automaton`] |
| /// trait used to execute the search. |
| /// |
| /// The type variable `R` refers to the `io::Read` stream that is being read |
| /// from. |
| /// |
| /// The lifetime `'a` refers to the lifetime of the [`Automaton`] |
| /// implementation. |
| #[cfg(feature = "std")] |
| #[derive(Debug)] |
| pub struct StreamFindIter<'a, A, R> { |
| it: StreamChunkIter<'a, A, R>, |
| } |
| |
| #[cfg(feature = "std")] |
| impl<'a, A: Automaton, R: std::io::Read> Iterator |
| for StreamFindIter<'a, A, R> |
| { |
| type Item = std::io::Result<Match>; |
| |
| fn next(&mut self) -> Option<std::io::Result<Match>> { |
| loop { |
| match self.it.next() { |
| None => return None, |
| Some(Err(err)) => return Some(Err(err)), |
| Some(Ok(StreamChunk::NonMatch { .. })) => {} |
| Some(Ok(StreamChunk::Match { mat, .. })) => { |
| return Some(Ok(mat)); |
| } |
| } |
| } |
| } |
| } |
| |
| /// An iterator that reports matches in a stream. |
| /// |
| /// (This doesn't actually implement the `Iterator` trait because it returns |
| /// something with a lifetime attached to a buffer it owns, but that's OK. It |
| /// still has a `next` method and is iterator-like enough to be fine.) |
| /// |
| /// This iterator yields elements of type `io::Result<StreamChunk>`, where |
| /// an error is reported if there was a problem reading from the underlying |
| /// stream. The iterator terminates only when the underlying stream reaches |
| /// `EOF`. |
| /// |
| /// The idea here is that each chunk represents either a match or a non-match, |
| /// and if you concatenated all of the chunks together, you'd reproduce the |
| /// entire contents of the stream, byte-for-byte. |
| /// |
| /// This chunk machinery is a bit complicated and it isn't strictly required |
| /// for a stream searcher that just reports matches. But we do need something |
| /// like this to deal with the "replacement" API, which needs to know which |
| /// chunks it can copy and which it needs to replace. |
| #[cfg(feature = "std")] |
| #[derive(Debug)] |
| struct StreamChunkIter<'a, A, R> { |
| /// The underlying automaton to do the search. |
| aut: &'a A, |
| /// The source of bytes we read from. |
| rdr: R, |
| /// A roll buffer for managing bytes from `rdr`. Basically, this is used |
| /// to handle the case of a match that is split by two different |
| /// calls to `rdr.read()`. This isn't strictly needed if all we needed to |
| /// do was report matches, but here we are reporting chunks of non-matches |
| /// and matches and in order to do that, we really just cannot treat our |
| /// stream as non-overlapping blocks of bytes. We need to permit some |
| /// overlap while we retain bytes from a previous `read` call in memory. |
| buf: crate::util::buffer::Buffer, |
| /// The unanchored starting state of this automaton. |
| start: StateID, |
| /// The state of the automaton. |
| sid: StateID, |
| /// The absolute position over the entire stream. |
| absolute_pos: usize, |
| /// The position we're currently at within `buf`. |
| buffer_pos: usize, |
| /// The buffer position of the end of the bytes that we last returned |
| /// to the caller. Basically, whenever we find a match, we look to see if |
| /// there is a difference between where the match started and the position |
| /// of the last byte we returned to the caller. If there's a difference, |
| /// then we need to return a 'NonMatch' chunk. |
| buffer_reported_pos: usize, |
| } |
| |
| #[cfg(feature = "std")] |
| impl<'a, A: Automaton, R: std::io::Read> StreamChunkIter<'a, A, R> { |
| fn new( |
| aut: &'a A, |
| rdr: R, |
| ) -> Result<StreamChunkIter<'a, A, R>, MatchError> { |
| // This restriction is a carry-over from older versions of this crate. |
| // I didn't have the bandwidth to think through how to handle, say, |
| // leftmost-first or leftmost-longest matching, but... it should be |
| // possible? The main problem is that once you see a match state in |
| // leftmost-first semantics, you can't just stop at that point and |
| // report a match. You have to keep going until you either hit a dead |
| // state or EOF. So how do you know when you'll hit a dead state? Well, |
| // you don't. With Aho-Corasick, I believe you can put a bound on it |
| // and say, "once a match has been seen, you'll need to scan forward at |
| // most N bytes" where N=aut.max_pattern_len(). |
| // |
| // Which is fine, but it does mean that state about whether we're still |
| // looking for a dead state or not needs to persist across buffer |
| // refills. Which this code doesn't really handle. It does preserve |
| // *some* state across buffer refills, basically ensuring that a match |
| // span is always in memory. |
| if !aut.match_kind().is_standard() { |
| return Err(MatchError::unsupported_stream(aut.match_kind())); |
| } |
| // This is kind of a cop-out, but empty matches are SUPER annoying. |
| // If we know they can't happen (which is what we enforce here), then |
| // it makes a lot of logic much simpler. With that said, I'm open to |
| // supporting this case, but we need to define proper semantics for it |
| // first. It wasn't totally clear to me what it should do at the time |
| // of writing, so I decided to just be conservative. |
| // |
| // It also seems like a very weird case to support anyway. Why search a |
| // stream if you're just going to get a match at every position? |
| // |
| // ¯\_(ツ)_/¯ |
| if aut.min_pattern_len() == 0 { |
| return Err(MatchError::unsupported_empty()); |
| } |
| let start = aut.start_state(Anchored::No)?; |
| Ok(StreamChunkIter { |
| aut, |
| rdr, |
| buf: crate::util::buffer::Buffer::new(aut.max_pattern_len()), |
| start, |
| sid: start, |
| absolute_pos: 0, |
| buffer_pos: 0, |
| buffer_reported_pos: 0, |
| }) |
| } |
| |
| fn next(&mut self) -> Option<std::io::Result<StreamChunk>> { |
| // This code is pretty gnarly. It IS simpler than the equivalent code |
| // in the previous aho-corasick release, in part because we inline |
| // automaton traversal here and also in part because we have abdicated |
| // support for automatons that contain an empty pattern. |
| // |
| // I suspect this code could be made a bit simpler by designing a |
| // better buffer abstraction. |
| // |
| // But in general, this code is basically write-only. So you'll need |
| // to go through it step-by-step to grok it. One of the key bits of |
| // complexity is tracking a few different offsets. 'buffer_pos' is |
| // where we are in the buffer for search. 'buffer_reported_pos' is the |
| // position immediately following the last byte in the buffer that |
| // we've returned to the caller. And 'absolute_pos' is the overall |
| // current absolute position of the search in the entire stream, and |
| // this is what match spans are reported in terms of. |
| loop { |
| if self.aut.is_match(self.sid) { |
| let mat = self.get_match(); |
| if let Some(r) = self.get_non_match_chunk(mat) { |
| self.buffer_reported_pos += r.len(); |
| let bytes = &self.buf.buffer()[r]; |
| return Some(Ok(StreamChunk::NonMatch { bytes })); |
| } |
| self.sid = self.start; |
| let r = self.get_match_chunk(mat); |
| self.buffer_reported_pos += r.len(); |
| let bytes = &self.buf.buffer()[r]; |
| return Some(Ok(StreamChunk::Match { bytes, mat })); |
| } |
| if self.buffer_pos >= self.buf.buffer().len() { |
| if let Some(r) = self.get_pre_roll_non_match_chunk() { |
| self.buffer_reported_pos += r.len(); |
| let bytes = &self.buf.buffer()[r]; |
| return Some(Ok(StreamChunk::NonMatch { bytes })); |
| } |
| if self.buf.buffer().len() >= self.buf.min_buffer_len() { |
| self.buffer_pos = self.buf.min_buffer_len(); |
| self.buffer_reported_pos -= |
| self.buf.buffer().len() - self.buf.min_buffer_len(); |
| self.buf.roll(); |
| } |
| match self.buf.fill(&mut self.rdr) { |
| Err(err) => return Some(Err(err)), |
| Ok(true) => {} |
| Ok(false) => { |
| // We've hit EOF, but if there are still some |
| // unreported bytes remaining, return them now. |
| if let Some(r) = self.get_eof_non_match_chunk() { |
| self.buffer_reported_pos += r.len(); |
| let bytes = &self.buf.buffer()[r]; |
| return Some(Ok(StreamChunk::NonMatch { bytes })); |
| } |
| // We've reported everything! |
| return None; |
| } |
| } |
| } |
| let start = self.absolute_pos; |
| for &byte in self.buf.buffer()[self.buffer_pos..].iter() { |
| self.sid = self.aut.next_state(Anchored::No, self.sid, byte); |
| self.absolute_pos += 1; |
| if self.aut.is_match(self.sid) { |
| break; |
| } |
| } |
| self.buffer_pos += self.absolute_pos - start; |
| } |
| } |
| |
| /// Return a match chunk for the given match. It is assumed that the match |
| /// ends at the current `buffer_pos`. |
| fn get_match_chunk(&self, mat: Match) -> core::ops::Range<usize> { |
| let start = self.buffer_pos - mat.len(); |
| let end = self.buffer_pos; |
| start..end |
| } |
| |
| /// Return a non-match chunk, if necessary, just before reporting a match. |
| /// This returns `None` if there is nothing to report. Otherwise, this |
| /// assumes that the given match ends at the current `buffer_pos`. |
| fn get_non_match_chunk( |
| &self, |
| mat: Match, |
| ) -> Option<core::ops::Range<usize>> { |
| let buffer_mat_start = self.buffer_pos - mat.len(); |
| if buffer_mat_start > self.buffer_reported_pos { |
| let start = self.buffer_reported_pos; |
| let end = buffer_mat_start; |
| return Some(start..end); |
| } |
| None |
| } |
| |
| /// Look for any bytes that should be reported as a non-match just before |
| /// rolling the buffer. |
| /// |
| /// Note that this only reports bytes up to `buffer.len() - |
| /// min_buffer_len`, as it's not possible to know whether the bytes |
| /// following that will participate in a match or not. |
| fn get_pre_roll_non_match_chunk(&self) -> Option<core::ops::Range<usize>> { |
| let end = |
| self.buf.buffer().len().saturating_sub(self.buf.min_buffer_len()); |
| if self.buffer_reported_pos < end { |
| return Some(self.buffer_reported_pos..end); |
| } |
| None |
| } |
| |
| /// Return any unreported bytes as a non-match up to the end of the buffer. |
| /// |
| /// This should only be called when the entire contents of the buffer have |
| /// been searched and EOF has been hit when trying to fill the buffer. |
| fn get_eof_non_match_chunk(&self) -> Option<core::ops::Range<usize>> { |
| if self.buffer_reported_pos < self.buf.buffer().len() { |
| return Some(self.buffer_reported_pos..self.buf.buffer().len()); |
| } |
| None |
| } |
| |
| /// Return the match at the current position for the current state. |
| /// |
| /// This panics if `self.aut.is_match(self.sid)` isn't true. |
| fn get_match(&self) -> Match { |
| get_match(self.aut, self.sid, 0, self.absolute_pos) |
| } |
| } |
| |
| /// A single chunk yielded by the stream chunk iterator. |
| /// |
| /// The `'r` lifetime refers to the lifetime of the stream chunk iterator. |
| #[cfg(feature = "std")] |
| #[derive(Debug)] |
| enum StreamChunk<'r> { |
| /// A chunk that does not contain any matches. |
| NonMatch { bytes: &'r [u8] }, |
| /// A chunk that precisely contains a match. |
| Match { bytes: &'r [u8], mat: Match }, |
| } |
| |
| #[inline(never)] |
| pub(crate) fn try_find_fwd<A: Automaton + ?Sized>( |
| aut: &A, |
| input: &Input<'_>, |
| ) -> Result<Option<Match>, MatchError> { |
| if input.is_done() { |
| return Ok(None); |
| } |
| let earliest = aut.match_kind().is_standard() || input.get_earliest(); |
| if input.get_anchored().is_anchored() { |
| try_find_fwd_imp(aut, input, None, Anchored::Yes, earliest) |
| } else if let Some(pre) = aut.prefilter() { |
| if earliest { |
| try_find_fwd_imp(aut, input, Some(pre), Anchored::No, true) |
| } else { |
| try_find_fwd_imp(aut, input, Some(pre), Anchored::No, false) |
| } |
| } else { |
| if earliest { |
| try_find_fwd_imp(aut, input, None, Anchored::No, true) |
| } else { |
| try_find_fwd_imp(aut, input, None, Anchored::No, false) |
| } |
| } |
| } |
| |
| #[inline(always)] |
| fn try_find_fwd_imp<A: Automaton + ?Sized>( |
| aut: &A, |
| input: &Input<'_>, |
| pre: Option<&Prefilter>, |
| anchored: Anchored, |
| earliest: bool, |
| ) -> Result<Option<Match>, MatchError> { |
| let mut sid = aut.start_state(input.get_anchored())?; |
| let mut at = input.start(); |
| let mut mat = None; |
| if aut.is_match(sid) { |
| mat = Some(get_match(aut, sid, 0, at)); |
| if earliest { |
| return Ok(mat); |
| } |
| } |
| if let Some(pre) = pre { |
| match pre.find_in(input.haystack(), input.get_span()) { |
| Candidate::None => return Ok(None), |
| Candidate::Match(m) => return Ok(Some(m)), |
| Candidate::PossibleStartOfMatch(i) => { |
| at = i; |
| } |
| } |
| } |
| while at < input.end() { |
| // I've tried unrolling this loop and eliding bounds checks, but no |
| // matter what I did, I could not observe a consistent improvement on |
| // any benchmark I could devise. (If someone wants to re-litigate this, |
| // the way to do it is to add an 'next_state_unchecked' method to the |
| // 'Automaton' trait with a default impl that uses 'next_state'. Then |
| // use 'aut.next_state_unchecked' here and implement it on DFA using |
| // unchecked slice index acces.) |
| sid = aut.next_state(anchored, sid, input.haystack()[at]); |
| if aut.is_special(sid) { |
| if aut.is_dead(sid) { |
| return Ok(mat); |
| } else if aut.is_match(sid) { |
| // We use 'at + 1' here because the match state is entered |
| // at the last byte of the pattern. Since we use half-open |
| // intervals, the end of the range of the match is one past the |
| // last byte. |
| let m = get_match(aut, sid, 0, at + 1); |
| // For the automata in this crate, we make a size trade off |
| // where we reuse the same automaton for both anchored and |
| // unanchored searches. We achieve this, principally, by simply |
| // not following failure transitions while computing the next |
| // state. Instead, if we fail to find the next state, we return |
| // a dead state, which instructs the search to stop. (This |
| // is why 'next_state' needs to know whether the search is |
| // anchored or not.) In addition, we have different start |
| // states for anchored and unanchored searches. The latter has |
| // a self-loop where as the former does not. |
| // |
| // In this way, we can use the same trie to execute both |
| // anchored and unanchored searches. There is a catch though. |
| // When building an Aho-Corasick automaton for unanchored |
| // searches, we copy matches from match states to other states |
| // (which would otherwise not be match states) if they are |
| // reachable via a failure transition. In the case of an |
| // anchored search, we *specifically* do not want to report |
| // these matches because they represent matches that start past |
| // the beginning of the search. |
| // |
| // Now we could tweak the automaton somehow to differentiate |
| // anchored from unanchored match states, but this would make |
| // 'aut.is_match' and potentially 'aut.is_special' slower. And |
| // also make the automaton itself more complex. |
| // |
| // Instead, we insert a special hack: if the search is |
| // anchored, we simply ignore matches that don't begin at |
| // the start of the search. This is not quite ideal, but we |
| // do specialize this function in such a way that unanchored |
| // searches don't pay for this additional branch. While this |
| // might cause a search to continue on for more than it |
| // otherwise optimally would, it will be no more than the |
| // longest pattern in the automaton. The reason for this is |
| // that we ensure we don't follow failure transitions during |
| // an anchored search. Combined with using a different anchored |
| // starting state with no self-loop, we guarantee that we'll |
| // at worst move through a number of transitions equal to the |
| // longest pattern. |
| // |
| // Now for DFAs, the whole point of them is to eliminate |
| // failure transitions entirely. So there is no way to say "if |
| // it's an anchored search don't follow failure transitions." |
| // Instead, we actually have to build two entirely separate |
| // automatons into the transition table. One with failure |
| // transitions built into it and another that is effectively |
| // just an encoding of the base trie into a transition table. |
| // DFAs still need this check though, because the match states |
| // still carry matches only reachable via a failure transition. |
| // Why? Because removing them seems difficult, although I |
| // haven't given it a lot of thought. |
| if !(anchored.is_anchored() && m.start() > input.start()) { |
| mat = Some(m); |
| if earliest { |
| return Ok(mat); |
| } |
| } |
| } else if let Some(pre) = pre { |
| // If we're here, we know it's a special state that is not a |
| // dead or a match state AND that a prefilter is active. Thus, |
| // it must be a start state. |
| debug_assert!(aut.is_start(sid)); |
| // We don't care about 'Candidate::Match' here because if such |
| // a match were possible, it would have been returned above |
| // when we run the prefilter before walking the automaton. |
| let span = Span::from(at..input.end()); |
| match pre.find_in(input.haystack(), span).into_option() { |
| None => return Ok(None), |
| Some(i) => { |
| if i > at { |
| at = i; |
| continue; |
| } |
| } |
| } |
| } else { |
| // When pre.is_none(), then starting states should not be |
| // treated as special. That is, without a prefilter, is_special |
| // should only return true when the state is a dead or a match |
| // state. |
| // |
| // It is possible to execute a search without a prefilter even |
| // when the underlying searcher has one: an anchored search. |
| // But in this case, the automaton makes it impossible to move |
| // back to the start state by construction, and thus, we should |
| // never reach this branch. |
| debug_assert!(false, "unreachable"); |
| } |
| } |
| at += 1; |
| } |
| Ok(mat) |
| } |
| |
| #[inline(never)] |
| fn try_find_overlapping_fwd<A: Automaton + ?Sized>( |
| aut: &A, |
| input: &Input<'_>, |
| state: &mut OverlappingState, |
| ) -> Result<(), MatchError> { |
| state.mat = None; |
| if input.is_done() { |
| return Ok(()); |
| } |
| // Searching with a pattern ID is always anchored, so we should only ever |
| // use a prefilter when no pattern ID is given. |
| if aut.prefilter().is_some() && !input.get_anchored().is_anchored() { |
| let pre = aut.prefilter().unwrap(); |
| try_find_overlapping_fwd_imp(aut, input, Some(pre), state) |
| } else { |
| try_find_overlapping_fwd_imp(aut, input, None, state) |
| } |
| } |
| |
| #[inline(always)] |
| fn try_find_overlapping_fwd_imp<A: Automaton + ?Sized>( |
| aut: &A, |
| input: &Input<'_>, |
| pre: Option<&Prefilter>, |
| state: &mut OverlappingState, |
| ) -> Result<(), MatchError> { |
| let mut sid = match state.id { |
| None => { |
| let sid = aut.start_state(input.get_anchored())?; |
| // Handle the case where the start state is a match state. That is, |
| // the empty string is in our automaton. We report every match we |
| // can here before moving on and updating 'state.at' and 'state.id' |
| // to find more matches in other parts of the haystack. |
| if aut.is_match(sid) { |
| let i = state.next_match_index.unwrap_or(0); |
| let len = aut.match_len(sid); |
| if i < len { |
| state.next_match_index = Some(i + 1); |
| state.mat = Some(get_match(aut, sid, i, input.start())); |
| return Ok(()); |
| } |
| } |
| state.at = input.start(); |
| state.id = Some(sid); |
| state.next_match_index = None; |
| state.mat = None; |
| sid |
| } |
| Some(sid) => { |
| // If we still have matches left to report in this state then |
| // report them until we've exhausted them. Only after that do we |
| // advance to the next offset in the haystack. |
| if let Some(i) = state.next_match_index { |
| let len = aut.match_len(sid); |
| if i < len { |
| state.next_match_index = Some(i + 1); |
| state.mat = Some(get_match(aut, sid, i, state.at + 1)); |
| return Ok(()); |
| } |
| // Once we've reported all matches at a given position, we need |
| // to advance the search to the next position. |
| state.at += 1; |
| state.next_match_index = None; |
| state.mat = None; |
| } |
| sid |
| } |
| }; |
| while state.at < input.end() { |
| sid = aut.next_state( |
| input.get_anchored(), |
| sid, |
| input.haystack()[state.at], |
| ); |
| if aut.is_special(sid) { |
| state.id = Some(sid); |
| if aut.is_dead(sid) { |
| return Ok(()); |
| } else if aut.is_match(sid) { |
| state.next_match_index = Some(1); |
| state.mat = Some(get_match(aut, sid, 0, state.at + 1)); |
| return Ok(()); |
| } else if let Some(pre) = pre { |
| // If we're here, we know it's a special state that is not a |
| // dead or a match state AND that a prefilter is active. Thus, |
| // it must be a start state. |
| debug_assert!(aut.is_start(sid)); |
| let span = Span::from(state.at..input.end()); |
| match pre.find_in(input.haystack(), span).into_option() { |
| None => return Ok(()), |
| Some(i) => { |
| if i > state.at { |
| state.at = i; |
| continue; |
| } |
| } |
| } |
| } else { |
| // When pre.is_none(), then starting states should not be |
| // treated as special. That is, without a prefilter, is_special |
| // should only return true when the state is a dead or a match |
| // state. |
| // |
| // ... except for one special case: in stream searching, we |
| // currently call overlapping search with a 'None' prefilter, |
| // regardless of whether one exists or not, because stream |
| // searching can't currently deal with prefilters correctly in |
| // all cases. |
| } |
| } |
| state.at += 1; |
| } |
| state.id = Some(sid); |
| Ok(()) |
| } |
| |
| #[inline(always)] |
| fn get_match<A: Automaton + ?Sized>( |
| aut: &A, |
| sid: StateID, |
| index: usize, |
| at: usize, |
| ) -> Match { |
| let pid = aut.match_pattern(sid, index); |
| let len = aut.pattern_len(pid); |
| Match::new(pid, (at - len)..at) |
| } |
| |
| /// Write a prefix "state" indicator for fmt::Debug impls. It always writes |
| /// exactly two printable bytes to the given formatter. |
| /// |
| /// Specifically, this tries to succinctly distinguish the different types of |
| /// states: dead states, start states and match states. It even accounts for |
| /// the possible overlappings of different state types. (The only possible |
| /// overlapping is that of match and start states.) |
| pub(crate) fn fmt_state_indicator<A: Automaton>( |
| f: &mut core::fmt::Formatter<'_>, |
| aut: A, |
| id: StateID, |
| ) -> core::fmt::Result { |
| if aut.is_dead(id) { |
| write!(f, "D ")?; |
| } else if aut.is_match(id) { |
| if aut.is_start(id) { |
| write!(f, "*>")?; |
| } else { |
| write!(f, "* ")?; |
| } |
| } else if aut.is_start(id) { |
| write!(f, " >")?; |
| } else { |
| write!(f, " ")?; |
| } |
| Ok(()) |
| } |
| |
| /// Return an iterator of transitions in a sparse format given an iterator |
| /// of all explicitly defined transitions. The iterator yields ranges of |
| /// transitions, such that any adjacent transitions mapped to the same |
| /// state are combined into a single range. |
| pub(crate) fn sparse_transitions<'a>( |
| mut it: impl Iterator<Item = (u8, StateID)> + 'a, |
| ) -> impl Iterator<Item = (u8, u8, StateID)> + 'a { |
| let mut cur: Option<(u8, u8, StateID)> = None; |
| core::iter::from_fn(move || { |
| while let Some((class, next)) = it.next() { |
| let (prev_start, prev_end, prev_next) = match cur { |
| Some(x) => x, |
| None => { |
| cur = Some((class, class, next)); |
| continue; |
| } |
| }; |
| if prev_next == next { |
| cur = Some((prev_start, class, prev_next)); |
| } else { |
| cur = Some((class, class, next)); |
| return Some((prev_start, prev_end, prev_next)); |
| } |
| } |
| if let Some((start, end, next)) = cur.take() { |
| return Some((start, end, next)); |
| } |
| None |
| }) |
| } |