vendor/regex-1.10.5/src/bytes.rs - toolchain/rustc - Git at Google

 /*!
 Search for regex matches in `&[u8]` haystacks.

 This module provides, via [`Regex`], an API nearly identical to the one found
 at the top level of this crate. There are two important differences:

 1. Matching is done on `&[u8]` instead of `&str`. Additionally, `Vec<u8>`
 is used where `String` would have been used in the top-level API.
 2. Unicode support can be disabled even when disabling it would result in
 matching invalid UTF-8 bytes.

 # Example: match null-terminated strings

 This shows how to find all null-terminated strings in a slice of bytes. This
 works even if a C string contains invalid UTF-8.

 ```rust
 use regex::bytes::Regex;

 let re = Regex::new(r"(?-u)(?<cstr>[^\x00]+)\x00").unwrap();
 let hay = b"foo\x00qu\xFFux\x00baz\x00";

 // Extract all of the strings without the NUL terminator from each match.
 // The unwrap is OK here since a match requires the `cstr` capture to match.
 let cstrs: Vec<&[u8]> =
     re.captures_iter(hay)
       .map(|c| c.name("cstr").unwrap().as_bytes())
       .collect();
 assert_eq!(cstrs, vec![&b"foo"[..], &b"qu\xFFux"[..], &b"baz"[..]]);
 ```

 # Example: selectively enable Unicode support

 This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded
 string (e.g., to extract a title from a Matroska file):

 ```rust
 use regex::bytes::Regex;

 let re = Regex::new(
     r"(?-u)\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))"
 ).unwrap();
 let hay = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65";

 // Notice that despite the `.*` at the end, it will only match valid UTF-8
 // because Unicode mode was enabled with the `u` flag. Without the `u` flag,
 // the `.*` would match the rest of the bytes regardless of whether they were
 // valid UTF-8.
 let (_, [title]) = re.captures(hay).unwrap().extract();
 assert_eq!(title, b"\xE2\x98\x83");
 // The title can now be decoded as UTF-8. The unwrap here is
 // correct because the existence of a match guarantees
 // that `title` is valid UTF-8.
 let title = std::str::from_utf8(title).unwrap();
 assert_eq!(title, "☃");
 ```

 In general, if the Unicode flag is enabled in a capture group and that capture
 is part of the overall match, then the capture is *guaranteed* to be valid
 UTF-8.

 # Syntax

 The supported syntax is nearly the same as the syntax for Unicode
 regular expressions, with a few changes that make sense for matching arbitrary
 bytes:

 1. The `u` flag can be disabled even when disabling it might cause the regex to
 match invalid UTF-8. When the `u` flag is disabled, the regex is said to be in
 "ASCII compatible" mode.
 2. In ASCII compatible mode, Unicode character classes are not allowed. Literal
 Unicode scalar values outside of character classes are allowed.
 3. In ASCII compatible mode, Perl character classes (`\w`, `\d` and `\s`)
 revert to their typical ASCII definition. `\w` maps to `[[:word:]]`, `\d` maps
 to `[[:digit:]]` and `\s` maps to `[[:space:]]`.
 4. In ASCII compatible mode, word boundaries use the ASCII compatible `\w` to
 determine whether a byte is a word byte or not.
 5. Hexadecimal notation can be used to specify arbitrary bytes instead of
 Unicode codepoints. For example, in ASCII compatible mode, `\xFF` matches the
 literal byte `\xFF`, while in Unicode mode, `\xFF` is the Unicode codepoint
 `U+00FF` that matches its UTF-8 encoding of `\xC3\xBF`. Similarly for octal
 notation when enabled.
 6. In ASCII compatible mode, `.` matches any *byte* except for `\n`. When the
 `s` flag is additionally enabled, `.` matches any byte.

 # Performance

 In general, one should expect performance on `&[u8]` to be roughly similar to
 performance on `&str`.
 */
 pub use crate::{builders::bytes::*, regex::bytes::*, regexset::bytes::*};