| ----------------------------------------------------------------------------- |
| This file contains a concatenation of the PCRE2 man pages, converted to plain |
| text format for ease of searching with a text editor, or for use on systems |
| that do not have a man page processor. The small individual files that give |
| synopses of each function in the library have not been included. Neither has |
| the pcre2demo program. There are separate text files for the pcre2grep and |
| pcre2test commands. |
| ----------------------------------------------------------------------------- |
| |
| |
| PCRE2(3) Library Functions Manual PCRE2(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| INTRODUCTION |
| |
| PCRE2 is the name used for a revised API for the PCRE library, which is |
| a set of functions, written in C, that implement regular expression |
| pattern matching using the same syntax and semantics as Perl, with just |
| a few differences. After nearly two decades, the limitations of the |
| original API were making development increasingly difficult. The new |
| API is more extensible, and it was simplified by abolishing the sepa- |
| rate "study" optimizing function; in PCRE2, patterns are automatically |
| optimized where possible. Since forking from PCRE1, the code has been |
| extensively refactored and new features introduced. The old library is |
| now obsolete and is no longer maintained. |
| |
| As well as Perl-style regular expression patterns, some features that |
| appeared in Python and the original PCRE before they appeared in Perl |
| are available using the Python syntax. There is also some support for |
| one or two .NET and Oniguruma syntax items, and there are options for |
| requesting some minor changes that give better ECMAScript (aka Java- |
| Script) compatibility. |
| |
| The source code for PCRE2 can be compiled to support strings of 8-bit, |
| 16-bit, or 32-bit code units, which means that up to three separate li- |
| braries may be installed, one for each code unit size. The size of code |
| unit is not related to the bit size of the underlying hardware. In a |
| 64-bit environment that also supports 32-bit applications, versions of |
| PCRE2 that are compiled in both 64-bit and 32-bit modes may be needed. |
| |
| The original work to extend PCRE to 16-bit and 32-bit code units was |
| done by Zoltan Herczeg and Christian Persch, respectively. In all three |
| cases, strings can be interpreted either as one character per code |
| unit, or as UTF-encoded Unicode, with support for Unicode general cate- |
| gory properties. Unicode support is optional at build time (but is the |
| default). However, processing strings as UTF code units must be enabled |
| explicitly at run time. The version of Unicode in use can be discovered |
| by running |
| |
| pcre2test -C |
| |
| The three libraries contain identical sets of functions, with names |
| ending in _8, _16, or _32, respectively (for example, pcre2_com- |
| pile_8()). However, by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or |
| 32, a program that uses just one code unit width can be written using |
| generic names such as pcre2_compile(), and the documentation is written |
| assuming that this is the case. |
| |
| In addition to the Perl-compatible matching function, PCRE2 contains an |
| alternative function that matches the same compiled patterns in a dif- |
| ferent way. In certain circumstances, the alternative function has some |
| advantages. For a discussion of the two matching algorithms, see the |
| pcre2matching page. |
| |
| Details of exactly which Perl regular expression features are and are |
| not supported by PCRE2 are given in separate documents. See the |
| pcre2pattern and pcre2compat pages. There is a syntax summary in the |
| pcre2syntax page. |
| |
| Some features of PCRE2 can be included, excluded, or changed when the |
| library is built. The pcre2_config() function makes it possible for a |
| client to discover which features are available. The features them- |
| selves are described in the pcre2build page. Documentation about build- |
| ing PCRE2 for various operating systems can be found in the README and |
| NON-AUTOTOOLS_BUILD files in the source distribution. |
| |
| The libraries contain a number of undocumented internal functions and |
| data tables that are used by more than one of the exported external |
| functions, but which are not intended for use by external callers. |
| Their names all begin with "_pcre2", which hopefully will not provoke |
| any name clashes. In some environments, it is possible to control which |
| external symbols are exported when a shared library is built, and in |
| these cases the undocumented symbols are not exported. |
| |
| |
| SECURITY CONSIDERATIONS |
| |
| If you are using PCRE2 in a non-UTF application that permits users to |
| supply arbitrary patterns for compilation, you should be aware of a |
| feature that allows users to turn on UTF support from within a pattern. |
| For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8 |
| mode, which interprets patterns and subjects as strings of UTF-8 code |
| units instead of individual 8-bit characters. This causes both the pat- |
| tern and any data against which it is matched to be checked for UTF-8 |
| validity. If the data string is very long, such a check might use |
| enough resources to degrade your application's performance. |
| |
| One way of guarding against this possibility is to use the pcre2_pat- |
| tern_info() function to check the compiled pattern's options for |
| PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when |
| calling pcre2_compile(). This causes a compile time error if the pat- |
| tern contains a UTF-setting sequence. |
| |
| The use of Unicode properties for character types such as \d can also |
| be enabled from within the pattern, by specifying "(*UCP)". This fea- |
| ture can be disallowed by setting the PCRE2_NEVER_UCP option. |
| |
| If your application is one that supports UTF, be aware that validity |
| checking can take time. If the same data string is to be matched many |
| times, you can use the PCRE2_NO_UTF_CHECK option for the second and |
| subsequent matches to avoid running redundant checks. |
| |
| The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead |
| to problems, because it may leave the current matching point in the |
| middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C op- |
| tion can be used by an application to lock out the use of \C, causing a |
| compile-time error if it is encountered. It is also possible to build |
| PCRE2 with the use of \C permanently disabled. |
| |
| Performance can also suffer when a pattern that has a very large |
| search tree is run against a string that will never match. |
| Nested unlimited repeats in a pattern are a common example. PCRE2 pro- |
| vides some protection against this: see the pcre2_set_match_limit() |
| function in the pcre2api page. There is a similar function called |
| pcre2_set_depth_limit() that can be used to restrict the amount of mem- |
| ory that is used. |
| |
| |
| USER DOCUMENTATION |
| |
| The user documentation for PCRE2 comprises a number of different sec- |
| tions. In the "man" format, each of these is a separate "man page". In |
| the HTML format, each is a separate page, linked from the index page. |
| In the plain text format, the descriptions of the pcre2grep and |
| pcre2test programs are in files called pcre2grep.txt and pcre2test.txt, |
| respectively. The remaining sections, except for the pcre2demo section |
| (which is a program listing), and the short pages for individual func- |
| tions, are concatenated in pcre2.txt, for ease of searching. The sec- |
| tions are as follows: |
| |
| pcre2 this document |
| pcre2-config show PCRE2 installation configuration information |
| pcre2api details of PCRE2's native C API |
| pcre2build building PCRE2 |
| pcre2callout details of the pattern callout feature |
| pcre2compat discussion of Perl compatibility |
| pcre2convert details of pattern conversion functions |
| pcre2demo a demonstration C program that uses PCRE2 |
| pcre2grep description of the pcre2grep command (8-bit only) |
| pcre2jit discussion of just-in-time optimization support |
| pcre2limits details of size and other limits |
| pcre2matching discussion of the two matching algorithms |
| pcre2partial details of the partial matching facility |
| pcre2pattern syntax and semantics of supported regular |
| expression patterns |
| pcre2perform discussion of performance issues |
| pcre2posix the POSIX-compatible C API for the 8-bit library |
| pcre2sample discussion of the pcre2demo program |
| pcre2serialize details of pattern serialization |
| pcre2syntax quick syntax reference |
| pcre2test description of the pcre2test command |
| pcre2unicode discussion of Unicode and UTF support |
| |
| In the "man" and HTML formats, there is also a short page for each C |
| library function, listing its arguments and results. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| Retired from University Computing Service |
| Cambridge, England. |
| |
| Putting an actual email address here is a spam magnet. If you want to |
| email me, use my two names separated by a dot at gmail.com. |
| |
| |
| REVISION |
| |
| Last updated: 27 August 2021 |
| Copyright (c) 1997-2021 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2API(3) Library Functions Manual PCRE2API(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| #include <pcre2.h> |
| |
| PCRE2 is a new API for PCRE, starting at release 10.0. This document |
| contains a description of all its native functions. See the pcre2 docu- |
| ment for an overview of all the PCRE2 documentation. |
| |
| |
| PCRE2 NATIVE API BASIC FUNCTIONS |
| |
| pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length, |
| uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset, |
| pcre2_compile_context *ccontext); |
| |
| void pcre2_code_free(pcre2_code *code); |
| |
| pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize, |
| pcre2_general_context *gcontext); |
| |
| pcre2_match_data *pcre2_match_data_create_from_pattern( |
| const pcre2_code *code, pcre2_general_context *gcontext); |
| |
| int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext); |
| |
| int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext, |
| int *workspace, PCRE2_SIZE wscount); |
| |
| void pcre2_match_data_free(pcre2_match_data *match_data); |
| |
| |
| PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS |
| |
| PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data); |
| |
| uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data); |
| |
| PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data); |
| |
| PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data); |
| |
| |
| PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS |
| |
| pcre2_general_context *pcre2_general_context_create( |
| void *(*private_malloc)(PCRE2_SIZE, void *), |
| void (*private_free)(void *, void *), void *memory_data); |
| |
| pcre2_general_context *pcre2_general_context_copy( |
| pcre2_general_context *gcontext); |
| |
| void pcre2_general_context_free(pcre2_general_context *gcontext); |
| |
| |
| PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS |
| |
| pcre2_compile_context *pcre2_compile_context_create( |
| pcre2_general_context *gcontext); |
| |
| pcre2_compile_context *pcre2_compile_context_copy( |
| pcre2_compile_context *ccontext); |
| |
| void pcre2_compile_context_free(pcre2_compile_context *ccontext); |
| |
| int pcre2_set_bsr(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| int pcre2_set_character_tables(pcre2_compile_context *ccontext, |
| const uint8_t *tables); |
| |
| int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext, |
| uint32_t extra_options); |
| |
| int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext, |
| PCRE2_SIZE value); |
| |
| int pcre2_set_newline(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext, |
| int (*guard_function)(uint32_t, void *), void *user_data); |
| |
| |
| PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS |
| |
| pcre2_match_context *pcre2_match_context_create( |
| pcre2_general_context *gcontext); |
| |
| pcre2_match_context *pcre2_match_context_copy( |
| pcre2_match_context *mcontext); |
| |
| void pcre2_match_context_free(pcre2_match_context *mcontext); |
| |
| int pcre2_set_callout(pcre2_match_context *mcontext, |
| int (*callout_function)(pcre2_callout_block *, void *), |
| void *callout_data); |
| |
| int pcre2_set_substitute_callout(pcre2_match_context *mcontext, |
| int (*callout_function)(pcre2_substitute_callout_block *, void *), |
| void *callout_data); |
| |
| int pcre2_set_offset_limit(pcre2_match_context *mcontext, |
| PCRE2_SIZE value); |
| |
| int pcre2_set_heap_limit(pcre2_match_context *mcontext, |
| uint32_t value); |
| |
| int pcre2_set_match_limit(pcre2_match_context *mcontext, |
| uint32_t value); |
| |
| int pcre2_set_depth_limit(pcre2_match_context *mcontext, |
| uint32_t value); |
| |
| |
| PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS |
| |
| int pcre2_substring_copy_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen); |
| |
| int pcre2_substring_copy_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_UCHAR *buffer, |
| PCRE2_SIZE *bufflen); |
| |
| void pcre2_substring_free(PCRE2_UCHAR *buffer); |
| |
| int pcre2_substring_get_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen); |
| |
| int pcre2_substring_get_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_UCHAR **bufferptr, |
| PCRE2_SIZE *bufflen); |
| |
| int pcre2_substring_length_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_SIZE *length); |
| |
| int pcre2_substring_length_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_SIZE *length); |
| |
| int pcre2_substring_nametable_scan(const pcre2_code *code, |
| PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last); |
| |
| int pcre2_substring_number_from_name(const pcre2_code *code, |
| PCRE2_SPTR name); |
| |
| void pcre2_substring_list_free(PCRE2_SPTR *list); |
| |
| int pcre2_substring_list_get(pcre2_match_data *match_data, |
| PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr); |
| |
| |
| PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION |
| |
| int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext, PCRE2_SPTR replacement, |
| PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer, |
| PCRE2_SIZE *outlengthptr); |
| |
| |
| PCRE2 NATIVE API JIT FUNCTIONS |
| |
| int pcre2_jit_compile(pcre2_code *code, uint32_t options); |
| |
| int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext); |
| |
| void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext); |
| |
| pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize, |
| PCRE2_SIZE maxsize, pcre2_general_context *gcontext); |
| |
| void pcre2_jit_stack_assign(pcre2_match_context *mcontext, |
| pcre2_jit_callback callback_function, void *callback_data); |
| |
| void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack); |
| |
| |
| PCRE2 NATIVE API SERIALIZATION FUNCTIONS |
| |
| int32_t pcre2_serialize_decode(pcre2_code **codes, |
| int32_t number_of_codes, const uint8_t *bytes, |
| pcre2_general_context *gcontext); |
| |
| int32_t pcre2_serialize_encode(const pcre2_code **codes, |
| int32_t number_of_codes, uint8_t **serialized_bytes, |
| PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext); |
| |
| void pcre2_serialize_free(uint8_t *bytes); |
| |
| int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes); |
| |
| |
| PCRE2 NATIVE API AUXILIARY FUNCTIONS |
| |
| pcre2_code *pcre2_code_copy(const pcre2_code *code); |
| |
| pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code); |
| |
| int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer, |
| PCRE2_SIZE bufflen); |
| |
| const uint8_t *pcre2_maketables(pcre2_general_context *gcontext); |
| |
| void pcre2_maketables_free(pcre2_general_context *gcontext, |
| const uint8_t *tables); |
| |
| int pcre2_pattern_info(const pcre2_code *code, uint32_t what, |
| void *where); |
| |
| int pcre2_callout_enumerate(const pcre2_code *code, |
| int (*callback)(pcre2_callout_enumerate_block *, void *), |
| void *user_data); |
| |
| int pcre2_config(uint32_t what, void *where); |
| |
| |
| PCRE2 NATIVE API OBSOLETE FUNCTIONS |
| |
| int pcre2_set_recursion_limit(pcre2_match_context *mcontext, |
| uint32_t value); |
| |
| int pcre2_set_recursion_memory_management( |
| pcre2_match_context *mcontext, |
| void *(*private_malloc)(PCRE2_SIZE, void *), |
| void (*private_free)(void *, void *), void *memory_data); |
| |
| These functions became obsolete at release 10.30 and are retained only |
| for backward compatibility. They should not be used in new code. The |
| first is replaced by pcre2_set_depth_limit(); the second is no longer |
| needed and has no effect (it always returns zero). |
| |
| |
| PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS |
| |
| pcre2_convert_context *pcre2_convert_context_create( |
| pcre2_general_context *gcontext); |
| |
| pcre2_convert_context *pcre2_convert_context_copy( |
| pcre2_convert_context *cvcontext); |
| |
| void pcre2_convert_context_free(pcre2_convert_context *cvcontext); |
| |
| int pcre2_set_glob_escape(pcre2_convert_context *cvcontext, |
| uint32_t escape_char); |
| |
| int pcre2_set_glob_separator(pcre2_convert_context *cvcontext, |
| uint32_t separator_char); |
| |
| int pcre2_pattern_convert(PCRE2_SPTR pattern, PCRE2_SIZE length, |
| uint32_t options, PCRE2_UCHAR **buffer, |
| PCRE2_SIZE *blength, pcre2_convert_context *cvcontext); |
| |
| void pcre2_converted_pattern_free(PCRE2_UCHAR *converted_pattern); |
| |
| These functions provide a way of converting non-PCRE2 patterns into |
| patterns that can be processed by pcre2_compile(). This facility is ex- |
| perimental and may be changed in future releases. At present, "globs" |
| and POSIX basic and extended patterns can be converted. Details are |
| given in the pcre2convert documentation. |
| |
| |
| PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES |
| |
| There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit |
| code units, respectively. However, there is just one header file, |
| pcre2.h. This contains the function prototypes and other definitions |
| for all three libraries. One, two, or all three can be installed simul- |
| taneously. On Unix-like systems the libraries are called libpcre2-8, |
| libpcre2-16, and libpcre2-32, and they can also co-exist with the orig- |
| inal PCRE libraries. |
| |
| Character strings are passed to and from a PCRE2 library as a sequence |
| of unsigned integers in code units of the appropriate width. Every |
| PCRE2 function comes in three different forms, one for each library, |
| for example: |
| |
| pcre2_compile_8() |
| pcre2_compile_16() |
| pcre2_compile_32() |
| |
| There are also three different sets of data types: |
| |
| PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32 |
| PCRE2_SPTR8, PCRE2_SPTR16, PCRE2_SPTR32 |
| |
| The UCHAR types define unsigned code units of the appropriate widths. |
| For example, PCRE2_UCHAR16 is usually defined as "uint16_t". The SPTR |
| types are constant pointers to the equivalent UCHAR types, that is, |
| they are pointers to vectors of unsigned code units. |
| |
| Many applications use only one code unit width. For their convenience, |
| macros are defined whose names are the generic forms such as pcre2_com- |
| pile() and PCRE2_SPTR. These macros use the value of the macro |
| PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific func- |
| tion and macro names. PCRE2_CODE_UNIT_WIDTH is not defined by default. |
| An application must define it to be 8, 16, or 32 before including |
| pcre2.h in order to make use of the generic names. |
| |
| Applications that use more than one code unit width can be linked with |
| more than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to |
| be 0 before including pcre2.h, and then use the real function names. |
| Any code that is to be included in an environment where the value of |
| PCRE2_CODE_UNIT_WIDTH is unknown should also use the real function |
| names. (Unfortunately, it is not possible in C code to save and restore |
| the value of a macro.) |
| |
| If PCRE2_CODE_UNIT_WIDTH is not defined before including pcre2.h, a |
| compiler error occurs. |
| |
| When using multiple libraries in an application, you must take care |
| when processing any particular pattern to use only functions from a |
| single library. For example, if you want to run a match using a pat- |
| tern that was compiled with pcre2_compile_16(), you must do so with |
| pcre2_match_16(), not pcre2_match_8() or pcre2_match_32(). |
| |
| In the function summaries above, and in the rest of this document and |
| other PCRE2 documents, functions and data types are described using |
| their generic names, without the _8, _16, or _32 suffix. |
| |
| |
| PCRE2 API OVERVIEW |
| |
| PCRE2 has its own native API, which is described in this document. |
| There are also some wrapper functions for the 8-bit library that corre- |
| spond to the POSIX regular expression API, but they do not give access |
| to all the functionality of PCRE2. They are described in the pcre2posix |
| documentation. Both these APIs define a set of C function calls. |
| |
| The native API C data types, function prototypes, option values, and |
| error codes are defined in the header file pcre2.h, which also contains |
| definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release |
| numbers for the library. Applications can use these to include support |
| for different releases of PCRE2. |
| |
| In a Windows environment, if you want to statically link an application |
| program against a non-dll PCRE2 library, you must define PCRE2_STATIC |
| before including pcre2.h. |
| |
| The functions pcre2_compile() and pcre2_match() are used for compiling |
| and matching regular expressions in a Perl-compatible manner. A sample |
| program that demonstrates the simplest way of using them is provided in |
| the file called pcre2demo.c in the PCRE2 source distribution. A listing |
| of this program is given in the pcre2demo documentation, and the |
| pcre2sample documentation describes how to compile and run it. |
| |
| The compiling and matching functions recognize various options that are |
| passed as bits in an options argument. There are also some more compli- |
| cated parameters such as custom memory management functions and re- |
| source limits that are passed in "contexts" (which are just memory |
| blocks, described below). Simple applications do not need to make use |
| of contexts. |
| |
| Just-in-time (JIT) compiler support is an optional feature of PCRE2 |
| that can be built in appropriate hardware environments. It greatly |
| speeds up the matching performance of many patterns. Programs can re- |
| quest that it be used if available by calling pcre2_jit_compile() after |
| a pattern has been successfully compiled by pcre2_compile(). This does |
| nothing if JIT support is not available. |
| |
| More complicated programs might need to make use of the specialist |
| functions pcre2_jit_stack_create(), pcre2_jit_stack_free(), and |
| pcre2_jit_stack_assign() in order to control the JIT code's memory us- |
| age. |
| |
| JIT matching is automatically used by pcre2_match() if it is available, |
| unless the PCRE2_NO_JIT option is set. There is also a direct interface |
| for JIT matching, which gives improved performance at the cost of |
| reduced sanity checking. The JIT-specific functions are discussed in the |
| pcre2jit documentation. |
| |
| A second matching function, pcre2_dfa_match(), which is not Perl-com- |
| patible, is also provided. This uses a different algorithm for the |
| matching. The alternative algorithm finds all possible matches (at a |
| given point in the subject), and scans the subject just once (unless |
| there are lookaround assertions). However, this algorithm does not re- |
| turn captured substrings. A description of the two matching algorithms |
| and their advantages and disadvantages is given in the pcre2matching |
| documentation. There is no JIT support for pcre2_dfa_match(). |
| |
| In addition to the main compiling and matching functions, there are |
| convenience functions for extracting captured substrings from a subject |
| string that has been matched by pcre2_match(). They are: |
| |
| pcre2_substring_copy_byname() |
| pcre2_substring_copy_bynumber() |
| pcre2_substring_get_byname() |
| pcre2_substring_get_bynumber() |
| pcre2_substring_list_get() |
| pcre2_substring_length_byname() |
| pcre2_substring_length_bynumber() |
| pcre2_substring_nametable_scan() |
| pcre2_substring_number_from_name() |
| |
| pcre2_substring_free() and pcre2_substring_list_free() are also pro- |
| vided, to free memory used for extracted strings. If either of these |
| functions is called with a NULL argument, the function returns immedi- |
| ately without doing anything. |
| |
| The function pcre2_substitute() can be called to match a pattern and |
| return a copy of the subject string with substitutions for parts that |
| were matched. |
| |
| Functions whose names begin with pcre2_serialize_ are used for saving |
| compiled patterns on disc or elsewhere, and reloading them later. |
| |
| Finally, there are functions for finding out information about a com- |
| piled pattern (pcre2_pattern_info()) and about the configuration with |
| which PCRE2 was built (pcre2_config()). |
| |
| Functions with names ending with _free() are used for freeing memory |
| blocks of various sorts. In all cases, if one of these functions is |
| called with a NULL argument, it does nothing. |
| |
| |
| STRING LENGTHS AND OFFSETS |
| |
| The PCRE2 API uses string lengths and offsets into strings of code |
| units in several places. These values are always of type PCRE2_SIZE, |
| which is an unsigned integer type, currently always defined as size_t. |
| The largest value that can be stored in such a type (that is, |
| ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated |
| strings and unset offsets. Therefore, the longest string that can be |
| handled is one less than this maximum. |
| |
| |
| NEWLINES |
| |
| PCRE2 supports five different conventions for indicating line breaks in |
| strings: a single CR (carriage return) character, a single LF (line- |
| feed) character, the two-character sequence CRLF, any of the three pre- |
| ceding, or any Unicode newline sequence. The Unicode newline sequences |
| are the three just mentioned, plus the single characters VT (vertical |
| tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line |
| separator, U+2028), and PS (paragraph separator, U+2029). |
| |
| Each of the first three conventions is used by at least one operating |
| system as its standard newline sequence. When PCRE2 is built, a default |
| can be specified. If it is not, the default is set to LF, which is the |
| Unix standard. However, the newline convention can be changed by an ap- |
| plication when calling pcre2_compile(), or it can be specified by spe- |
| cial text at the start of the pattern itself; this overrides any other |
| settings. See the pcre2pattern page for details of the special charac- |
| ter sequences. |
| |
| In the PCRE2 documentation the word "newline" is used to mean "the |
| character or pair of characters that indicate a line break". The choice |
| of newline convention affects the handling of the dot, circumflex, and |
| dollar metacharacters, the handling of #-comments in /x mode, and, when |
| CRLF is a recognized line ending sequence, the match position advance- |
| ment for a non-anchored pattern. There is more detail about this in the |
| section on pcre2_match() options below. |
| |
| The choice of newline convention does not affect the interpretation of |
| the \n or \r escape sequences, nor does it affect what \R matches; this |
| has its own separate convention. |
| |
| |
| MULTITHREADING |
| |
| In a multithreaded application it is important to keep thread-specific |
| data separate from data that can be shared between threads. The PCRE2 |
| library code itself is thread-safe: it contains no static or global |
| variables. The API is designed to be fairly simple for non-threaded ap- |
| plications while at the same time ensuring that multithreaded applica- |
| tions can use it. |
| |
| There are several different blocks of data that are used to pass infor- |
| mation between the application and the PCRE2 libraries. |
| |
| The compiled pattern |
| |
| A pointer to the compiled form of a pattern is returned to the user |
| when pcre2_compile() is successful. The data in the compiled pattern is |
| fixed, and does not change when the pattern is matched. Therefore, it |
| is thread-safe, that is, the same compiled pattern can be used by more |
| than one thread simultaneously. For example, an application can compile |
| all its patterns at the start, before forking off multiple threads that |
| use them. However, if the just-in-time (JIT) optimization feature is |
| being used, it needs separate memory stack areas for each thread. See |
| the pcre2jit documentation for more details. |
| |
| In a more complicated situation, where patterns are compiled only when |
| they are first needed, but are still shared between threads, pointers |
| to compiled patterns must be protected from simultaneous writing by |
| multiple threads. This is somewhat tricky to do correctly. If you know |
| that writing to a pointer is atomic in your environment, you can use |
| logic like this: |
| |
| Get a read-only (shared) lock (mutex) for pointer |
| if (pointer == NULL) |
| { |
| Get a write (unique) lock for pointer |
| if (pointer == NULL) pointer = pcre2_compile(... |
| } |
| Release the lock |
| Use pointer in pcre2_match() |
| |
| Of course, testing for compilation errors should also be included in |
| the code. |
| |
| The reason for checking the pointer a second time is as follows: Sev- |
| eral threads may have acquired the shared lock and tested the pointer |
| for being NULL, but only one of them will be given the write lock, with |
| the rest kept waiting. The winning thread will compile the pattern and |
| store the result. After this thread releases the write lock, another |
| thread will get it, and if it does not retest pointer for being NULL, |
| will recompile the pattern and overwrite the pointer, creating a memory |
| leak and possibly causing other issues. |
| |
| In an environment where writing to a pointer may not be atomic, the |
| above logic is not sufficient. The thread that is doing the compiling |
| may be descheduled after writing only part of the pointer, which could |
| cause other threads to use an invalid value. Instead of checking the |
| pointer itself, a separate "pointer is valid" flag (that can be updated |
| atomically) must be used: |
| |
| Get a read-only (shared) lock (mutex) for pointer |
| if (!pointer_is_valid) |
| { |
| Get a write (unique) lock for pointer |
| if (!pointer_is_valid) |
| { |
| pointer = pcre2_compile(... |
| pointer_is_valid = TRUE |
| } |
| } |
| Release the lock |
| Use pointer in pcre2_match() |
| |
| If JIT is being used, but the JIT compilation is not being done immedi- |
| ately (perhaps waiting to see if the pattern is used often enough), |
| similar logic is required. JIT compilation updates a value within the |
| compiled code block, so a thread must gain unique write access to the |
| pointer before calling pcre2_jit_compile(). Alternatively, |
| pcre2_code_copy() or pcre2_code_copy_with_tables() can be used to ob- |
| tain a private copy of the compiled code before calling the JIT com- |
| piler. |
| |
| Context blocks |
| |
| The next main section below introduces the idea of "contexts" in which |
| PCRE2 functions are called. A context is nothing more than a collection |
| of parameters that control the way PCRE2 operates. Grouping a number of |
| parameters together in a context is a convenient way of passing them to |
| a PCRE2 function without using lots of arguments. The parameters that |
| are stored in contexts are in some sense "advanced features" of the |
| API. Many straightforward applications will not need to use contexts. |
| |
| In a multithreaded application, if the parameters in a context are val- |
| ues that are never changed, the same context can be used by all the |
| threads. However, if any thread needs to change any value in a context, |
| it must make its own thread-specific copy. |
| |
| Match blocks |
| |
| The matching functions need a block of memory for storing the results |
| of a match. This includes details of what was matched, as well as addi- |
| tional information such as the name of a (*MARK) setting. Each thread |
| must provide its own copy of this memory. |
| |
| |
| PCRE2 CONTEXTS |
| |
| Some PCRE2 functions have a lot of parameters, many of which are used |
| only by specialist applications, for example, those that use custom |
| memory management or non-standard character tables. To keep function |
| argument lists at a reasonable size, and at the same time to keep the |
| API extensible, "uncommon" parameters are passed to certain functions |
| in a context instead of directly. A context is just a block of memory |
| that holds the parameter values. Applications that do not need to ad- |
| just any of the context parameters can pass NULL when a context pointer |
| is required. |
| |
| There are three different types of context: a general context that is |
| relevant for several PCRE2 operations, a compile-time context, and a |
| match-time context. |
| |
| The general context |
| |
| At present, this context just contains pointers to (and data for) ex- |
| ternal memory management functions that are called from several places |
| in the PCRE2 library. The context is named "general" rather than |
| specifically "memory" because in future other fields may be added. If |
| you do not want to supply your own custom memory management functions, |
| you do not need to bother with a general context. A general context is |
| created by: |
| |
| pcre2_general_context *pcre2_general_context_create( |
| void *(*private_malloc)(PCRE2_SIZE, void *), |
| void (*private_free)(void *, void *), void *memory_data); |
| |
| The two function pointers specify custom memory management functions, |
| whose prototypes are: |
| |
| void *private_malloc(PCRE2_SIZE, void *); |
| void private_free(void *, void *); |
| |
| Whenever code in PCRE2 calls these functions, the final argument is the |
| value of memory_data. Either of the first two arguments of the creation |
| function may be NULL, in which case the system memory management func- |
| tions malloc() and free() are used. (This is not currently useful, as |
| there are no other fields in a general context, but in future there |
| might be.) The private_malloc() function is used (if supplied) to ob- |
| tain memory for storing the context, and all three values are saved as |
| part of the context. |
| |
| Whenever PCRE2 creates a data block of any kind, the block contains a |
| pointer to the free() function that matches the malloc() function that |
| was used. When the time comes to free the block, this function is |
| called. |
| |
| A general context can be copied by calling: |
| |
| pcre2_general_context *pcre2_general_context_copy( |
| pcre2_general_context *gcontext); |
| |
| The memory used for a general context should be freed by calling: |
| |
| void pcre2_general_context_free(pcre2_general_context *gcontext); |
| |
| If this function is passed a NULL argument, it returns immediately |
| without doing anything. |
| |
| The compile context |
| |
| A compile context is required if you want to provide an external func- |
| tion for stack checking during compilation or to change the default |
| values of any of the following compile-time parameters: |
| |
| What \R matches (Unicode newlines or CR, LF, CRLF only) |
| PCRE2's character tables |
| The newline character sequence |
| The compile time nested parentheses limit |
| The maximum length of the pattern string |
| The extra options bits (none set by default) |
| |
| A compile context is also required if you are using custom memory man- |
| agement. If none of these apply, just pass NULL as the context argu- |
| ment of pcre2_compile(). |
| |
| A compile context is created, copied, and freed by the following func- |
| tions: |
| |
| pcre2_compile_context *pcre2_compile_context_create( |
| pcre2_general_context *gcontext); |
| |
| pcre2_compile_context *pcre2_compile_context_copy( |
| pcre2_compile_context *ccontext); |
| |
| void pcre2_compile_context_free(pcre2_compile_context *ccontext); |
| |
| A compile context is created with default values for its parameters. |
| These can be changed by calling the following functions, which return 0 |
| on success, or PCRE2_ERROR_BADDATA if invalid data is detected. |
| |
| int pcre2_set_bsr(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only |
| CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any |
| Unicode line ending sequence. The value is used by the JIT compiler and |
| by the two interpreted matching functions, pcre2_match() and |
| pcre2_dfa_match(). |
| |
| int pcre2_set_character_tables(pcre2_compile_context *ccontext, |
| const uint8_t *tables); |
| |
| The value must be the result of a call to pcre2_maketables(), whose |
| only argument is a general context. This function builds a set of char- |
| acter tables in the current locale. |
| |
| int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext, |
| uint32_t extra_options); |
| |
| As PCRE2 has developed, almost all the 32 option bits that are avail- |
| able in the options argument of pcre2_compile() have been used up. To |
| avoid running out, the compile context contains a set of extra option |
| bits which are used for some newer, assumed rarer, options. This func- |
| tion sets those bits. It always sets all the bits (either on or off), |
| replacing the previous value as a whole; it cannot be used to modify |
| an individual existing setting. The available options are defined in |
| the section entitled "Extra compile options" below. |
| |
| int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext, |
| PCRE2_SIZE value); |
| |
| This sets a maximum length, in code units, for any pattern string that |
| is compiled with this context. If the pattern is longer, an error is |
| generated. This facility is provided so that applications that accept |
| patterns from external sources can limit their size. The default is the |
| largest number that a PCRE2_SIZE variable can hold, which is effec- |
| tively unlimited. |
| |
| int pcre2_set_newline(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| This specifies which characters or character sequences are to be recog- |
| nized as newlines. The value must be one of PCRE2_NEWLINE_CR (carriage |
| return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the |
| two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any |
| of the above), PCRE2_NEWLINE_ANY (any Unicode newline sequence), or |
| PCRE2_NEWLINE_NUL (the NUL character, that is a binary zero). |
| |
| A pattern can override the value set in the compile context by starting |
| with a sequence such as (*CRLF). See the pcre2pattern page for details. |
| |
| When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EX- |
| TENDED_MORE option, the newline convention affects the recognition of |
| the end of internal comments starting with #. The value is saved with |
| the compiled pattern for subsequent use by the JIT compiler and by the |
| two interpreted matching functions, pcre2_match() and |
| pcre2_dfa_match(). |
| |
| int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| This parameter adjusts the limit, set when PCRE2 is built (default |
| 250), on the depth of parenthesis nesting in a pattern. This limit |
| stops rogue patterns using up too much system stack when being com- |
| piled. The limit applies to parentheses of all kinds, not just captur- |
| ing parentheses. |
| |
| int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext, |
| int (*guard_function)(uint32_t, void *), void *user_data); |
| |
| There is at least one application that runs PCRE2 in threads with very |
| limited system stack, where running out of stack is to be avoided at |
| all costs. The parenthesis limit above cannot take account of how much |
| stack is actually available during compilation. For finer control, |
| you can supply a function that is called whenever pcre2_compile() |
| starts to compile a parenthesized part of a pattern. This function can |
| check the actual stack size (or anything else that it wants to, of |
| course). |
| |
| The first argument to the guard function gives the current depth of |
| nesting, and the second is user data that is set up by the last argu- |
| ment of pcre2_set_compile_recursion_guard(). The guard function should |
| return zero if all is well, or non-zero to force an error. |
| |
| The match context |
| |
| A match context is required if you want to: |
| |
| Set up a callout function |
| Set an offset limit for matching an unanchored pattern |
| Change the limit on the amount of heap used when matching |
| Change the backtracking match limit |
| Change the backtracking depth limit |
| Set custom memory management specifically for the match |
| |
| If none of these apply, just pass NULL as the context argument of |
| pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match(). |
| |
| A match context is created, copied, and freed by the following func- |
| tions: |
| |
| pcre2_match_context *pcre2_match_context_create( |
| pcre2_general_context *gcontext); |
| |
| pcre2_match_context *pcre2_match_context_copy( |
| pcre2_match_context *mcontext); |
| |
| void pcre2_match_context_free(pcre2_match_context *mcontext); |
| |
| A match context is created with default values for its parameters. |
| These can be changed by calling the following functions, which return 0 |
| on success, or PCRE2_ERROR_BADDATA if invalid data is detected. |
| |
| int pcre2_set_callout(pcre2_match_context *mcontext, |
| int (*callout_function)(pcre2_callout_block *, void *), |
| void *callout_data); |
| |
| This sets up a callout function for PCRE2 to call at specified points |
| during a matching operation. Details are given in the pcre2callout doc- |
| umentation. |
| |
| int pcre2_set_substitute_callout(pcre2_match_context *mcontext, |
| int (*callout_function)(pcre2_substitute_callout_block *, void *), |
| void *callout_data); |
| |
| This sets up a callout function for PCRE2 to call after each substitu- |
| tion made by pcre2_substitute(). Details are given in the section enti- |
| tled "Creating a new string with substitutions" below. |
| |
| int pcre2_set_offset_limit(pcre2_match_context *mcontext, |
| PCRE2_SIZE value); |
| |
| The offset_limit parameter limits how far an unanchored search can ad- |
| vance in the subject string. The default value is PCRE2_UNSET. The |
| pcre2_match() and pcre2_dfa_match() functions return PCRE2_ERROR_NO- |
| MATCH if a match with a starting point before or at the given offset is |
| not found. The pcre2_substitute() function makes no more substitutions. |
| |
| For example, if the pattern /abc/ is matched against "123abc" with an |
| offset limit less than 3, the result is PCRE2_ERROR_NOMATCH. A match |
| can never be found if the startoffset argument of pcre2_match(), |
| pcre2_dfa_match(), or pcre2_substitute() is greater than the offset |
| limit set in the match context. |
| |
| When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT op- |
| tion when calling pcre2_compile() so that when JIT is in use, different |
| code can be compiled. If a match is started with a non-default offset |
| limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is generated. |
| |
| The offset limit facility can be used to track progress when searching |
| large subject strings or to limit the extent of global substitutions. |
| See also the PCRE2_FIRSTLINE option, which requires a match to start |
| before or at the first newline that follows the start of matching in |
| the subject. If this is set with an offset limit, a match must occur in |
| the first line and also within the offset limit. In other words, which- |
| ever limit comes first is used. |
| |
| int pcre2_set_heap_limit(pcre2_match_context *mcontext, |
| uint32_t value); |
| |
| The heap_limit parameter specifies, in units of kibibytes (1024 bytes), |
| the maximum amount of heap memory that pcre2_match() may use to hold |
| backtracking information when running an interpretive match. This limit |
| also applies to pcre2_dfa_match(), which may use the heap when process- |
| ing patterns with a lot of nested pattern recursion or lookarounds or |
| atomic groups. This limit does not apply to matching with the JIT opti- |
| mization, which has its own memory control arrangements (see the |
| pcre2jit documentation for more details). If the limit is reached, the |
| negative error code PCRE2_ERROR_HEAPLIMIT is returned. The default |
| limit can be set when PCRE2 is built; if it is not, the default is set |
| very large and is essentially unlimited. |
| |
| A value for the heap limit may also be supplied by an item at the start |
| of a pattern of the form |
| |
| (*LIMIT_HEAP=ddd) |
| |
| where ddd is a decimal number. However, such a setting is ignored un- |
| less ddd is less than the limit set by the caller of pcre2_match() or, |
| if no such limit is set, less than the default. |
| |
| The pcre2_match() function always needs some heap memory, so setting a |
| value of zero guarantees a "heap limit exceeded" error. Details of how |
| pcre2_match() uses the heap are given in the pcre2perform documenta- |
| tion. |
| |
| For pcre2_dfa_match(), a vector on the system stack is used when pro- |
| cessing pattern recursions, lookarounds, or atomic groups, and only if |
| this is not big enough is heap memory used. In this case, setting a |
| value of zero disables the use of the heap. |
| |
| int pcre2_set_match_limit(pcre2_match_context *mcontext, |
| uint32_t value); |
| |
| The match_limit parameter provides a means of preventing PCRE2 from us- |
| ing up too many computing resources when processing patterns that are |
| not going to match, but which have a very large number of possibilities |
| in their search trees. The classic example is a pattern that uses |
| nested unlimited repeats. |
| |
| There is an internal counter in pcre2_match() that is incremented each |
| time round its main matching loop. If this value reaches the match |
| limit, pcre2_match() returns the negative value PCRE2_ERROR_MATCHLIMIT. |
| This has the effect of limiting the amount of backtracking that can |
| take place. For patterns that are not anchored, the count restarts from |
| zero for each position in the subject string. This limit also applies |
| to pcre2_dfa_match(), though the counting is done in a different way. |
| |
| When pcre2_match() is called with a pattern that was successfully pro- |
| cessed by pcre2_jit_compile(), the way in which matching is executed is |
| entirely different. However, there is still the possibility of runaway |
| matching that goes on for a very long time, and so the match_limit |
| value is also used in this case (but in a different way) to limit how |
| long the matching can continue. |
| |
| The default value for the limit can be set when PCRE2 is built; if it |
| is not, the default is 10 million, which handles all but the most ex- |
| treme cases. A value for the match limit may also be supplied by an |
| item at the start of a pattern of the form |
| |
| (*LIMIT_MATCH=ddd) |
| |
| where ddd is a decimal number. However, such a setting is ignored un- |
| less ddd is less than the limit set by the caller of pcre2_match() or |
| pcre2_dfa_match() or, if no such limit is set, less than the default. |
| |
| int pcre2_set_depth_limit(pcre2_match_context *mcontext, |
| uint32_t value); |
| |
| This parameter limits the depth of nested backtracking in |
| pcre2_match(). Each time a nested backtracking point is passed, a new |
| memory frame is used to remember the state of matching at that point. |
| Thus, this parameter indirectly limits the amount of memory that is |
| used in a match. However, because the size of each memory frame depends |
| on the number of capturing parentheses, the actual memory limit varies |
| from pattern to pattern. This limit was more useful in versions before |
| 10.30, where function recursion was used for backtracking. |
| |
| The depth limit is not relevant, and is ignored, when matching is done |
| using JIT compiled code. However, it is supported by pcre2_dfa_match(), |
| which uses it to limit the depth of nested internal recursive function |
| calls that implement atomic groups, lookaround assertions, and pattern |
| recursions. This limits, indirectly, the amount of system stack that is |
| used. It was more useful in versions before 10.32, when stack memory |
| was used for local workspace vectors for recursive function calls. From |
| version 10.32, only local variables are allocated on the stack and as |
| each call uses only a few hundred bytes, even a small stack can support |
| quite a lot of recursion. |
| |
| If the depth of internal recursive function calls is great enough, lo- |
| cal workspace vectors are allocated on the heap from version 10.32 on- |
| wards, so the depth limit also indirectly limits the amount of heap |
| memory that is used. A recursive pattern such as /(.(?2))((?1)|)/, when |
| matched to a very long string using pcre2_dfa_match(), can use a great |
| deal of memory. However, it is probably better to limit heap usage di- |
| rectly by calling pcre2_set_heap_limit(). |
| |
| The default value for the depth limit can be set when PCRE2 is built; |
| if it is not, the default is set to the same value as the default for |
| the match limit. If the limit is exceeded, pcre2_match() or |
| pcre2_dfa_match() returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth |
| limit may also be supplied by an item at the start of a pattern of the |
| form |
| |
| (*LIMIT_DEPTH=ddd) |
| |
| where ddd is a decimal number. However, such a setting is ignored un- |
| less ddd is less than the limit set by the caller of pcre2_match() or |
| pcre2_dfa_match() or, if no such limit is set, less than the default. |
| |
| |
| CHECKING BUILD-TIME OPTIONS |
| |
| int pcre2_config(uint32_t what, void *where); |
| |
| The function pcre2_config() makes it possible for a PCRE2 client to |
| find the value of certain configuration parameters and to discover |
| which optional features have been compiled into the PCRE2 library. The |
| pcre2build documentation has more details about these features. |
| |
| The first argument for pcre2_config() specifies which information is |
| required. The second argument is a pointer to memory into which the in- |
| formation is placed. If NULL is passed, the function returns the amount |
| of memory that is needed for the requested information. For calls that |
| return numerical values, the value is in bytes; when requesting these |
| values, where should point to appropriately aligned memory. For calls |
| that return strings, the required length is given in code units, not |
| counting the terminating zero. |
| |
| When requesting information, the returned value from pcre2_config() is |
| non-negative on success, or the negative error code PCRE2_ERROR_BADOP- |
| TION if the value in the first argument is not recognized. The follow- |
| ing information is available: |
| |
| PCRE2_CONFIG_BSR |
| |
| The output is a uint32_t integer whose value indicates what character |
| sequences the \R escape sequence matches by default. A value of |
| PCRE2_BSR_UNICODE means that \R matches any Unicode line ending se- |
| quence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, |
| or CRLF. The default can be overridden when a pattern is compiled. |
| |
| PCRE2_CONFIG_COMPILED_WIDTHS |
| |
| The output is a uint32_t integer whose lower bits indicate which code |
| unit widths were selected when PCRE2 was built. The 1-bit indicates |
| 8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup- |
| port, respectively. |
| |
| PCRE2_CONFIG_DEPTHLIMIT |
| |
| The output is a uint32_t integer that gives the default limit for the |
| depth of nested backtracking in pcre2_match() or the depth of nested |
| recursions, lookarounds, and atomic groups in pcre2_dfa_match(). Fur- |
| ther details are given with pcre2_set_depth_limit() above. |
| |
| PCRE2_CONFIG_HEAPLIMIT |
| |
| The output is a uint32_t integer that gives, in kibibytes, the default |
| limit for the amount of heap memory used by pcre2_match() or |
| pcre2_dfa_match(). Further details are given with |
| pcre2_set_heap_limit() above. |
| |
| PCRE2_CONFIG_JIT |
| |
| The output is a uint32_t integer that is set to one if support for |
| just-in-time compiling is available; otherwise it is set to zero. |
| |
| PCRE2_CONFIG_JITTARGET |
| |
| The where argument should point to a buffer that is at least 48 code |
| units long. (The exact length required can be found by calling |
| pcre2_config() with where set to NULL.) The buffer is filled with a |
| string that contains the name of the architecture for which the JIT |
| compiler is configured, for example "x86 32bit (little endian + un- |
| aligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION is |
| returned, otherwise the number of code units used is returned. This is |
| the length of the string, plus one unit for the terminating zero. |
| |
| PCRE2_CONFIG_LINKSIZE |
| |
| The output is a uint32_t integer that contains the number of bytes used |
| for internal linkage in compiled regular expressions. When PCRE2 is |
| configured, the value can be set to 2, 3, or 4, with the default being |
| 2. This is the value that is returned by pcre2_config(). However, when |
| the 16-bit library is compiled, a value of 3 is rounded up to 4, and |
| when the 32-bit library is compiled, internal linkages always use 4 |
| bytes, so the configured value is not relevant. |
| |
| The default value of 2 for the 8-bit and 16-bit libraries is sufficient |
| for all but the most massive patterns, since it allows the size of the |
| compiled pattern to be up to 65535 code units. Larger values allow |
| larger regular expressions to be compiled by those two libraries, but |
| at the expense of slower matching. |
| |
| PCRE2_CONFIG_MATCHLIMIT |
| |
| The output is a uint32_t integer that gives the default match limit for |
| pcre2_match(). Further details are given with pcre2_set_match_limit() |
| above. |
| |
| PCRE2_CONFIG_NEWLINE |
| |
| The output is a uint32_t integer whose value specifies the default |
| character sequence that is recognized as meaning "newline". The values |
| are: |
| |
| PCRE2_NEWLINE_CR Carriage return (CR) |
| PCRE2_NEWLINE_LF Linefeed (LF) |
| PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF) |
| PCRE2_NEWLINE_ANY Any Unicode line ending |
| PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF |
| PCRE2_NEWLINE_NUL The NUL character (binary zero) |
| |
| The default should normally correspond to the standard sequence for |
| your operating system. |
| |
| PCRE2_CONFIG_NEVER_BACKSLASH_C |
| |
| The output is a uint32_t integer that is set to one if the use of \C |
| was permanently disabled when PCRE2 was built; otherwise it is set to |
| zero. |
| |
| PCRE2_CONFIG_PARENSLIMIT |
| |
| The output is a uint32_t integer that gives the maximum depth of nest- |
| ing of parentheses (of any kind) in a pattern. This limit is imposed to |
| cap the amount of system stack used when a pattern is compiled. It is |
| specified when PCRE2 is built; the default is 250. This limit does not |
| take into account the stack that may already be used by the calling ap- |
| plication. For finer control over compilation stack usage, see |
| pcre2_set_compile_recursion_guard(). |
| |
| PCRE2_CONFIG_STACKRECURSE |
| |
| This parameter is obsolete and should not be used in new code. The out- |
| put is a uint32_t integer that is always set to zero. |
| |
| PCRE2_CONFIG_TABLES_LENGTH |
| |
| The output is a uint32_t integer that gives the length of PCRE2's char- |
| acter processing tables in bytes. For details of these tables see the |
| section on locale support below. |
| |
| PCRE2_CONFIG_UNICODE_VERSION |
| |
| The where argument should point to a buffer that is at least 24 code |
| units long. (The exact length required can be found by calling |
| pcre2_config() with where set to NULL.) If PCRE2 has been compiled |
| without Unicode support, the buffer is filled with the text "Unicode |
| not supported". Otherwise, the Unicode version string (for example, |
| "8.0.0") is inserted. The number of code units used is returned. This |
| is the length of the string plus one unit for the terminating zero. |
| |
| PCRE2_CONFIG_UNICODE |
| |
| The output is a uint32_t integer that is set to one if Unicode support |
| is available; otherwise it is set to zero. Unicode support implies UTF |
| support. |
| |
| PCRE2_CONFIG_VERSION |
| |
| The where argument should point to a buffer that is at least 24 code |
| units long. (The exact length required can be found by calling |
| pcre2_config() with where set to NULL.) The buffer is filled with the |
| PCRE2 version string, zero-terminated. The number of code units used is |
| returned. This is the length of the string plus one unit for the termi- |
| nating zero. |
| |
| |
| COMPILING A PATTERN |
| |
| pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length, |
| uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset, |
| pcre2_compile_context *ccontext); |
| |
| void pcre2_code_free(pcre2_code *code); |
| |
| pcre2_code *pcre2_code_copy(const pcre2_code *code); |
| |
| pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code); |
| |
| The pcre2_compile() function compiles a pattern into an internal form. |
| The pattern is defined by a pointer to a string of code units and a |
| length (in code units). If the pattern is zero-terminated, the length |
| can be specified as PCRE2_ZERO_TERMINATED. The function returns a |
| pointer to a block of memory that contains the compiled pattern and re- |
| lated data, or NULL if an error occurred. |
| |
| If the compile context argument ccontext is NULL, memory for the com- |
| piled pattern is obtained by calling malloc(). Otherwise, it is ob- |
| tained from the same memory function that was used for the compile con- |
| text. The caller must free the memory by calling pcre2_code_free() when |
| it is no longer needed. If pcre2_code_free() is called with a NULL ar- |
| gument, it returns immediately, without doing anything. |
| |
| The function pcre2_code_copy() makes a copy of the compiled code in new |
| memory, using the same memory allocator as was used for the original. |
| However, if the code has been processed by the JIT compiler (see be- |
| low), the JIT information cannot be copied (because it is position-de- |
| pendent). The new copy can initially be used only for non-JIT match- |
| ing, though it can be passed to pcre2_jit_compile() if required. If |
| pcre2_code_copy() is called with a NULL argument, it returns NULL. |
| |
| The pcre2_code_copy() function provides a way for individual threads in |
| a multithreaded application to acquire a private copy of shared com- |
| piled code. However, it does not make a copy of the character tables |
| used by the compiled pattern; the new pattern code points to the same |
| tables as the original code. (See "Locale Support" below for details |
| of these character tables.) In many applications the same tables are |
| used throughout, so this behaviour is appropriate. Nevertheless, there |
| are occasions when copies of both a compiled pattern and the relevant |
| tables are needed. The pcre2_code_copy_with_tables() function provides |
| this facility. |
| Copies of both the code and the tables are made, with the new code |
| pointing to the new tables. The memory for the new tables is automati- |
| cally freed when pcre2_code_free() is called for the new copy of the |
| compiled code. If pcre2_code_copy_with_tables() is called with a NULL |
| argument, it returns NULL. |
| |
| NOTE: When one of the matching functions is called, pointers to the |
| compiled pattern and the subject string are set in the match data block |
| so that they can be referenced by the substring extraction functions |
| after a successful match. After running a match, you must not free a |
| compiled pattern or a subject string until after all operations on the |
| match data block have taken place, unless, in the case of the subject |
| string, you have used the PCRE2_COPY_MATCHED_SUBJECT option, which is |
| described in the section entitled "Option bits for pcre2_match()" be- |
| low. |
| |
| The options argument for pcre2_compile() contains various bit settings |
| that affect the compilation. It should be zero if none of them are re- |
| quired. The available options are described below. Some of them (in |
| particular, those that are compatible with Perl, but some others as |
| well) can also be set and unset from within the pattern (see the de- |
| tailed description in the pcre2pattern documentation). |
| |
| For those options that can be different in different parts of the pat- |
| tern, the contents of the options argument specify their settings at |
| the start of compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and |
| PCRE2_NO_UTF_CHECK options can be set at the time of matching as well |
| as at compile time. |
| |
| Some additional options and less frequently required compile-time pa- |
| rameters (for example, the newline setting) can be provided in a com- |
| pile context (as described above). |
| |
| If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme- |
| diately. Otherwise, the variables to which these point are set to an |
| error code and an offset (number of code units) within the pattern, re- |
| spectively, when pcre2_compile() returns NULL because a compilation er- |
| ror has occurred. |
| |
| There are nearly 100 positive error codes that pcre2_compile() may re- |
| turn if it finds an error in the pattern. There are also some negative |
| error codes that are used for invalid UTF strings when validity check- |
| ing is in force. These are the same as given by pcre2_match() and |
| pcre2_dfa_match(), and are described in the pcre2unicode documentation. |
| There is no separate documentation for the positive error codes, be- |
| cause the textual error messages that are obtained by calling the |
| pcre2_get_error_message() function (see "Obtaining a textual error mes- |
| sage" below) should be self-explanatory. Macro names starting with |
| PCRE2_ERROR_ are defined for both positive and negative error codes in |
| pcre2.h. When compilation is successful, errorcode is set to a value |
| that returns the message "no error" if passed to pcre2_get_error_mes- |
| sage(). |
| |
| The value returned in erroroffset is an indication of where in the pat- |
| tern an error occurred. When there is no error, zero is returned. A |
| non-zero value is not necessarily the furthest point in the pattern |
| that was read. For example, after the error "lookbehind assertion is |
| not fixed length", the error offset points to the start of the failing |
| assertion. For an invalid UTF-8 or UTF-16 string, the offset is that of |
| the first code unit of the failing character. |
| |
| Some errors are not detected until the whole pattern has been scanned; |
| in these cases, the offset passed back is the length of the pattern. |
| Note that the offset is in code units, not characters, even in a UTF |
| mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char- |
| acter. |
| |
| This code fragment shows a typical straightforward call to pcre2_com- |
| pile(): |
| |
| pcre2_code *re; |
| PCRE2_SIZE erroffset; |
| int errorcode; |
| re = pcre2_compile( |
| (PCRE2_SPTR)"^A.*Z", /* the pattern */ |
| PCRE2_ZERO_TERMINATED, /* the pattern is zero-terminated */ |
| 0, /* default options */ |
| &errorcode, /* for error code */ |
| &erroffset, /* for error offset */ |
| NULL); /* no compile context */ |
| |
| |
| Main compile options |
| |
| The following names for option bits are defined in the pcre2.h header |
| file: |
| |
| PCRE2_ANCHORED |
| |
| If this bit is set, the pattern is forced to be "anchored", that is, it |
| is constrained to match only at the first matching point in the string |
| that is being searched (the "subject string"). This effect can also be |
| achieved by appropriate constructs in the pattern itself, which is the |
| only way to do it in Perl. |
| |
| PCRE2_ALLOW_EMPTY_CLASS |
| |
| By default, for compatibility with Perl, a closing square bracket that |
| immediately follows an opening one is treated as a data character for |
| the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the |
| class, which therefore contains no characters and so can never match. |
| |
| PCRE2_ALT_BSUX |
| |
| This option requests alternative handling of three escape sequences, |
| which makes PCRE2's behaviour more like ECMAScript (aka JavaScript). |
| When it is set: |
| |
| (1) \U matches an upper case "U" character; by default \U causes a com- |
| pile time error (Perl uses \U to upper case subsequent characters). |
| |
| (2) \u matches a lower case "u" character unless it is followed by four |
| hexadecimal digits, in which case the hexadecimal number defines the |
| code point to match. By default, \u causes a compile time error (Perl |
| uses it to upper case the following character). |
| |
| (3) \x matches a lower case "x" character unless it is followed by two |
| hexadecimal digits, in which case the hexadecimal number defines the |
| code point to match. By default, as in Perl, a hexadecimal number is |
| always expected after \x, but it may have zero, one, or two digits (so, |
| for example, \xz matches a binary zero character followed by z). |
| |
| ECMAScript 6 added further functionality to \u. This can be accessed |
| using the PCRE2_EXTRA_ALT_BSUX extra option (see "Extra compile op- |
| tions" below). Note that this alternative escape handling applies only |
| to patterns. Neither of these options affects the processing of re- |
| placement strings passed to pcre2_substitute(). |
| |
| PCRE2_ALT_CIRCUMFLEX |
| |
| In multiline mode (when PCRE2_MULTILINE is set), the circumflex |
| metacharacter matches at the start of the subject (unless PCRE2_NOTBOL |
| is set), and also after any internal newline. However, it does not |
| match after a newline at the end of the subject, for compatibility with |
| Perl. If you want a multiline circumflex also to match after a termi- |
| nating newline, you must set PCRE2_ALT_CIRCUMFLEX. |
| |
| PCRE2_ALT_VERBNAMES |
| |
| By default, for compatibility with Perl, the name in any verb sequence |
| such as (*MARK:NAME) is any sequence of characters that does not in- |
| clude a closing parenthesis. The name is not processed in any way, and |
| it is not possible to include a closing parenthesis in the name. How- |
| ever, if the PCRE2_ALT_VERBNAMES option is set, normal backslash pro- |
| cessing is applied to verb names and only an unescaped closing paren- |
| thesis terminates the name. A closing parenthesis can be included in a |
| name either as \) or between \Q and \E. If the PCRE2_EXTENDED or |
| PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped |
| whitespace in verb names is skipped and #-comments are recognized, ex- |
| actly as in the rest of the pattern. |
| |
| PCRE2_AUTO_CALLOUT |
| |
| If this bit is set, pcre2_compile() automatically inserts callout |
| items, all with number 255, before each pattern item, except immedi- |
| ately before or after an explicit callout in the pattern. For discus- |
| sion of the callout facility, see the pcre2callout documentation. |
| |
| PCRE2_CASELESS |
| |
| If this bit is set, letters in the pattern match both upper and lower |
| case letters in the subject. It is equivalent to Perl's /i option, and |
| it can be changed within a pattern by a (?i) option setting. If either |
| PCRE2_UTF or PCRE2_UCP is set, Unicode properties are used for all |
| characters with more than one other case, and for all characters whose |
| code points are greater than U+007F. Note that there are two ASCII |
| characters, K and S, that, in addition to their lower case ASCII equiv- |
| alents, are case-equivalent with U+212A (Kelvin sign) and U+017F (long |
| S) respectively. For lower valued characters with only one other case, |
| a lookup table is used for speed. When neither PCRE2_UTF nor PCRE2_UCP |
| is set, a lookup table is used for all code points less than 256, and |
| higher code points (available only in 16-bit or 32-bit mode) are |
| treated as not having another case. |
| |
| PCRE2_DOLLAR_ENDONLY |
| |
| If this bit is set, a dollar metacharacter in the pattern matches only |
| at the end of the subject string. Without this option, a dollar also |
| matches immediately before a newline at the end of the string (but not |
| before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored |
| if PCRE2_MULTILINE is set. There is no equivalent to this option in |
| Perl, and no way to set it within a pattern. |
| |
| PCRE2_DOTALL |
| |
| If this bit is set, a dot metacharacter in the pattern matches any |
| character, including one that indicates a newline. However, it only |
| ever matches one character, even if newlines are coded as CRLF. Without |
| this option, a dot does not match when the current position in the sub- |
| ject is at a newline. This option is equivalent to Perl's /s option, |
| and it can be changed within a pattern by a (?s) option setting. A neg- |
| ative class such as [^a] always matches newline characters, and the \N |
| escape sequence always matches a non-newline character, independent of |
| the setting of PCRE2_DOTALL. |
| |
| PCRE2_DUPNAMES |
| |
| If this bit is set, names used to identify capture groups need not be |
| unique. This can be helpful for certain types of pattern when it is |
| known that only one instance of the named group can ever be matched. |
| There are more details of named capture groups below; see also the |
| pcre2pattern documentation. |
| |
| PCRE2_ENDANCHORED |
| |
| If this bit is set, the end of any pattern match must be right at the |
| end of the string being searched (the "subject string"). If the pattern |
| match succeeds by reaching (*ACCEPT), but does not reach the end of the |
| subject, the match fails at the current starting point. For unanchored |
| patterns, a new match is then tried at the next starting point. How- |
| ever, if the match succeeds by reaching the end of the pattern, but not |
| the end of the subject, backtracking occurs and an alternative match |
| may be found. Consider these two patterns: |
| |
| .(*ACCEPT)|.. |
| .|.. |
| |
| If matched against "abc" with PCRE2_ENDANCHORED set, the first matches |
| "c" whereas the second matches "bc". The effect of PCRE2_ENDANCHORED |
| can also be achieved by appropriate constructs in the pattern itself, |
| which is the only way to do it in Perl. |
| |
| For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies only |
| to the first (that is, the longest) matched string. Other parallel |
| matches, which are necessarily substrings of the first one, must obvi- |
| ously end before the end of the subject. |
| |
| PCRE2_EXTENDED |
| |
| If this bit is set, most white space characters in the pattern are to- |
| tally ignored except when escaped or inside a character class. However, |
| white space is not allowed within sequences such as (?> that introduce |
| various parenthesized groups, nor within numerical quantifiers such as |
| {1,3}. Ignorable white space is permitted between an item and a follow- |
| ing quantifier and between a quantifier and a following + that indi- |
| cates possessiveness. PCRE2_EXTENDED is equivalent to Perl's /x option, |
| and it can be changed within a pattern by a (?x) option setting. |
| |
| When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recog- |
| nizes as white space only those characters with code points less than |
| 256 that are flagged as white space in its low-character table. The ta- |
| ble is normally created by pcre2_maketables(), which uses the isspace() |
| function to identify space characters. In most ASCII environments, the |
| relevant characters are those with code points 0x0009 (tab), 0x000A |
| (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D (carriage |
| return), and 0x0020 (space). |
| |
| When PCRE2 is compiled with Unicode support, in addition to these char- |
| acters, five more Unicode "Pattern White Space" characters are recog- |
| nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to- |
| right mark), U+200F (right-to-left mark), U+2028 (line separator), and |
| U+2029 (paragraph separator). This set of characters is the same as |
| recognized by Perl's /x option. Note that the horizontal and vertical |
| space characters that are matched by the \h and \v escapes in patterns |
| are a much bigger set. |
| |
| As well as ignoring most white space, PCRE2_EXTENDED also causes char- |
| acters between an unescaped # outside a character class and the next |
| newline, inclusive, to be ignored, which makes it possible to include |
| comments inside complicated patterns. Note that the end of this type of |
| comment is a literal newline sequence in the pattern; escape sequences |
| that happen to represent a newline do not count. |
| |
| Which characters are interpreted as newlines can be specified by a set- |
| ting in the compile context that is passed to pcre2_compile() or by a |
| special sequence at the start of the pattern, as described in the sec- |
| tion entitled "Newline conventions" in the pcre2pattern documentation. |
| A default is defined when PCRE2 is built. |
| |
| PCRE2_EXTENDED_MORE |
| |
| This option has the effect of PCRE2_EXTENDED, but, in addition, un- |
| escaped space and horizontal tab characters are ignored inside a char- |
| acter class. Note: only these two characters are ignored, not the full |
| set of pattern white space characters that are ignored outside a char- |
| acter class. PCRE2_EXTENDED_MORE is equivalent to Perl's /xx option, |
| and it can be changed within a pattern by a (?xx) option setting. |
| |
| PCRE2_FIRSTLINE |
| |
| If this option is set, the start of an unanchored pattern match must be |
| before or at the first newline in the subject string following the |
| start of matching, though the matched text may continue over the new- |
| line. If startoffset is non-zero, the limiting newline is not necessar- |
| ily the first newline in the subject. For example, if the subject |
| string is "abc\nxyz" (where \n represents a single-character newline) a |
| pattern match for "yz" succeeds with PCRE2_FIRSTLINE if startoffset is |
| greater than 3. See also PCRE2_USE_OFFSET_LIMIT, which provides a more |
| general limiting facility. If PCRE2_FIRSTLINE is set with an offset |
| limit, a match must occur in the first line and also within the offset |
| limit. In other words, whichever limit comes first is used. |
| |
| PCRE2_LITERAL |
| |
| If this option is set, all meta-characters in the pattern are disabled, |
| and it is treated as a literal string. Matching literal strings with a |
| regular expression engine is not the most efficient way of doing it. If |
| you are doing a lot of literal matching and are worried about effi- |
| ciency, you should consider using other approaches. The only other main |
| options that are allowed with PCRE2_LITERAL are: PCRE2_ANCHORED, |
| PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE, |
| PCRE2_MATCH_INVALID_UTF, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, |
| PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EX- |
| TRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD are also supported. Any other |
| options cause an error. |
| |
| PCRE2_MATCH_INVALID_UTF |
| |
| This option forces PCRE2_UTF (see below) and also enables support for |
| matching by pcre2_match() in subject strings that contain invalid UTF |
| sequences. This facility is not supported for DFA matching. For de- |
| tails, see the pcre2unicode documentation. |
| |
| PCRE2_MATCH_UNSET_BACKREF |
| |
| If this option is set, a backreference to an unset capture group |
| matches an empty string (by default this causes the current matching |
| alternative to fail). A pattern such as (\1)(a) succeeds when this op- |
| tion is set (assuming it can find an "a" in the subject), whereas it |
| fails by default, for Perl compatibility. Setting this option makes |
| PCRE2 behave more like ECMAScript (aka JavaScript). |
| |
| PCRE2_MULTILINE |
| |
| By default, for the purposes of matching "start of line" and "end of |
| line", PCRE2 treats the subject string as consisting of a single line |
| of characters, even if it actually contains newlines. The "start of |
| line" metacharacter (^) matches only at the start of the string, and |
| the "end of line" metacharacter ($) matches only at the end of the |
| string, or before a terminating newline (except when PCRE2_DOLLAR_EN- |
| DONLY is set). Note, however, that unless PCRE2_DOTALL is set, the "any |
| character" metacharacter (.) does not match at a newline. This behav- |
| iour (for ^, $, and dot) is the same as Perl. |
| |
| When PCRE2_MULTILINE is set, the "start of line" and "end of line" |
| constructs match immediately following or immediately before internal |
| newlines in the subject string, respectively, as well as at the very |
| start and end. This is equivalent to Perl's /m option, and it can be |
| changed within a pattern by a (?m) option setting. Note that the "start |
| of line" metacharacter does not match after a newline at the end of the |
| subject, for compatibility with Perl. However, you can change this by |
| setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a |
| subject string, or no occurrences of ^ or $ in a pattern, setting |
| PCRE2_MULTILINE has no effect. |
| |
| PCRE2_NEVER_BACKSLASH_C |
| |
| This option locks out the use of \C in the pattern that is being com- |
| piled. This escape can cause unpredictable behaviour in UTF-8 or |
| UTF-16 modes, because it may leave the current matching point in the |
| middle of a multi-code-unit character. This option may be useful in ap- |
| plications that process patterns from external sources. Note that there |
| is also a build-time option that permanently locks out the use of \C. |
| |
| PCRE2_NEVER_UCP |
| |
| This option locks out the use of Unicode properties for handling \B, |
| \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as |
| described for the PCRE2_UCP option below. In particular, it prevents |
| the creator of the pattern from enabling this facility by starting the |
| pattern with (*UCP). This option may be useful in applications that |
| process patterns from external sources. The option combination |
| PCRE2_UCP and PCRE2_NEVER_UCP causes an error. |
| |
| PCRE2_NEVER_UTF |
| |
| This option locks out interpretation of the pattern as UTF-8, UTF-16, |
| or UTF-32, depending on which library is in use. In particular, it pre- |
| vents the creator of the pattern from switching to UTF interpretation |
| by starting the pattern with (*UTF). This option may be useful in ap- |
| plications that process patterns from external sources. The combination |
| of PCRE2_UTF and PCRE2_NEVER_UTF causes an error. |
| |
| PCRE2_NO_AUTO_CAPTURE |
| |
| If this option is set, it disables the use of numbered capturing paren- |
| theses in the pattern. Any opening parenthesis that is not followed by |
| ? behaves as if it were followed by ?: but named parentheses can still |
| be used for capturing (and they acquire numbers in the usual way). This |
| is the same as Perl's /n option. Note that, when this option is set, |
| references to capture groups (backreferences or recursion/subroutine |
| calls) may only refer to named groups, though the reference can be by |
| name or by number. |
| |
| PCRE2_NO_AUTO_POSSESS |
| |
| If this option is set, it disables "auto-possessification", which is an |
| optimization that, for example, turns a+b into a++b in order to avoid |
| backtracks into a+ that can never be successful. However, if callouts |
| are in use, auto-possessification means that some callouts are never |
| taken. You can set this option if you want the matching functions to do |
| a full unoptimized search and run all the callouts, but it is mainly |
| provided for testing purposes. |
| |
| PCRE2_NO_DOTSTAR_ANCHOR |
| |
| If this option is set, it disables an optimization that is applied when |
| .* is the first significant item in a top-level branch of a pattern, |
| and all the other branches also start with .* or with \A or \G or ^. |
| The optimization is automatically disabled for .* if it is inside an |
| atomic group or a capture group that is the subject of a backreference, |
| or if the pattern contains (*PRUNE) or (*SKIP). When the optimization |
| is not disabled, such a pattern is automatically anchored if |
| PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set |
| for any ^ items. Otherwise, the fact that any match must start either |
| at the start of the subject or following a newline is remembered. Like |
| other optimizations, this can cause callouts to be skipped. |
| |
| PCRE2_NO_START_OPTIMIZE |
| |
| This is an option whose main effect is at matching time. It does not |
| change what pcre2_compile() generates, but it does affect the output of |
| the JIT compiler. |
| |
| There are a number of optimizations that may occur at the start of a |
| match, in order to speed up the process. For example, if it is known |
| that an unanchored match must start with a specific code unit value, |
| the matching code searches the subject for that value, and fails imme- |
| diately if it cannot find it, without actually running the main match- |
| ing function. This means that a special item such as (*COMMIT) at the |
| start of a pattern is not considered until after a suitable starting |
| point for the match has been found. Also, when callouts or (*MARK) |
| items are in use, these "start-up" optimizations can cause them to be |
| skipped if the pattern is never actually used. The start-up optimiza- |
| tions are in effect a pre-scan of the subject that takes place before |
| the pattern is run. |
| |
| The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations, |
| possibly causing performance to suffer, but ensuring that in cases |
| where the result is "no match", the callouts do occur, and that items |
| such as (*COMMIT) and (*MARK) are considered at every possible starting |
| position in the subject string. |
| |
| Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching |
| operation. Consider the pattern |
| |
| (*COMMIT)ABC |
| |
| When this is compiled, PCRE2 records the fact that a match must start |
| with the character "A". Suppose the subject string is "DEFABC". The |
| start-up optimization scans along the subject, finds "A" and runs the |
| first match attempt from there. The (*COMMIT) item means that the pat- |
| tern must match the current starting position, which in this case, it |
| does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE |
| set, the initial scan along the subject string does not happen. The |
| first match attempt is run starting from "D" and when this fails, |
| (*COMMIT) prevents any further matches being tried, so the overall re- |
| sult is "no match". |
| |
| Another start-up optimization makes use of a minimum length for a |
| matching subject, which is recorded when possible. Consider the pattern |
| |
| (*MARK:1)B(*MARK:2)(X|Y) |
| |
| The minimum length for a match is two characters. If the subject is |
| "XXBB", the "starting character" optimization skips "XX", then tries to |
| match "BB", which is long enough. In the process, (*MARK:2) is encoun- |
| tered and remembered. When the match attempt fails, the next "B" is |
| found, but there is only one character left, so there are no more at- |
| tempts, and "no match" is returned with the "last mark seen" set to |
| "2". If PCRE2_NO_START_OPTIMIZE is set, however, matches are tried at |
| every possible starting position, including at the end of the subject, |
| where (*MARK:1) is encountered, but there is no "B", so the "last mark |
| seen" that is returned is "1". In this case, the optimizations do not |
| affect the overall match result, which is still "no match", but they do |
| affect the auxiliary information that is returned. |
| |
| PCRE2_NO_UTF_CHECK |
| |
| When PCRE2_UTF is set, the validity of the pattern as a UTF string is |
| automatically checked. There are discussions about the validity of |
| UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode |
| document. If an invalid UTF sequence is found, pcre2_compile() returns |
| a negative error code. |
| |
| If you know that your pattern is a valid UTF string, and you want to |
| skip this check for performance reasons, you can set the |
| PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an in- |
| valid UTF string as a pattern is undefined. It may cause your program |
| to crash or loop. |
| |
| Note that this option can also be passed to pcre2_match() and |
| pcre2_dfa_match(), to suppress UTF validity checking of the subject |
| string. |
| |
| Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis- |
| able the error that is given if an escape sequence for an invalid Uni- |
| code code point is encountered in the pattern. In particular, the so- |
| called "surrogate" code points (0xd800 to 0xdfff) are invalid. If you |
| want to allow escape sequences such as \x{d800} you can set the |
| PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option, as described in the |
| section entitled "Extra compile options" below. However, this is pos- |
| sible only in UTF-8 and UTF-32 modes, because these values are not rep- |
| resentable in UTF-16. |
| |
| PCRE2_UCP |
| |
| This option has two effects. First, it changes the way PCRE2 processes |
| \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character |
| classes. By default, only ASCII characters are recognized, but if |
| PCRE2_UCP is set, Unicode properties are used instead to classify char- |
| acters. More details are given in the section on generic character |
| types in the pcre2pattern page. If you set PCRE2_UCP, matching one of |
| the items it affects takes much longer. |
| |
| The second effect of PCRE2_UCP is to force the use of Unicode proper- |
| ties for upper/lower casing operations on characters with code points |
| greater than 127, even when PCRE2_UTF is not set. This makes it possi- |
| ble, for example, to process strings in the 16-bit UCS-2 code. This op- |
| tion is available only if PCRE2 has been compiled with Unicode support |
| (which is the default). |
| |
| PCRE2_UNGREEDY |
| |
| This option inverts the "greediness" of the quantifiers so that they |
| are not greedy by default, but become greedy if followed by "?". It is |
| not compatible with Perl. It can also be set by a (?U) option setting |
| within the pattern. |
| |
| PCRE2_USE_OFFSET_LIMIT |
| |
| This option must be set for pcre2_compile() if pcre2_set_offset_limit() |
| is going to be used to set a non-default offset limit in a match con- |
| text for matches that use this pattern. An error is generated if an |
| offset limit is set without this option. For more details, see the de- |
| scription of pcre2_set_offset_limit() in the section that describes |
| match contexts. See also the PCRE2_FIRSTLINE option above. |
| |
| PCRE2_UTF |
| |
| This option causes PCRE2 to regard both the pattern and the subject |
| strings that are subsequently processed as strings of UTF characters |
| instead of single-code-unit strings. It is available when PCRE2 is |
| built to include Unicode support (which is the default). If Unicode |
| support is not available, the use of this option provokes an error. De- |
| tails of how PCRE2_UTF changes the behaviour of PCRE2 are given in the |
| pcre2unicode page. In particular, note that it changes the way |
| PCRE2_CASELESS handles characters with code points greater than 127. |
| |
| Extra compile options |
| |
| The option bits that can be set in a compile context by calling the |
| pcre2_set_compile_extra_options() function are as follows: |
| |
| PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK |
| |
| Since release 10.38 PCRE2 has forbidden the use of \K within lookaround |
| assertions, following Perl's lead. This option is provided to re-enable |
| the previous behaviour (act in positive lookarounds, ignore in negative |
| ones) in case anybody is relying on it. |
| |
| PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES |
| |
| This option applies when compiling a pattern in UTF-8 or UTF-32 mode. |
| It is forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode |
| "surrogate" code points in the range 0xd800 to 0xdfff are used in pairs |
| in UTF-16 to encode code points with values in the range 0x10000 to |
| 0x10ffff. The surrogates cannot therefore be represented in UTF-16. |
| They can be represented in UTF-8 and UTF-32, but are defined as invalid |
| code points, and cause errors if encountered in a UTF-8 or UTF-32 |
| string that is being checked for validity by PCRE2. |
| |
| These values also cause errors if encountered in escape sequences such |
| as \x{d912} within a pattern. However, it seems that some applications, |
| when using PCRE2 to check for unwanted characters in UTF-8 strings, ex- |
| plicitly test for the surrogates using escape sequences. The |
| PCRE2_NO_UTF_CHECK option does not disable the error that occurs, be- |
| cause it applies only to the testing of input strings for UTF validity. |
| |
| If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro- |
| gate code point values in UTF-8 and UTF-32 patterns no longer provoke |
| errors and are incorporated in the compiled pattern. However, they can |
| only match subject characters if the matching function is called with |
| PCRE2_NO_UTF_CHECK set. |
| |
| PCRE2_EXTRA_ALT_BSUX |
| |
| The original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u, and |
| \x in the way that ECMAScript (aka JavaScript) does. Additional func- |
| tionality was defined by ECMAScript 6; setting PCRE2_EXTRA_ALT_BSUX has |
| the effect of PCRE2_ALT_BSUX, but in addition it recognizes \u{hhh..} |
| as a hexadecimal character code, where hhh.. is any number of hexadeci- |
| mal digits. |
| |
| PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL |
| |
| This is a dangerous option. Use with care. By default, an unrecognized |
| escape such as \j or a malformed one such as \x{2z} causes a compile- |
| time error when detected by pcre2_compile(). Perl is somewhat inconsis- |
| tent in handling such items: for example, \j is treated as a literal |
| "j", and non-hexadecimal digits in \x{} are just ignored, though warn- |
| ings are given in both cases if Perl's warning switch is enabled. How- |
| ever, a malformed octal number after \o{ always causes an error in |
| Perl. |
| |
| If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to |
| pcre2_compile(), all unrecognized or malformed escape sequences are |
| treated as single-character escapes. For example, \j is a literal "j" |
| and \x{2z} is treated as the literal string "x{2z}". Setting this op- |
| tion means that typos in patterns may go undetected and have unexpected |
| results. Also note that a sequence such as [\N{] is interpreted as a |
| malformed attempt at [\N{...}] and so is treated as [N{] whereas [\N] |
| gives an error because an unqualified \N is a valid escape sequence but |
| is not supported in a character class. To reiterate: this is a danger- |
| ous option. Use with great care. |
| |
| PCRE2_EXTRA_ESCAPED_CR_IS_LF |
| |
| There are some legacy applications where the escape sequence \r in a |
| pattern is expected to match a newline. If this option is set, \r in a |
| pattern is converted to \n so that it matches a LF (linefeed) instead |
| of a CR (carriage return) character. The option does not affect a lit- |
| eral CR in the pattern, nor does it affect CR specified as an explicit |
| code point such as \x{0D}. |
| |
| PCRE2_EXTRA_MATCH_LINE |
| |
| This option is provided for use by the -x option of pcre2grep. It |
| causes the pattern only to match complete lines. This is achieved by |
| automatically inserting the code for "^(?:" at the start of the com- |
| piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set, |
| the matched line may be in the middle of the subject string. This op- |
| tion can be used with PCRE2_LITERAL. |
| |
| PCRE2_EXTRA_MATCH_WORD |
| |
| This option is provided for use by the -w option of pcre2grep. It |
| causes the pattern only to match strings that have a word boundary at |
| the start and the end. This is achieved by automatically inserting the |
| code for "\b(?:" at the start of the compiled pattern and ")\b" at the |
| end. The option may be used with PCRE2_LITERAL. However, it is ignored |
| if PCRE2_EXTRA_MATCH_LINE is also set. |
| |
| |
| JUST-IN-TIME (JIT) COMPILATION |
| |
| int pcre2_jit_compile(pcre2_code *code, uint32_t options); |
| |
| int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext); |
| |
| void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext); |
| |
| pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize, |
| PCRE2_SIZE maxsize, pcre2_general_context *gcontext); |
| |
| void pcre2_jit_stack_assign(pcre2_match_context *mcontext, |
| pcre2_jit_callback callback_function, void *callback_data); |
| |
| void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack); |
| |
| These functions provide support for JIT compilation, which, if the |
| just-in-time compiler is available, further processes a compiled pat- |
| tern into machine code that executes much faster than the pcre2_match() |
| interpretive matching function. Full details are given in the pcre2jit |
| documentation. |
| |
| JIT compilation is a heavyweight optimization. It can take some time |
| for patterns to be analyzed, and for one-off matches and simple pat- |
| terns the benefit of faster execution might be offset by a much slower |
| compilation time. Most (but not all) patterns can be optimized by the |
| JIT compiler. |
| |
| |
| LOCALE SUPPORT |
| |
| const uint8_t *pcre2_maketables(pcre2_general_context *gcontext); |
| |
| void pcre2_maketables_free(pcre2_general_context *gcontext, |
| const uint8_t *tables); |
| |
| PCRE2 handles caseless matching, and determines whether characters are |
| letters, digits, or whatever, by reference to a set of tables, indexed |
| by character code point. However, this applies only to characters whose |
| code points are less than 256. By default, higher-valued code points |
| never match escapes such as \w or \d. |
| |
| When PCRE2 is built with Unicode support (the default), certain Unicode |
| character properties can be tested with \p and \P, or, alternatively, |
| the PCRE2_UCP option can be set when a pattern is compiled; this causes |
| \w and friends to use Unicode property support instead of the built-in |
| tables. PCRE2_UCP also causes upper/lower casing operations on charac- |
| ters with code points greater than 127 to use Unicode properties. These |
| effects apply even when PCRE2_UTF is not set. |
| |
| The use of locales with Unicode is discouraged. If you are handling |
| characters with code points greater than 127, you should either use |
| Unicode support, or use locales, but not try to mix the two. |
| |
| PCRE2 contains a built-in set of character tables that are used by de- |
| fault. These are sufficient for many applications. Normally, the in- |
| ternal tables recognize only ASCII characters. However, when PCRE2 is |
| built, it is possible to cause the internal tables to be rebuilt in the |
| default "C" locale of the local system, which may cause them to be dif- |
| ferent. |
| |
| The built-in tables can be overridden by tables supplied by the appli- |
| cation that calls PCRE2. These may be created in a different locale |
| from the default. As more and more applications change to using Uni- |
| code, the need for this locale support is expected to die away. |
| |
| External tables are built by calling the pcre2_maketables() function, |
| in the relevant locale. The only argument to this function is a general |
| context, which can be used to pass a custom memory allocator. If the |
| argument is NULL, the system malloc() is used. The result can be passed |
| to pcre2_compile() as often as necessary, by creating a compile context |
| and calling pcre2_set_character_tables() to set the tables pointer |
| therein. |
| |
| For example, to build and use tables that are appropriate for the |
| French locale (where accented characters with values greater than 127 |
| are treated as letters), the following code could be used: |
| |
| setlocale(LC_CTYPE, "fr_FR"); |
| tables = pcre2_maketables(NULL); |
| ccontext = pcre2_compile_context_create(NULL); |
| pcre2_set_character_tables(ccontext, tables); |
| re = pcre2_compile(..., ccontext); |
| |
| The locale name "fr_FR" is used on Linux and other Unix-like systems; |
| if you are using Windows, the name for the French locale is "french". |
| |
| The pointer that is passed (via the compile context) to pcre2_compile() |
| is saved with the compiled pattern, and the same tables are used by the |
| matching functions. Thus, for any single pattern, compilation and |
| matching both happen in the same locale, but different patterns can be |
| processed in different locales. |
| |
| It is the caller's responsibility to ensure that the memory containing |
| the tables remains available while they are still in use. When they are |
| no longer needed, you can discard them using pcre2_maketables_free(), |
| passing as its first parameter the same general context that was used |
| to create the tables. |
| |
| Saving locale tables |
| |
| The tables described above are just a sequence of binary bytes, which |
| makes them independent of hardware characteristics such as endianness |
| or whether the processor is 32-bit or 64-bit. A copy of the result of |
| pcre2_maketables() can therefore be saved in a file or elsewhere and |
| re-used later, even in a different program or on another computer. The |
| size of the tables (number of bytes) must be obtained by calling |
| pcre2_config() with the PCRE2_CONFIG_TABLES_LENGTH option because |
| pcre2_maketables() does not return this value. Note that the |
| pcre2_dftables program, which is part of the PCRE2 build system, can be |
| used stand-alone to create a file that contains a set of binary tables. |
| See the pcre2build documentation for details. |
| |
| |
| INFORMATION ABOUT A COMPILED PATTERN |
| |
| int pcre2_pattern_info(const pcre2_code *code, uint32_t what, |
| void *where); |
| |
| The pcre2_pattern_info() function returns general information about a |
| compiled pattern. For information about callouts, see the next section. |
| The first argument for pcre2_pattern_info() is a pointer to the com- |
| piled pattern. The second argument specifies which piece of information |
| is required, and the third argument is a pointer to a variable to re- |
| ceive the data. If the third argument is NULL, the first argument is |
| ignored, and the function returns the size in bytes of the variable |
| that is required for the information requested. Otherwise, the yield of |
| the function is zero for success, or one of the following negative num- |
| bers: |
| |
| PCRE2_ERROR_NULL the argument code was NULL |
| PCRE2_ERROR_BADMAGIC the "magic number" was not found |
| PCRE2_ERROR_BADOPTION the value of what was invalid |
| PCRE2_ERROR_UNSET the requested field is not set |
| |
| The "magic number" is placed at the start of each compiled pattern as a |
| simple check against passing an arbitrary memory pointer. Here is a |
| typical call of pcre2_pattern_info(), to obtain the length of the com- |
| piled pattern: |
| |
| int rc; |
| size_t length; |
| rc = pcre2_pattern_info( |
| re, /* result of pcre2_compile() */ |
| PCRE2_INFO_SIZE, /* what is required */ |
| &length); /* where to put the data */ |
| |
| The possible values for the second argument are defined in pcre2.h, and |
| are as follows: |
| |
| PCRE2_INFO_ALLOPTIONS |
| PCRE2_INFO_ARGOPTIONS |
| PCRE2_INFO_EXTRAOPTIONS |
| |
| Return copies of the pattern's options. The third argument should point |
| to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the op- |
| tions that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP- |
| TIONS returns the compile options as modified by any top-level (*XXX) |
| option settings such as (*UTF) at the start of the pattern itself. |
| PCRE2_INFO_EXTRAOPTIONS returns the extra options that were set in the |
| compile context by calling the pcre2_set_compile_extra_options() func- |
| tion. |
| |
| For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EX- |
| TENDED option, the result for PCRE2_INFO_ALLOPTIONS is PCRE2_EXTENDED |
| and PCRE2_UTF. Option settings such as (?i) that can change within a |
| pattern do not affect the result of PCRE2_INFO_ALLOPTIONS, even if they |
| appear right at the start of the pattern. (This was different in some |
| earlier releases.) |
| |
| A pattern compiled without PCRE2_ANCHORED is automatically anchored by |
| PCRE2 if the first significant item in every top-level branch is one of |
| the following: |
| |
| ^ unless PCRE2_MULTILINE is set |
| \A always |
| \G always |
| .* sometimes - see below |
| |
| When .* is the first significant item, anchoring is possible only when |
| all the following are true: |
| |
| .* is not in an atomic group |
| .* is not in a capture group that is the subject |
| of a backreference |
| PCRE2_DOTALL is in force for .* |
| Neither (*PRUNE) nor (*SKIP) appears in the pattern |
| PCRE2_NO_DOTSTAR_ANCHOR is not set |
| |
| For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in |
| the options returned for PCRE2_INFO_ALLOPTIONS. |
| |
| PCRE2_INFO_BACKREFMAX |
| |
| Return the number of the highest backreference in the pattern. The |
| third argument should point to a uint32_t variable. Named capture |
| groups acquire numbers as well as names, and these count towards the |
| highest backreference. Backreferences such as \4 or \g{12} match the |
| captured characters of the given group, but in addition, the check that |
| a capture group is set in a conditional group such as (?(3)a|b) is also |
| a backreference. Zero is returned if there are no backreferences. |
| |
| PCRE2_INFO_BSR |
| |
| The output is a uint32_t integer whose value indicates what character |
| sequences the \R escape sequence matches. A value of PCRE2_BSR_UNICODE |
| means that \R matches any Unicode line ending sequence; a value of |
| PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF. |
| |
| PCRE2_INFO_CAPTURECOUNT |
| |
| Return the highest capture group number in the pattern. In patterns |
| where (?| is not used, this is also the total number of capture groups. |
| The third argument should point to a uint32_t variable. |
| |
| PCRE2_INFO_DEPTHLIMIT |
| |
| If the pattern set a backtracking depth limit by including an item of |
| the form (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The |
| third argument should point to a uint32_t integer. If no such value has |
| been set, the call to pcre2_pattern_info() returns the error PCRE2_ER- |
| ROR_UNSET. Note that this limit will only be used during matching if it |
| is less than the limit set or defaulted by the caller of the match |
| function. |
| |
| PCRE2_INFO_FIRSTBITMAP |
| |
| In the absence of a single first code unit for a non-anchored pattern, |
| pcre2_compile() may construct a 256-bit table that defines a fixed set |
| of values for the first code unit in any match. For example, a pattern |
| that starts with [abc] results in a table with three bits set. When |
| code unit values greater than 255 are supported, the flag bit for 255 |
| means "any code unit of value 255 or above". If such a table was con- |
| structed, a pointer to it is returned. Otherwise NULL is returned. The |
| third argument should point to a const uint8_t * variable. |
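| |
| As a rough sketch of how an application might consult such a table, the |
| helper below tests whether a code unit value can start a match. The |
| names can_start_match() and build_abc_bitmap() are hypothetical, and |
| the bit order (bit c%8 of byte c/8, least significant bit first) is an |
| assumption, because the description above does not specify it. |
| |
```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if code unit value c may start a match according to a 256-bit
   first-code-unit table, otherwise 0. Values above 255 share the flag
   bit for 255, as described above.
   ASSUMPTION: the flag for value c is bit (c % 8) of byte (c / 8). */
static int can_start_match(const uint8_t *bitmap, uint32_t c)
{
    assert(bitmap != 0);
    if (c > 255) c = 255;
    return (bitmap[c / 8] >> (c % 8)) & 1;
}

/* Build, under the same assumed layout, the table that a pattern
   starting with [abc] would imply. For illustration only; in a real
   program the table is produced by pcre2_compile(). */
static void build_abc_bitmap(uint8_t bitmap[32])
{
    for (int i = 0; i < 32; i++) bitmap[i] = 0;
    for (uint32_t c = 'a'; c <= 'c'; c++)
        bitmap[c / 8] |= (uint8_t)(1u << (c % 8));
}
```
| |
| In a real program the table pointer would be obtained by calling |
| pcre2_pattern_info() with PCRE2_INFO_FIRSTBITMAP, and a NULL result |
| would mean that no such table exists for the pattern. |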
| |
| PCRE2_INFO_FIRSTCODETYPE |
| |
| Return information about the first code unit of any matched string, for |
| a non-anchored pattern. The third argument should point to a uint32_t |
| variable. If there is a fixed first value, for example, the letter "c" |
| from a pattern such as (cat|cow|coyote), 1 is returned, and the value |
| can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed |
| first value, but it is known that a match can occur only at the start |
| of the subject or following a newline in the subject, 2 is returned. |
| Otherwise, and for anchored patterns, 0 is returned. |
| |
| PCRE2_INFO_FIRSTCODEUNIT |
| |
| Return the value of the first code unit of any matched string for a |
| pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0. |
| The third argument should point to a uint32_t variable. In the 8-bit |
| library, the value is always less than 256. In the 16-bit library the |
| value can be up to 0xffff. In the 32-bit library in UTF-32 mode the |
| value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32 |
| mode. |
| |
| PCRE2_INFO_FRAMESIZE |
| |
| Return the size (in bytes) of the data frames that are used to remember |
| backtracking positions when the pattern is processed by pcre2_match() |
| without the use of JIT. The third argument should point to a size_t |
| variable. The frame size depends on the number of capturing parentheses |
| in the pattern. Each additional capture group adds two PCRE2_SIZE vari- |
| ables. |
| |
| PCRE2_INFO_HASBACKSLASHC |
| |
| Return 1 if the pattern contains any instances of \C, otherwise 0. The |
| third argument should point to a uint32_t variable. |
| |
| PCRE2_INFO_HASCRORLF |
| |
| Return 1 if the pattern contains any explicit matches for CR or LF |
| characters, otherwise 0. The third argument should point to a uint32_t |
| variable. An explicit match is either a literal CR or LF character, or |
| \r or \n or one of the equivalent hexadecimal or octal escape se- |
| quences. |
| |
| PCRE2_INFO_HEAPLIMIT |
| |
| If the pattern set a heap memory limit by including an item of the form |
| (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu- |
| ment should point to a uint32_t integer. If no such value has been set, |
| the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UNSET. |
| Note that this limit will only be used during matching if it is less |
| than the limit set or defaulted by the caller of the match function. |
| |
| PCRE2_INFO_JCHANGED |
| |
| Return 1 if the (?J) or (?-J) option setting is used in the pattern, |
| otherwise 0. The third argument should point to a uint32_t variable. |
| (?J) and (?-J) set and unset the local PCRE2_DUPNAMES option, respec- |
| tively. |
| |
| PCRE2_INFO_JITSIZE |
| |
| If the compiled pattern was successfully processed by pcre2_jit_com- |
| pile(), return the size of the JIT compiled code, otherwise return |
| zero. The third argument should point to a size_t variable. |
| |
| PCRE2_INFO_LASTCODETYPE |
| |
| Return 1 if there is a rightmost literal code unit that must exist in |
| any matched string, other than at its start. The third argument should |
| point to a uint32_t variable. If there is no such value, 0 is returned. |
| When 1 is returned, the code unit value itself can be retrieved using |
| PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is |
| recorded only if it follows something of variable length. For example, |
| for the pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned |
| from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is |
| 0. |
| |
| PCRE2_INFO_LASTCODEUNIT |
| |
| Return the value of the rightmost literal code unit that must exist in |
| any matched string, other than at its start, for a pattern where |
| PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu- |
| ment should point to a uint32_t variable. |
| |
| PCRE2_INFO_MATCHEMPTY |
| |
| Return 1 if the pattern might match an empty string, otherwise 0. The |
| third argument should point to a uint32_t variable. When a pattern con- |
| tains recursive subroutine calls it is not always possible to determine |
| whether or not it can match an empty string. PCRE2 takes a cautious ap- |
| proach and returns 1 in such cases. |
| |
| PCRE2_INFO_MATCHLIMIT |
| |
| If the pattern set a match limit by including an item of the form |
| (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third ar- |
| gument should point to a uint32_t integer. If no such value has been |
| set, the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UN- |
| SET. Note that this limit will only be used during matching if it is |
| less than the limit set or defaulted by the caller of the match func- |
| tion. |
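| |
| The rule in the last sentence, which is stated in the same words for |
| (*LIMIT_DEPTH=nnnn) and (*LIMIT_HEAP=nnnn), amounts to taking the |
| smaller of the two limits. A minimal sketch, using a hypothetical |
| helper effective_limit() that is not part of PCRE2: |
| |
```c
#include <stdint.h>

/* Model of the rule above: a limit embedded in the pattern can lower the
   limit set or defaulted by the caller, but can never raise it.
   pattern_limit_is_set is 0 when pcre2_pattern_info() would return
   PCRE2_ERROR_UNSET for the corresponding request. */
static uint32_t effective_limit(int pattern_limit_is_set,
                                uint32_t pattern_limit,
                                uint32_t caller_limit)
{
    if (pattern_limit_is_set && pattern_limit < caller_limit)
        return pattern_limit;
    return caller_limit;
}
```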
| |
| PCRE2_INFO_MAXLOOKBEHIND |
| |
| A lookbehind assertion moves back a certain number of characters (not |
| code units) when it starts to process each of its branches. This re- |
| quest returns the largest of these backward moves. The third argument |
| should point to a uint32_t integer. The simple assertions \b and \B re- |
| quire a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND to |
| return 1 in the absence of anything longer. \A also registers a one- |
| character lookbehind, though it does not actually inspect the previous |
| character. |
| |
| Note that this information is useful for multi-segment matching only if |
| the pattern contains no nested lookbehinds. For example, the pattern |
| (?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is pro- |
| cessed, the first lookbehind moves back by two characters, matches one |
| character, then the nested lookbehind also moves back by two charac- |
| ters. This puts the matching point three characters earlier than it was |
| at the start. PCRE2_INFO_MAXLOOKBEHIND is really only useful as a de- |
| bugging tool. See the pcre2partial documentation for a discussion of |
| multi-segment matching. |
| |
| PCRE2_INFO_MINLENGTH |
| |
| If a minimum length for matching subject strings was computed, its |
| value is returned. Otherwise the returned value is 0. This value is not |
| computed when PCRE2_NO_START_OPTIMIZE is set. The value is a number of |
| characters, which in UTF mode may be different from the number of code |
| units. The third argument should point to a uint32_t variable. The |
| value is a lower bound to the length of any matching string. There may |
| not be any strings of that length that do actually match, but every |
| string that does match is at least that long. |
| |
| PCRE2_INFO_NAMECOUNT |
| PCRE2_INFO_NAMEENTRYSIZE |
| PCRE2_INFO_NAMETABLE |
| |
| PCRE2 supports the use of named as well as numbered capturing parenthe- |
| ses. The names are just an additional way of identifying the parenthe- |
| ses, which still acquire numbers. Several convenience functions such as |
| pcre2_substring_get_byname() are provided for extracting captured sub- |
| strings by name. It is also possible to extract the data directly, by |
| first converting the name to a number in order to access the correct |
| pointers in the output vector (described with pcre2_match() below). To |
| do the conversion, you need to use the name-to-number map, which is de- |
| scribed by these three values. |
| |
| The map consists of a number of fixed-size entries. PCRE2_INFO_NAME- |
| COUNT gives the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives |
| the size of each entry in code units; both of these return a uint32_t |
| value. The entry size depends on the length of the longest name. |
| |
| PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. |
| This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit li- |
| brary, the first two bytes of each entry are the number of the captur- |
| ing parenthesis, most significant byte first. In the 16-bit library, |
| the pointer points to 16-bit code units, the first of which contains |
| the parenthesis number. In the 32-bit library, the pointer points to |
| 32-bit code units, the first of which contains the parenthesis number. |
| The rest of the entry is the corresponding name, zero terminated. |
| |
| The names are in alphabetical order. If (?| is used to create multiple |
| capture groups with the same number, as described in the section on du- |
| plicate group numbers in the pcre2pattern page, the groups may be given |
| the same name, but there is only one entry in the table. Different |
| names for groups of the same number are not permitted. |
| |
| Duplicate names for capture groups with different numbers are permit- |
| ted, but only if PCRE2_DUPNAMES is set. They appear in the table in the |
| order in which they were found in the pattern. In the absence of (?| |
| this is the order of increasing number; when (?| is used this is not |
| necessarily the case because later capture groups may have lower num- |
| bers. |
| |
| As a simple example of the name/number table, consider the following |
| pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED |
| is set, so white space - including newlines - is ignored): |
| |
| (?<date> (?<year>(\d\d)?\d\d) - |
| (?<month>\d\d) - (?<day>\d\d) ) |
| |
| There are four named capture groups, so the table has four entries, and |
| each entry in the table is eight bytes long. The table is as follows, |
| with non-printing bytes shown in hexadecimal, and undefined bytes shown |
| as ??: |
| |
| 00 01 d  a  t  e  00 ?? |
| 00 05 d  a  y  00 ?? ?? |
| 00 04 m  o  n  t  h  00 |
| 00 02 y  e  a  r  00 ?? |
| |
| When writing code to extract data from named capture groups using the |
| name-to-number map, remember that the length of the entries is likely |
| to be different for each compiled pattern. |
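| |
| The following sketch shows one way to search an 8-bit name table with |
| the layout described above. It assumes that the table pointer, the |
| entry count, and the entry size have already been obtained with the |
| three PCRE2_INFO_NAME... requests. The helper lookup_group_number() |
| and the array example_table are hypothetical names used only here; |
| the array reproduces the four-entry table shown above. |
| |
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Look up a group name in an 8-bit name table. Each entry is entry_size
   bytes long: two bytes holding the group number (most significant byte
   first), then the zero-terminated name. Returns the group number, or
   -1 if the name is not in the table. If duplicate names are present,
   the first entry found is returned. */
static int lookup_group_number(const uint8_t *table, uint32_t name_count,
                               uint32_t entry_size, const char *name)
{
    assert(entry_size >= 3);  /* two number bytes plus at least a NUL */
    for (uint32_t i = 0; i < name_count; i++) {
        const uint8_t *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The table from the example above. Bytes shown there as ?? are
   undefined; they are filled with zero here. */
static const uint8_t example_table[] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00,
};
```
| |
| Because the entry size is passed in rather than hard-coded, the same |
| helper works for any compiled pattern in the 8-bit library. |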
| |
| PCRE2_INFO_NEWLINE |
| |
| The output is one of the following uint32_t values: |
| |
| PCRE2_NEWLINE_CR Carriage return (CR) |
| PCRE2_NEWLINE_LF Linefeed (LF) |
| PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF) |
| PCRE2_NEWLINE_ANY Any Unicode line ending |
| PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF |
| PCRE2_NEWLINE_NUL The NUL character (binary zero) |
| |
| This identifies the character sequence that will be recognized as mean- |
| ing "newline" while matching. |
| |
| PCRE2_INFO_SIZE |
| |
| Return the size of the compiled pattern in bytes (for all three li- |
| braries). The third argument should point to a size_t variable. This |
| value includes the size of the general data block that precedes the |
| code units of the compiled pattern itself. The value that is used when |
| pcre2_compile() is getting memory in which to place the compiled pat- |
| tern may be slightly larger than the value returned by this option, be- |
| cause there are cases where the code that calculates the size has to |
| over-estimate. Processing a pattern with the JIT compiler does not al- |
| ter the value returned by this option. |
| |
| |
| INFORMATION ABOUT A PATTERN'S CALLOUTS |
| |
| int pcre2_callout_enumerate(const pcre2_code *code, |
| int (*callback)(pcre2_callout_enumerate_block *, void *), |
| void *user_data); |
| |
| A script language that supports the use of string arguments in callouts |
| might like to scan all the callouts in a pattern before running the |
| match. This can be done by calling pcre2_callout_enumerate(). The first |
| argument is a pointer to a compiled pattern, the second points to a |
| callback function, and the third is arbitrary user data. The callback |
| function is called for every callout in the pattern in the order in |
| which they appear. Its first argument is a pointer to a callout enumer- |
| ation block, and its second argument is the user_data value that was |
| passed to pcre2_callout_enumerate(). The contents of the callout enu- |
| meration block are described in the pcre2callout documentation, which |
| also gives further details about callouts. |
| |
| |
| SERIALIZATION AND PRECOMPILING |
| |
| It is possible to save compiled patterns on disc or elsewhere, and |
| reload them later, subject to a number of restrictions. The host on |
| which the patterns are reloaded must be running the same version of |
| PCRE2, with the same code unit width, and must also have the same endi- |
| anness, pointer width, and PCRE2_SIZE type. Before compiled patterns |
| can be saved, they must be converted to a "serialized" form, which in |
| the case of PCRE2 is really just a bytecode dump. The functions whose |
| names begin with pcre2_serialize_ are used for converting to and from |
| the serialized form. They are described in the pcre2serialize |
| documentation. Note that, unlike Java or .NET serialization, PCRE2 |
| serialization does not convert compiled patterns to an abstract format. |
| |
| |
| THE MATCH DATA BLOCK |
| |
| pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize, |
| pcre2_general_context *gcontext); |
| |
| pcre2_match_data *pcre2_match_data_create_from_pattern( |
| const pcre2_code *code, pcre2_general_context *gcontext); |
| |
| void pcre2_match_data_free(pcre2_match_data *match_data); |
| |
| Information about a successful or unsuccessful match is placed in a |
| match data block, which is an opaque structure that is accessed by |
| function calls. In particular, the match data block contains a vector |
| of offsets into the subject string that define the matched parts of the |
| subject. This is known as the ovector. |
| |
| Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match() |
| you must create a match data block by calling one of the creation func- |
| tions above. For pcre2_match_data_create(), the first argument is the |
| number of pairs of offsets in the ovector. |
| |
| When using pcre2_match(), one pair of offsets is required to identify |
| the string that matched the whole pattern, with an additional pair for |
| each captured substring. For example, a value of 4 creates enough space |
| to record the matched portion of the subject plus three captured sub- |
| strings. |
| |
| When using pcre2_dfa_match() there may be multiple matched substrings |
| of different lengths at the same point in the subject. The ovector |
| should be made large enough to hold as many as are expected. |
| |
| A minimum of 1 pair is imposed by pcre2_match_data_create(), so it is |
| always possible to return the overall matched string in the case of |
| pcre2_match() or the longest match in the case of pcre2_dfa_match(). |
| The maximum number of pairs is 65535; if the first argument of |
| pcre2_match_data_create() is greater than this, 65535 is used. |
| |
| The second argument of pcre2_match_data_create() is a pointer to a gen- |
| eral context, which can specify custom memory management for obtaining |
| the memory for the match data block. If you are not using custom memory |
| management, pass NULL, which causes malloc() to be used. |
| |
| For pcre2_match_data_create_from_pattern(), the first argument is a |
| pointer to a compiled pattern. The ovector is created to be exactly the |
| right size to hold all the substrings a pattern might capture when |
| matched using pcre2_match(). You should not use this call when matching |
| with pcre2_dfa_match(). The second argument is again a pointer to a |
| general context, but in this case if NULL is passed, the memory is ob- |
| tained using the same allocator that was used for the compiled pattern |
| (custom or default). |
| |
| A match data block can be used many times, with the same or different |
| compiled patterns. You can extract information from a match data block |
| after a match operation has finished, using functions that are de- |
| scribed in the sections on matched strings and other match data below. |
| |
| When a call of pcre2_match() fails, valid data is available in the |
| match block only when the error is PCRE2_ERROR_NOMATCH, PCRE2_ER- |
| ROR_PARTIAL, or one of the error codes for an invalid UTF string. Ex- |
| actly what is available depends on the error, and is detailed below. |
| |
| When one of the matching functions is called, pointers to the compiled |
| pattern and the subject string are set in the match data block so that |
| they can be referenced by the extraction functions after a successful |
| match. After running a match, you must not free a compiled pattern or a |
| subject string until after all operations on the match data block (for |
| that match) have taken place, unless, in the case of the subject |
| string, you have used the PCRE2_COPY_MATCHED_SUBJECT option, which is |
| described in the section entitled "Option bits for pcre2_match()" be- |
| low. |
| |
| When a match data block itself is no longer needed, it should be freed |
| by calling pcre2_match_data_free(). If this function is called with a |
| NULL argument, it returns immediately, without doing anything. |
| |
| |
| MATCHING A PATTERN: THE TRADITIONAL FUNCTION |
| |
| int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext); |
| |
| The function pcre2_match() is called to match a subject string against |
| a compiled pattern, which is passed in the code argument. You can call |
| pcre2_match() with the same code argument as many times as you like, in |
| order to find multiple matches in the subject string or to match dif- |
| ferent subject strings with the same pattern. |
| |
| This function is the main matching facility of the library, and it op- |
| erates in a Perl-like manner. For specialist use there is also an al- |
| ternative matching function, which is described below in the section |
| about the pcre2_dfa_match() function. |
| |
| Here is an example of a simple call to pcre2_match(): |
| |
| pcre2_match_data *md = pcre2_match_data_create(4, NULL); |
| int rc = pcre2_match( |
| re, /* result of pcre2_compile() */ |
| (PCRE2_SPTR)"some string", /* the subject string */ |
| 11, /* the length of the subject string */ |
| 0, /* start at offset 0 in the subject */ |
| 0, /* default options */ |
| md, /* the match data block */ |
| NULL); /* a match context; NULL means use defaults */ |
| |
| If the subject string is zero-terminated, the length can be given as |
| PCRE2_ZERO_TERMINATED. A match context must be provided if certain less |
| common matching parameters are to be changed. For details, see the sec- |
| tion on the match context above. |
| |
| The string to be matched by pcre2_match() |
| |
| The subject string is passed to pcre2_match() as a pointer in subject, |
| a length in length, and a starting offset in startoffset. The length |
| and offset are in code units, not characters. That is, they are in |
| bytes for the 8-bit library, 16-bit code units for the 16-bit library, |
| and 32-bit code units for the 32-bit library, whether or not UTF pro- |
| cessing is enabled. As a special case, if subject is NULL and length is |
| zero, the subject is assumed to be an empty string. If length is non- |
| zero, an error occurs if subject is NULL. |
| |
| If startoffset is greater than the length of the subject, pcre2_match() |
| returns PCRE2_ERROR_BADOFFSET. When the starting offset is zero, the |
| search for a match starts at the beginning of the subject, and this is |
| by far the most common case. In UTF-8 or UTF-16 mode, the starting off- |
| set must point to the start of a character, or to the end of the sub- |
| ject (in UTF-32 mode, one code unit equals one character, so all off- |
| sets are valid). Like the pattern string, the subject may contain bi- |
| nary zeros. |
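| |
| For the 8-bit library in UTF-8 mode, the rule about starting offsets |
| can be checked before the call. The sketch below relies on the fact |
| that UTF-8 continuation bytes have the bit pattern 10xxxxxx; the |
| helper name is_valid_utf8_start() is hypothetical and is not part of |
| PCRE2. |
| |
```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if offset is acceptable as a starting offset in UTF-8 mode:
   it must be the end of the subject or the first byte of a character.
   Return 0 otherwise, including when offset is beyond the subject. */
static int is_valid_utf8_start(const uint8_t *subject, size_t length,
                               size_t offset)
{
    if (offset > length) return 0;   /* beyond the end of the subject */
    if (offset == length) return 1;  /* end of subject is allowed */
    return (subject[offset] & 0xC0) != 0x80;  /* not a continuation byte */
}
```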
| |
| A non-zero starting offset is useful when searching for another match |
| in the same subject by calling pcre2_match() again after a previous |
| success. Setting startoffset differs from passing over a shortened |
| string and setting PCRE2_NOTBOL in the case of a pattern that begins |
| with any kind of lookbehind. For example, consider the pattern |
| |
| \Biss\B |
| |
| which finds occurrences of "iss" in the middle of words. (\B matches |
| only if the current position in the subject is not a word boundary.) |
| When applied to the string "Mississippi" the first call to |
| pcre2_match() finds the first occurrence. If pcre2_match() is called |
| again with just the remainder of the subject, namely "issippi", it does |
| not match, because \B is always false at the start of the subject, |
| which is deemed to be a word boundary. However, if pcre2_match() is |
| passed the entire string again, but with startoffset set to 4, it finds |
| the second occurrence of "iss" because it is able to look behind the |
| starting point to discover that it is preceded by a letter. |
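| |
| To make the difference concrete, here is a small hand-written model of |
| the pattern \Biss\B. It does not call PCRE2 and is only a sketch of |
| the behaviour described above; the helper names are hypothetical, and |
| word characters are assumed to be ASCII letters, digits, and |
| underscore. |
| |
```c
#include <ctype.h>
#include <stddef.h>

/* Return 1 if position pos holds a word character. Positions outside
   the subject count as non-word, which is what makes the start of the
   subject a word boundary when the first character is a letter. */
static int is_word_char(const char *s, size_t len, long pos)
{
    if (pos < 0 || (size_t)pos >= len) return 0;
    return isalnum((unsigned char)s[pos]) || s[pos] == '_';
}

/* Model of \B: true when the characters on both sides of pos are of the
   same kind (both word characters or both non-word). */
static int not_word_boundary(const char *s, size_t len, size_t pos)
{
    return is_word_char(s, len, (long)pos - 1) ==
           is_word_char(s, len, (long)pos);
}

/* Return the offset of the first match of \Biss\B at or after
   startoffset, or -1 if there is none. The whole subject stays visible,
   so the test at startoffset can look at the preceding character. */
static long find_inner_iss(const char *s, size_t len, size_t startoffset)
{
    for (size_t i = startoffset; i + 3 <= len; i++) {
        if (not_word_boundary(s, len, i) &&
            s[i] == 'i' && s[i + 1] == 's' && s[i + 2] == 's' &&
            not_word_boundary(s, len, i + 3))
            return (long)i;
    }
    return -1;
}
```
| |
| With this model, searching "Mississippi" from offset 0 finds the match |
| at offset 1, searching the shortened string "issippi" finds nothing, |
| and searching "Mississippi" from offset 4 finds the match at offset 4. |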
| |
| Finding all the matches in a subject is tricky when the pattern can |
| match an empty string. It is possible to emulate Perl's /g behaviour by |
| first trying the match again at the same offset, with the |
| PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED options, and then if that |
| fails, advancing the starting offset and trying an ordinary match |
| again. There is some code that demonstrates how to do this in the |
| pcre2demo sample program. In the most general case, you have to check |
| to see if the newline convention recognizes CRLF as a newline, and if |
| so, and the current character is CR followed by LF, advance the start- |
| ing offset by two characters instead of one. |
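| |
| The advancing step can be sketched as follows for the 8-bit library. |
| This covers only the calculation of the next starting offset after the |
| retry at the same offset has failed; the helper name |
| next_offset_after_empty() and its flag parameters are hypothetical. In |
| UTF-8 mode it also skips continuation bytes, on the assumption that |
| the new offset must point to the start of a character. |
| |
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compute the next starting offset after an empty match at offset, once
   the anchored, not-empty retry at that offset has failed.
   crlf_is_newline: non-zero if the newline convention recognizes CRLF.
   utf8: non-zero if the pattern was compiled in UTF-8 mode.
   The caller must ensure offset < length. */
static size_t next_offset_after_empty(const uint8_t *subject,
                                      size_t length, size_t offset,
                                      int crlf_is_newline, int utf8)
{
    assert(offset < length);
    size_t next = offset + 1;

    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n') {
        next += 1;  /* step over CRLF as a single newline */
    } else if (utf8) {
        /* step over the rest of a multi-byte character */
        while (next < length && (subject[next] & 0xC0) == 0x80)
            next++;
    }
    return next;
}
```
| |
| If the returned offset is beyond the end of the subject, the search for |
| further matches is over. |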
| |
| If a non-zero starting offset is passed when the pattern is anchored, a |
| single attempt to match at the given offset is made. This can only suc- |
| ceed if the pattern does not require the match to be at the start of |
| the subject. In other words, the anchoring must be the result of set- |
| ting the PCRE2_ANCHORED option or the use of .* with PCRE2_DOTALL, not |
| by starting the pattern with ^ or \A. |
| |
| Option bits for pcre2_match() |
| |
| The unused bits of the options argument for pcre2_match() must be zero. |
| The only bits that may be set are PCRE2_ANCHORED, |
| PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO- |
| TEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT, |
| PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their |
| action is described below. |
| |
| Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not sup- |
| ported by the just-in-time (JIT) compiler. If it is set, JIT matching |
| is disabled and the interpretive code in pcre2_match() is run. Apart |
| from PCRE2_NO_JIT (obviously), the remaining options are supported for |
| JIT matching. |
| |
| PCRE2_ANCHORED |
| |
| The PCRE2_ANCHORED option limits pcre2_match() to matching at the first |
| matching position. If a pattern was compiled with PCRE2_ANCHORED, or |
| turned out to be anchored by virtue of its contents, it cannot be made |
| unanchored at matching time. Note that setting the option at match time |
| disables JIT matching. |
| |
| PCRE2_COPY_MATCHED_SUBJECT |
| |
| By default, a pointer to the subject is remembered in the match data |
| block so that, after a successful match, it can be referenced by the |
| substring extraction functions. This means that the subject's memory |
| must not be freed until all such operations are complete. For some ap- |
| plications where the lifetime of the subject string is not guaranteed, |
| it may be necessary to make a copy of the subject string, but it is |
| wasteful to do this unless the match is successful. After a successful |
| match, if PCRE2_COPY_MATCHED_SUBJECT is set, the subject is copied and |
| the new pointer is remembered in the match data block instead of the |
| original subject pointer. The memory allocator that was used for the |
| match block itself is used. The copy is automatically freed when |
| pcre2_match_data_free() is called to free the match data block. It is |
| also automatically freed if the match data block is re-used for another |
| match operation. |
| |
| PCRE2_ENDANCHORED |
| |
| If the PCRE2_ENDANCHORED option is set, any string that pcre2_match() |
| matches must be right at the end of the subject string. Note that set- |
| ting the option at match time disables JIT matching. |
| |
| PCRE2_NOTBOL |
| |
| This option specifies that the first character of the subject string is |
| not the beginning of a line, so the circumflex metacharacter should not |
| match before it. Setting this without having set PCRE2_MULTILINE at |
| compile time causes circumflex never to match. This option affects only |
| the behaviour of the circumflex metacharacter. It does not affect \A. |
| |
| PCRE2_NOTEOL |
| |
| This option specifies that the end of the subject string is not the end |
| of a line, so the dollar metacharacter should not match it nor (except |
| in multiline mode) a newline immediately before it. Setting this with- |
| out having set PCRE2_MULTILINE at compile time causes dollar never to |
| match. This option affects only the behaviour of the dollar metacharac- |
| ter. It does not affect \Z or \z. |
| |
| PCRE2_NOTEMPTY |
| |
| An empty string is not considered to be a valid match if this option is |
| set. If there are alternatives in the pattern, they are tried. If all |
| the alternatives match the empty string, the entire match fails. For |
| example, if the pattern |
| |
| a?b? |
| |
| is applied to a string not beginning with "a" or "b", it matches an |
| empty string at the start of the subject. With PCRE2_NOTEMPTY set, this |
| match is not valid, so pcre2_match() searches further into the string |
| for occurrences of "a" or "b". |
| |
| PCRE2_NOTEMPTY_ATSTART |
| |
| This is like PCRE2_NOTEMPTY, except that it locks out an empty string |
| match only at the first matching position, that is, at the start of the |
| subject plus the starting offset. An empty string match later in the |
| subject is permitted. If the pattern is anchored, such a match can oc- |
| cur only if the pattern contains \K. |
| |
| PCRE2_NO_JIT |
| |
| By default, if a pattern has been successfully processed by |
| pcre2_jit_compile(), JIT is automatically used when pcre2_match() is |
| called with options that JIT supports. Setting PCRE2_NO_JIT disables |
| the use of JIT; it forces matching to be done by the interpreter. |
| |
| PCRE2_NO_UTF_CHECK |
| |
| When PCRE2_UTF is set at compile time, the validity of the subject as a |
| UTF string is checked unless PCRE2_NO_UTF_CHECK is passed to |
| pcre2_match() or PCRE2_MATCH_INVALID_UTF was passed to pcre2_compile(). |
| The latter special case is discussed in detail in the pcre2unicode doc- |
| umentation. |
| |
| In the default case, if a non-zero starting offset is given, the check |
| is applied only to that part of the subject that could be inspected |
| during matching, and there is a check that the starting offset points |
| to the first code unit of a character or to the end of the subject. If |
| there are no lookbehind assertions in the pattern, the check starts at |
| the starting offset. Otherwise, it starts at the length of the longest |
| lookbehind before the starting offset, or at the start of the subject |
| if there are not that many characters before the starting offset. Note |
| that the sequences \b and \B are one-character lookbehinds. |
| |
| The check is carried out before any other processing takes place, and a |
| negative error code is returned if the check fails. There are several |
| UTF error codes for each code unit width, corresponding to different |
| problems with the code unit sequence. There are discussions about the |
| validity of UTF-8 strings, UTF-16 strings, and UTF-32 strings in the |
| pcre2unicode documentation. |
| |
| If you know that your subject is valid, and you want to skip this check |
| for performance reasons, you can set the PCRE2_NO_UTF_CHECK option when |
| calling pcre2_match(). You might want to do this for the second and |
| subsequent calls to pcre2_match() if you are making repeated calls to |
| find multiple matches in the same subject string. |
| |
| Warning: Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when |
| PCRE2_NO_UTF_CHECK is set at match time the effect of passing an in- |
| valid string as a subject, or an invalid value of startoffset, is unde- |
| fined. Your program may crash or loop indefinitely or give wrong re- |
| sults. |
| |
| PCRE2_PARTIAL_HARD |
| PCRE2_PARTIAL_SOFT |
| |
| These options turn on the partial matching feature. A partial match oc- |
| curs if the end of the subject string is reached successfully, but |
| there are not enough subject characters to complete the match. In addi- |
| tion, either at least one character must have been inspected or the |
| pattern must contain a lookbehind, or the pattern must be one that |
| could match an empty string. |
| |
| If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PAR- |
| TIAL_HARD) is set, matching continues by testing any remaining alterna- |
| tives. Only if no complete match can be found is PCRE2_ERROR_PARTIAL |
| returned instead of PCRE2_ERROR_NOMATCH. In other words, PCRE2_PAR- |
| TIAL_SOFT specifies that the caller is prepared to handle a partial |
| match, but only if no complete match can be found. |
| |
| If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this |
| case, if a partial match is found, pcre2_match() immediately returns |
| PCRE2_ERROR_PARTIAL, without considering any other alternatives. In |
| other words, when PCRE2_PARTIAL_HARD is set, a partial match is |
| considered to be more important than an alternative complete match. |
| |
| There is a more detailed discussion of partial and multi-segment match- |
| ing, with examples, in the pcre2partial documentation. |
| |
| |
| NEWLINE HANDLING WHEN MATCHING |
| |
| When PCRE2 is built, a default newline convention is set; this is usu- |
| ally the standard convention for the operating system. The default can |
| be overridden in a compile context by calling pcre2_set_newline(). It |
| can also be overridden by starting a pattern string with, for example, |
| (*CRLF), as described in the section on newline conventions in the |
| pcre2pattern page. During matching, the newline choice affects the be- |
| haviour of the dot, circumflex, and dollar metacharacters. It may also |
| alter the way the match starting position is advanced after a match |
| failure for an unanchored pattern. |
| |
| When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is |
| set as the newline convention, and a match attempt for an unanchored |
| pattern fails when the current starting position is at a CRLF sequence, |
| and the pattern contains no explicit matches for CR or LF characters, |
| the match position is advanced by two characters instead of one, in |
| other words, to after the CRLF. |
| |
| The above rule is a compromise that makes the most common cases work as |
| expected. For example, if the pattern is .+A (and the PCRE2_DOTALL op- |
| tion is not set), it does not match the string "\r\nA" because, after |
| failing at the start, it skips both the CR and the LF before retrying. |
| However, the pattern [\r\n]A does match that string, because it con- |
| tains an explicit CR or LF reference, and so advances only by one char- |
| acter after the first failure. |
| |
| An explicit match for CR or LF is either a literal appearance of one of |
| those characters in the pattern, or one of the \r or \n or equivalent |
| octal or hexadecimal escape sequences. Implicit matches such as [^X] do |
| not count, nor does \s, even though it includes CR and LF in the char- |
| acters that it matches. |
| |
| Notwithstanding the above, anomalous effects may still occur when CRLF |
| is a valid newline sequence and explicit \r or \n escapes appear in the |
| pattern. |
| |
| |
| HOW pcre2_match() RETURNS A STRING AND CAPTURED SUBSTRINGS |
| |
| uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data); |
| |
| PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data); |
| |
| In general, a pattern matches a certain portion of the subject, and in |
| addition, further substrings from the subject may be picked out by |
| parenthesized parts of the pattern. Following the usage in Jeffrey |
| Friedl's book, this is called "capturing" in what follows, and the |
| phrase "capture group" (Perl terminology) is used for a fragment of a |
| pattern that picks out a substring. PCRE2 supports several other kinds |
| of parenthesized group that do not cause substrings to be captured. The |
| pcre2_pattern_info() function can be used to find out how many capture |
| groups there are in a compiled pattern. |
| |
| You can use auxiliary functions for accessing captured substrings by |
| number or by name, as described in sections below. |
| |
| Alternatively, you can make direct use of the vector of PCRE2_SIZE val- |
| ues, called the ovector, which contains the offsets of captured |
| strings. It is part of the match data block. The function |
| pcre2_get_ovector_pointer() returns the address of the ovector, and |
| pcre2_get_ovector_count() returns the number of pairs of values it con- |
| tains. |
| |
| Within the ovector, the first in each pair of values is set to the off- |
| set of the first code unit of a substring, and the second is set to the |
| offset of the first code unit after the end of a substring. These val- |
| ues are always code unit offsets, not character offsets. That is, they |
| are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit li- |
| brary, and 32-bit offsets in the 32-bit library. |
| |
| After a partial match (error return PCRE2_ERROR_PARTIAL), only the |
| first pair of offsets (that is, ovector[0] and ovector[1]) are set. |
| They identify the part of the subject that was partially matched. See |
| the pcre2partial documentation for details of partial matching. |
| |
| After a fully successful match, the first pair of offsets identifies |
| the portion of the subject string that was matched by the entire pat- |
| tern. The next pair is used for the first captured substring, and so |
| on. The value returned by pcre2_match() is one more than the highest |
| numbered pair that has been set. For example, if two substrings have |
| been captured, the returned value is 3. If there are no captured sub- |
| strings, the return value from a successful match is 1, indicating that |
| just the first pair of offsets has been set. |
| |
| If a pattern uses the \K escape sequence within a positive assertion, |
| the reported start of a successful match can be greater than the end of |
| the match. For example, if the pattern (?=ab\K) is matched against |
| "ab", the start and end offset values for the match are 2 and 0. |
| |
| If a capture group is matched repeatedly within a single match opera- |
| tion, it is the last portion of the subject that it matched that is re- |
| turned. |
| |
| If the ovector is too small to hold all the captured substring offsets, |
| as much as possible is filled in, and the function returns a value of |
| zero. If captured substrings are not of interest, pcre2_match() may be |
| called with a match data block whose ovector is of minimum length (that |
| is, one pair). |
| |
| It is possible for capture group number n+1 to match some part of the |
| subject when group n has not been used at all. For example, if the |
| string "abc" is matched against the pattern (a|(z))(bc) the return from |
| the function is 4, and groups 1 and 3 are matched, but 2 is not. When |
| this happens, both values in the offset pairs corresponding to unused |
| groups are set to PCRE2_UNSET. |
| |
| Offset values that correspond to unused groups at the end of the ex- |
| pression are also set to PCRE2_UNSET. For example, if the string "abc" |
| is matched against the pattern (abc)(x(yz)?)? groups 2 and 3 are not |
| matched. The return from the function is 2, because the highest used |
| capture group number is 1. The offsets for the second and third |
| capture groups (assuming the vector is large enough, of course) are |
| set to PCRE2_UNSET. |
| |
| Elements in the ovector that do not correspond to capturing parentheses |
| in the pattern are never changed. That is, if a pattern contains n cap- |
| turing parentheses, no more than ovector[0] to ovector[2n+1] are set by |
| pcre2_match(). The other elements retain whatever values they previ- |
| ously had. After a failed match attempt, the contents of the ovector |
| are unchanged. |
| |
| |
| OTHER INFORMATION ABOUT A MATCH |
| |
| PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data); |
| |
| PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data); |
| |
| As well as the offsets in the ovector, other information about a match |
| is retained in the match data block and can be retrieved by the above |
| functions in appropriate circumstances. If they are called at other |
| times, the result is undefined. |
| |
| After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a |
| failure to match (PCRE2_ERROR_NOMATCH), a mark name may be available. |
| The function pcre2_get_mark() can be called to access this name, which |
| can be specified in the pattern by any of the backtracking control |
| verbs, not just (*MARK). The same function applies to all the verbs. It |
| returns a pointer to the zero-terminated name, which is within the com- |
| piled pattern. If no name is available, NULL is returned. The length of |
| the name (excluding the terminating zero) is stored in the code unit |
| that precedes the name. You should use this length instead of relying |
| on the terminating zero if the name might contain a binary zero. |
| |
| After a successful match, the name that is returned is the last mark |
| name encountered on the matching path through the pattern. Instances of |
| backtracking verbs without names do not count. Thus, for example, if |
| the matching path contains (*MARK:A)(*PRUNE), the name "A" is returned. |
| After a "no match" or a partial match, the last encountered name is re- |
| turned. For example, consider this pattern: |
| |
| ^(*MARK:A)((*MARK:B)a|b)c |
| |
| When it matches "bc", the returned name is A. The B mark is "seen" in |
| the first branch of the group, but it is not on the matching path. On |
| the other hand, when this pattern fails to match "bx", the returned |
| name is B. |
| |
| Warning: By default, certain start-of-match optimizations are used to |
| give a fast "no match" result in some situations. For example, if the |
| anchoring is removed from the pattern above, there is an initial check |
| for the presence of "c" in the subject before running the matching en- |
| gine. This check fails for "bx", causing a match failure without seeing |
| any marks. You can disable the start-of-match optimizations by setting |
| the PCRE2_NO_START_OPTIMIZE option for pcre2_compile() or by starting |
| the pattern with (*NO_START_OPT). |
| |
| After a successful match, a partial match, or one of the invalid UTF |
| errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can |
| be called. After a successful or partial match it returns the code unit |
| offset of the character at which the match started. For a non-partial |
| match, this can be different to the value of ovector[0] if the pattern |
| contains the \K escape sequence. After a partial match, however, this |
| value is always the same as ovector[0] because \K does not affect the |
| result of a partial match. |
| |
| After a UTF check failure, pcre2_get_startchar() can be used to obtain |
| the code unit offset of the invalid UTF character. Details are given in |
| the pcre2unicode page. |
| |
| |
| ERROR RETURNS FROM pcre2_match() |
| |
| If pcre2_match() fails, it returns a negative number. This can be con- |
| verted to a text string by calling the pcre2_get_error_message() func- |
| tion (see "Obtaining a textual error message" below). Negative error |
| codes are also returned by other functions, and are documented with |
| them. The codes are given names in the header file. If UTF checking is |
| in force and an invalid UTF subject string is detected, one of a number |
| of UTF-specific negative error codes is returned. Details are given in |
| the pcre2unicode page. The following are the other errors that may be |
| returned by pcre2_match(): |
| |
| PCRE2_ERROR_NOMATCH |
| |
| The subject string did not match the pattern. |
| |
| PCRE2_ERROR_PARTIAL |
| |
| The subject string did not match, but it did match partially. See the |
| pcre2partial documentation for details of partial matching. |
| |
| PCRE2_ERROR_BADMAGIC |
| |
| PCRE2 stores a 4-byte "magic number" at the start of the compiled code, |
| to catch the case when it is passed a junk pointer. This is the error |
| that is returned when the magic number is not present. |
| |
| PCRE2_ERROR_BADMODE |
| |
| This error is given when a compiled pattern is passed to a function in |
| a library of a different code unit width, for example, a pattern com- |
| piled by the 8-bit library is passed to a 16-bit or 32-bit library |
| function. |
| |
| PCRE2_ERROR_BADOFFSET |
| |
| The value of startoffset was greater than the length of the subject. |
| |
| PCRE2_ERROR_BADOPTION |
| |
| An unrecognized bit was set in the options argument. |
| |
| PCRE2_ERROR_BADUTFOFFSET |
| |
| The UTF code unit sequence that was passed as a subject was checked and |
| found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the |
| value of startoffset did not point to the beginning of a UTF character |
| or the end of the subject. |
| |
| PCRE2_ERROR_CALLOUT |
| |
| This error is never generated by pcre2_match() itself. It is provided |
| for use by callout functions that want to cause pcre2_match() or |
| pcre2_callout_enumerate() to return a distinctive error code. See the |
| pcre2callout documentation for details. |
| |
| PCRE2_ERROR_DEPTHLIMIT |
| |
| The nested backtracking depth limit was reached. |
| |
| PCRE2_ERROR_HEAPLIMIT |
| |
| The heap limit was reached. |
| |
| PCRE2_ERROR_INTERNAL |
| |
| An unexpected internal error has occurred. This error could be caused |
| by a bug in PCRE2 or by overwriting of the compiled pattern. |
| |
| PCRE2_ERROR_JIT_STACKLIMIT |
| |
| This error is returned when a pattern that was successfully processed |
| by pcre2_jit_compile() is being matched, but the memory available for |
| the just-in-time processing stack is not large enough. See the |
| pcre2jit documentation for more details. |
| |
| PCRE2_ERROR_MATCHLIMIT |
| |
| The backtracking match limit was reached. |
| |
| PCRE2_ERROR_NOMEMORY |
| |
| Heap memory is used to remember backtracking points. This error is |
| given when the memory allocation function (default or custom) fails. |
| Note that a different error, PCRE2_ERROR_HEAPLIMIT, is given if the |
| amount of memory needed exceeds the heap limit. PCRE2_ERROR_NOMEMORY is |
| also returned if PCRE2_COPY_MATCHED_SUBJECT is set and memory alloca- |
| tion fails. |
| |
| PCRE2_ERROR_NULL |
| |
| Either the code, subject, or match_data argument was passed as NULL. |
| |
| PCRE2_ERROR_RECURSELOOP |
| |
| This error is returned when pcre2_match() detects a recursion loop |
| within the pattern. Specifically, it means that either the whole pat- |
| tern or a capture group has been called recursively for the second time |
| at the same position in the subject string. Some simple patterns that |
| might do this are detected and faulted at compile time, but more com- |
| plicated cases, in particular mutual recursions between two different |
| groups, cannot be detected until matching is attempted. |
| |
| |
| OBTAINING A TEXTUAL ERROR MESSAGE |
| |
| int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer, |
| PCRE2_SIZE bufflen); |
| |
| A text message for an error code from any PCRE2 function (compile, |
| match, or auxiliary) can be obtained by calling pcre2_get_error_mes- |
| sage(). The code is passed as the first argument, with the remaining |
| two arguments specifying a code unit buffer and its length in code |
| units, into which the text message is placed. The message is returned |
| in code units of the appropriate width for the library that is being |
| used. |
| |
| The returned message is terminated with a trailing zero, and the func- |
| tion returns the number of code units used, excluding the trailing |
| zero. If the error number is unknown, the negative error code PCRE2_ER- |
| ROR_BADDATA is returned. If the buffer is too small, the message is |
| truncated (but still with a trailing zero), and the negative error code |
| PCRE2_ERROR_NOMEMORY is returned. None of the messages are very long; |
| a buffer size of 120 code units is ample. |
| |
| |
| EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
| |
| int pcre2_substring_length_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_SIZE *length); |
| |
| int pcre2_substring_copy_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_UCHAR *buffer, |
| PCRE2_SIZE *bufflen); |
| |
| int pcre2_substring_get_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_UCHAR **bufferptr, |
| PCRE2_SIZE *bufflen); |
| |
| void pcre2_substring_free(PCRE2_UCHAR *buffer); |
| |
| Captured substrings can be accessed directly by using the ovector as |
| described above. For convenience, auxiliary functions are provided for |
| extracting captured substrings as new, separate, zero-terminated |
| strings. A substring that contains a binary zero is correctly extracted |
| and has a further zero added on the end, but the result is not, of |
| course, a C string. |
| |
| The functions in this section identify substrings by number. The number |
| zero refers to the entire matched substring, with higher numbers refer- |
| ring to substrings captured by parenthesized groups. After a partial |
| match, only substring zero is available. An attempt to extract any |
| other substring gives the error PCRE2_ERROR_PARTIAL. The next section |
| describes similar functions for extracting captured substrings by name. |
| |
| If a pattern uses the \K escape sequence within a positive assertion, |
| the reported start of a successful match can be greater than the end of |
| the match. For example, if the pattern (?=ab\K) is matched against |
| "ab", the start and end offset values for the match are 2 and 0. In |
| this situation, calling these functions with a zero substring number |
| extracts a zero-length empty string. |
| |
| You can find the length in code units of a captured substring without |
| extracting it by calling pcre2_substring_length_bynumber(). The first |
| argument is a pointer to the match data block, the second is the group |
| number, and the third is a pointer to a variable into which the length |
| is placed. If you just want to know whether or not the substring has |
| been captured, you can pass the third argument as NULL. |
| |
| The pcre2_substring_copy_bynumber() function copies a captured sub- |
| string into a supplied buffer, whereas pcre2_substring_get_bynumber() |
| copies it into new memory, obtained using the same memory allocation |
| function that was used for the match data block. The first two argu- |
| ments of these functions are a pointer to the match data block and a |
| capture group number. |
| |
| The final arguments of pcre2_substring_copy_bynumber() are a pointer to |
| the buffer and a pointer to a variable that contains its length in code |
| units. This is updated to contain the actual number of code units used |
| for the extracted substring, excluding the terminating zero. |
| |
| For pcre2_substring_get_bynumber() the third and fourth arguments point |
| to variables that are updated with a pointer to the new memory and the |
| number of code units that comprise the substring, again excluding the |
| terminating zero. When the substring is no longer needed, the memory |
| should be freed by calling pcre2_substring_free(). |
| |
| The return value from all these functions is zero for success, or a |
| negative error code. If the pattern match failed, the match failure |
| code is returned. If a substring number greater than zero is used af- |
| ter a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible |
| error codes are: |
| |
| PCRE2_ERROR_NOMEMORY |
| |
| The buffer was too small for pcre2_substring_copy_bynumber(), or the |
| attempt to get memory failed for pcre2_substring_get_bynumber(). |
| |
| PCRE2_ERROR_NOSUBSTRING |
| |
| There is no substring with that number in the pattern, that is, the |
| number is greater than the number of capturing parentheses. |
| |
| PCRE2_ERROR_UNAVAILABLE |
| |
| The substring number, though not greater than the number of captures in |
| the pattern, is greater than the number of slots in the ovector, so the |
| substring could not be captured. |
| |
| PCRE2_ERROR_UNSET |
| |
| The substring did not participate in the match. For example, if the |
| pattern is (abc)|(def) and the subject is "def", and the ovector con- |
| tains at least two capturing slots, substring number 1 is unset. |
| |
| |
| EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS |
| |
| int pcre2_substring_list_get(pcre2_match_data *match_data, |
| PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr); |
| |
| void pcre2_substring_list_free(PCRE2_SPTR *list); |
| |
| The pcre2_substring_list_get() function extracts all available sub- |
| strings and builds a list of pointers to them. It also (optionally) |
| builds a second list that contains their lengths (in code units), ex- |
| cluding a terminating zero that is added to each of them. All this is |
| done in a single block of memory that is obtained using the same memory |
| allocation function that was used to get the match data block. |
| |
| This function must be called only after a successful match. If called |
| after a partial match, the error code PCRE2_ERROR_PARTIAL is returned. |
| |
| The address of the memory block is returned via listptr, which is also |
| the start of the list of string pointers. The end of the list is marked |
| by a NULL pointer. The address of the list of lengths is returned via |
| lengthsptr. If your strings do not contain binary zeros and you do not |
| therefore need the lengths, you may supply NULL as the lengthsptr argu- |
| ment to disable the creation of a list of lengths. The yield of the |
| function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem- |
| ory block could not be obtained. When the list is no longer needed, it |
| should be freed by calling pcre2_substring_list_free(). |
| |
| If this function encounters a substring that is unset, which can happen |
| when capture group number n+1 matches some part of the subject, but |
| group n has not been used at all, it returns an empty string. This can |
| be distinguished from a genuine zero-length substring by inspecting the |
| appropriate offset in the ovector, which contains PCRE2_UNSET for unset |
| substrings, or by calling pcre2_substring_length_bynumber(). |
| |
| |
| EXTRACTING CAPTURED SUBSTRINGS BY NAME |
| |
| int pcre2_substring_number_from_name(const pcre2_code *code, |
| PCRE2_SPTR name); |
| |
| int pcre2_substring_length_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_SIZE *length); |
| |
| int pcre2_substring_copy_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen); |
| |
| int pcre2_substring_get_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen); |
| |
| void pcre2_substring_free(PCRE2_UCHAR *buffer); |
| |
| To extract a substring by name, you first have to find the associated |
| number. For example, for this pattern: |
| |
| (a+)b(?<xxx>\d+)... |
| |
| the number of the capture group called "xxx" is 2. If the name is known |
| to be unique (PCRE2_DUPNAMES was not set), you can find the number from |
| the name by calling pcre2_substring_number_from_name(). The first argu- |
| ment is the compiled pattern, and the second is the name. The yield of |
| the function is the group number, PCRE2_ERROR_NOSUBSTRING if there is |
| no group with that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there is |
| more than one group with that name. Given the number, you can extract |
| the substring directly from the ovector, or use one of the "bynumber" |
| functions described above. |
| |
| For convenience, there are also "byname" functions that correspond to |
| the "bynumber" functions, the only difference being that the second ar- |
| gument is a name instead of a number. If PCRE2_DUPNAMES is set and |
| there are duplicate names, these functions scan all the groups with the |
| given name, and return the captured substring from the first named |
| group that is set. |
| |
| If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is |
| returned. If all groups with the name have numbers that are greater |
| than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is re- |
| turned. If there is at least one group with a slot in the ovector, but |
| no group is found to be set, PCRE2_ERROR_UNSET is returned. |
| |
| Warning: If the pattern uses the (?| feature to set up multiple capture |
| groups with the same number, as described in the section on duplicate |
| group numbers in the pcre2pattern page, you cannot use names to distin- |
| guish the different capture groups, because names are not included in |
| the compiled code. The matching process uses only numbers. For this |
| reason, the use of different names for groups with the same number |
| causes an error at compile time. |
| |
| |
| CREATING A NEW STRING WITH SUBSTITUTIONS |
| |
| int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext, PCRE2_SPTR replacement, |
| PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer, |
| PCRE2_SIZE *outlengthptr); |
| |
| This function optionally calls pcre2_match() and then makes a copy of |
| the subject string in outputbuffer, replacing parts that were matched |
| with the replacement string, whose length is supplied in rlength, which |
| can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As |
| a special case, if replacement is NULL and rlength is zero, the re- |
| placement is assumed to be an empty string. If rlength is non-zero, an |
| error occurs if replacement is NULL. |
| |
| There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re- |
| turn just the replacement string(s). The default action is to perform |
| just one replacement if the pattern matches, but there is an option |
| that requests multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL be- |
| low). |
| |
| If successful, pcre2_substitute() returns the number of substitutions |
| that were carried out. This may be zero if no match was found, and is |
| never greater than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A nega- |
| tive value is returned if an error is detected. |
| |
| Matches in which a \K item in a lookahead in the pattern causes the |
| match to end before it starts are not supported, and give rise to an |
| error return. For global replacements, matches in which \K in a lookbe- |
| hind causes the match to start earlier than the point that was reached |
| in the previous iteration are also not supported. |
| |
| The first seven arguments of pcre2_substitute() are the same as for |
| pcre2_match(), except that the partial matching options are not permit- |
| ted, and match_data may be passed as NULL, in which case a match data |
| block is obtained and freed within this function, using memory manage- |
| ment functions from the match context, if provided, or else those that |
| were used to allocate memory for the compiled code. |
| |
| If match_data is not NULL and PCRE2_SUBSTITUTE_MATCHED is not set, the |
| provided block is used for all calls to pcre2_match(), and its contents |
| afterwards are the result of the final call. For global changes, this |
| will always be a no-match error. The contents of the ovector within the |
| match data block may or may not have been changed. |
| |
| As well as the usual options for pcre2_match(), a number of additional |
| options can be set in the options argument of pcre2_substitute(). One |
| such option is PCRE2_SUBSTITUTE_MATCHED. When this is set, an external |
| match_data block must be provided, and it must have already been used |
| for an external call to pcre2_match() with the same pattern and subject |
| arguments. The data in the match_data block (return code, offset vec- |
| tor) is then used for the first substitution instead of calling |
| pcre2_match() from within pcre2_substitute(). This allows an applica- |
| tion to check for a match before choosing to substitute, without having |
| to repeat the match. |
| |
| The contents of the externally supplied match data block are not |
| changed when PCRE2_SUBSTITUTE_MATCHED is set. If PCRE2_SUBSTI- |
| TUTE_GLOBAL is also set, pcre2_match() is called after the first sub- |
| stitution to check for further matches, but this is done using an in- |
| ternally obtained match data block, thus always leaving the external |
| block unchanged. |
| |
| The code argument is not used for matching before the first substitu- |
| tion when PCRE2_SUBSTITUTE_MATCHED is set, but it must be provided, |
| even when PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains in- |
| formation such as the UTF setting and the number of capturing parenthe- |
| ses in the pattern. |
| |
| The default action of pcre2_substitute() is to return a copy of the |
| subject string with matched substrings replaced. However, if PCRE2_SUB- |
| STITUTE_REPLACEMENT_ONLY is set, only the replacement substrings are |
| returned. In the global case, multiple replacements are concatenated in |
| the output buffer. Substitution callouts (see below) can be used to |
| separate them if necessary. |
| |
| The outlengthptr argument of pcre2_substitute() must point to a vari- |
| able that contains the length, in code units, of the output buffer. If |
| the function is successful, the value is updated to contain the length |
| in code units of the new string, excluding the trailing zero that is |
| automatically added. |
| |
| If the function is not successful, the value set via outlengthptr de- |
| pends on the type of error. For syntax errors in the replacement |
| string, the value is the offset in the replacement string where the er- |
| ror was detected. For other errors, the value is PCRE2_UNSET by de- |
| fault. This includes the case of the output buffer being too small, un- |
| less PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set. |
| |
| PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output |
| buffer is too small. The default action is to return PCRE2_ERROR_NOMEM- |
| ORY immediately. If this option is set, however, pcre2_substitute() |
| continues to go through the motions of matching and substituting (with- |
| out, of course, writing anything) in order to compute the size of buf- |
| fer that is needed. This value is passed back via the outlengthptr |
| variable, with the result of the function still being PCRE2_ER- |
| ROR_NOMEMORY. |
| |
| Passing a buffer size of zero is a permitted way of finding out how |
| much memory is needed for a given substitution. However, this does mean |
| that the entire operation is carried out twice. Depending on the appli- |
| cation, it may be more efficient to allocate a large buffer and free |
| the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER- |
| FLOW_LENGTH. |
| |
| The replacement string, which is interpreted as a UTF string in UTF |
| mode, is checked for UTF validity unless PCRE2_NO_UTF_CHECK is set. An |
| invalid UTF replacement string causes an immediate return with the rel- |
| evant UTF error code. |
| |
| If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is not in- |
| terpreted in any way. By default, however, a dollar character is an es- |
| cape character that can specify the insertion of characters from cap- |
| ture groups and names from (*MARK) or other control verbs in the pat- |
| tern. The following forms are always recognized: |
| |
| $$ insert a dollar character |
| $<n> or ${<n>} insert the contents of group <n> |
| $*MARK or ${*MARK} insert a control verb name |
| |
| Either a group number or a group name can be given for <n>. Curly |
| brackets are required only if the following character would be inter- |
| preted as part of the number or name. The number may be zero to include |
| the entire matched string. For example, if the pattern a(b)c is |
| matched with "=abc=" and the replacement string "+$1$0$1+", the result |
| is "=+babcb+=". |
| |
| $*MARK inserts the name from the last encountered backtracking control |
| verb on the matching path that has a name. (*MARK) must always include |
| a name, but the other verbs need not. For example, in the case of |
| (*MARK:A)(*PRUNE) the name inserted is "A", but for (*MARK:A)(*PRUNE:B) |
| the relevant name is "B". This facility can be used to perform simple |
| simultaneous substitutions, as this pcre2test example shows: |
| |
| /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK} |
| apple lemon |
| 2: pear orange |
| |
| PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject |
| string, replacing every matching substring. If this option is not set, |
| only the first matching substring is replaced. The search for matches |
| takes place in the original subject string (that is, previous replace- |
| ments do not affect it). Iteration is implemented by advancing the |
| startoffset value for each search, which is always passed the entire |
| subject string. If an offset limit is set in the match context, search- |
| ing stops when that limit is reached. |
| |
| You can restrict the effect of a global substitution to a portion of |
| the subject string by setting either or both of startoffset and an off- |
| set limit. Here is a pcre2test example: |
| |
| /B/g,replace=!,use_offset_limit |
| ABC ABC ABC ABC\=offset=3,offset_limit=12 |
| 2: ABC A!C A!C ABC |
| |
| When continuing with global substitutions after matching a substring |
| with zero length, an attempt to find a non-empty match at the same off- |
| set is performed. If this is not successful, the offset is advanced by |
| one character except when CRLF is a valid newline sequence and the next |
| two characters are CR, LF. In this case, the offset is advanced by two |
| characters. |
| |
| PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that |
| do not appear in the pattern to be treated as unset groups. This option |
| should be used with care, because it means that a typo in a group name |
| or number no longer causes the PCRE2_ERROR_NOSUBSTRING error. |
| |
| PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including un- |
| known groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated |
| as empty strings when inserted as described above. If this option is |
| not set, an attempt to insert an unset group causes the PCRE2_ERROR_UN- |
| SET error. This option does not influence the extended substitution |
| syntax described below. |
| |
| PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the |
| replacement string. Without this option, only the dollar character is |
| special, and only the group insertion forms listed above are valid. |
| When PCRE2_SUBSTITUTE_EXTENDED is set, two things change: |
| |
| Firstly, backslash in a replacement string is interpreted as an escape |
| character. The usual forms such as \n or \x{ddd} can be used to specify |
| particular character codes, and backslash followed by any non-alphanu- |
| meric character quotes that character. Extended quoting can be coded |
| using \Q...\E, exactly as in pattern strings. |
| |
| There are also four escape sequences for forcing the case of inserted |
| letters. The insertion mechanism has three states: no case forcing, |
| force upper case, and force lower case. The escape sequences change the |
| current state: \U and \L change to upper or lower case forcing, respec- |
| tively, and \E (when not terminating a \Q quoted sequence) reverts to |
| no case forcing. The sequences \u and \l force the next character (if |
| it is a letter) to upper or lower case, respectively, and then the |
| state automatically reverts to no case forcing. Case forcing applies to |
| all inserted characters, including those from capture groups and let- |
| ters within \Q...\E quoted sequences. If either PCRE2_UTF or PCRE2_UCP |
| was set when the pattern was compiled, Unicode properties are used for |
| case forcing characters whose code points are greater than 127. |
| |
| Note that case forcing sequences such as \U...\E do not nest. For exam- |
| ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final |
| \E has no effect. Note also that the PCRE2_ALT_BSUX and PCRE2_EX- |
| TRA_ALT_BSUX options do not apply to replacement strings. |
| |
| The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more |
| flexibility to capture group substitution. The syntax is similar to |
| that used by Bash: |
| |
| ${<n>:-<string>} |
| ${<n>:+<string1>:<string2>} |
| |
| As before, <n> may be a group number or a name. The first form speci- |
| fies a default value. If group <n> is set, its value is inserted; if |
| not, <string> is expanded and the result inserted. The second form |
| specifies strings that are expanded and inserted when group <n> is set |
| or unset, respectively. The first form is just a convenient shorthand |
| for |
| |
| ${<n>:+${<n>}:<string>} |
| |
| Backslash can be used to escape colons and closing curly brackets in |
| the replacement strings. A change of the case forcing state within a |
| replacement string remains in force afterwards, as shown in this |
| pcre2test example: |
| |
| /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo |
| body |
| 1: hello |
| somebody |
| 1: HELLO |
| |
| The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended |
| substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause un- |
| known groups in the extended syntax forms to be treated as unset. |
| |
| If PCRE2_SUBSTITUTE_LITERAL is set, PCRE2_SUBSTITUTE_UNKNOWN_UNSET, |
| PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrele- |
| vant and are ignored. |
| |
| Substitution errors |
| |
| In the event of an error, pcre2_substitute() returns a negative error |
| code. Except for PCRE2_ERROR_NOMATCH (which is never returned), errors |
| from pcre2_match() are passed straight back. |
| |
| PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser- |
| tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set. |
| |
| PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ- |
| ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) |
| when the simple (non-extended) syntax is used and PCRE2_SUBSTITUTE_UN- |
| SET_EMPTY is not set. |
| |
| PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big |
| enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size |
| of buffer that is needed is returned via outlengthptr. Note that this |
| does not happen by default. |
| |
| PCRE2_ERROR_NULL is returned if PCRE2_SUBSTITUTE_MATCHED is set but the |
| match_data argument is NULL or if the subject or replacement arguments |
| are NULL. For backward compatibility reasons an exception is made for |
| the replacement argument if the rlength argument is also 0. |
| |
| PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in |
| the replacement string, with more particular errors being PCRE2_ER- |
| ROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE |
| (closing curly bracket not found), PCRE2_ERROR_BADSUBSTITUTION (syntax |
| error in extended group substitution), and PCRE2_ERROR_BADSUBSPATTERN |
| (the pattern match ended before it started or the match started earlier |
| than the current position in the subject, which can happen if \K is |
| used in an assertion). |
| |
| As for all PCRE2 errors, a text message that describes the error can be |
| obtained by calling the pcre2_get_error_message() function (see "Ob- |
| taining a textual error message" above). |
| |
| Substitution callouts |
| |
| int pcre2_set_substitute_callout(pcre2_match_context *mcontext, |
| int (*callout_function)(pcre2_substitute_callout_block *, void *), |
| void *callout_data); |
| |
| The pcre2_set_substitute_callout() function can be used to specify a |
| callout function for pcre2_substitute(). This information is passed in |
| a match context. The callout function is called after each substitution |
| has been processed, but it can cause the replacement not to happen. The |
| callout function is not called for simulated substitutions that happen |
| as a result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option. |
| |
| The first argument of the callout function is a pointer to a substitute |
| callout block structure, which contains the following fields, not nec- |
| essarily in this order: |
| |
| uint32_t version; |
| uint32_t subscount; |
| PCRE2_SPTR input; |
| PCRE2_SPTR output; |
| PCRE2_SIZE *ovector; |
| uint32_t oveccount; |
| PCRE2_SIZE output_offsets[2]; |
| |
| The version field contains the version number of the block format. The |
| current version is 0. The version number will increase in future if |
| more fields are added, but the intention is never to remove any of the |
| existing fields. |
| |
| The subscount field is the number of the current match. It is 1 for the |
| first callout, 2 for the second, and so on. The input and output point- |
| ers are copies of the values passed to pcre2_substitute(). |
| |
| The ovector field points to the ovector, which contains the result of |
| the most recent match. The oveccount field contains the number of pairs |
| that are set in the ovector, and is always greater than zero. |
| |
| The output_offsets vector contains the offsets of the replacement in |
| the output string. This has already been processed for dollar and (if |
| requested) backslash substitutions as described above. |
| |
| The second argument of the callout function is the value passed as |
| callout_data when the function was registered. The value returned by |
| the callout function is interpreted as follows: |
| |
| If the value is zero, the replacement is accepted, and, if PCRE2_SUB- |
| STITUTE_GLOBAL is set, processing continues with a search for the next |
| match. If the value is not zero, the current replacement is not ac- |
| cepted. If the value is greater than zero, processing continues when |
| PCRE2_SUBSTITUTE_GLOBAL is set. Otherwise (the value is less than zero |
| or PCRE2_SUBSTITUTE_GLOBAL is not set), the rest of the input is |
| copied to the output and the call to pcre2_substitute() exits, return- |
| ing the number of matches so far. |
| |
| |
| DUPLICATE CAPTURE GROUP NAMES |
| |
| int pcre2_substring_nametable_scan(const pcre2_code *code, |
| PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last); |
| |
| When a pattern is compiled with the PCRE2_DUPNAMES option, names for |
| capture groups are not required to be unique. Duplicate names are al- |
| ways allowed for groups with the same number, created by using the (?| |
| feature. Indeed, if such groups are named, they are required to use the |
| same names. |
| |
| Normally, patterns that use duplicate names are such that in any one |
| match, only one of each set of identically-named groups participates. |
| An example is shown in the pcre2pattern documentation. |
| |
| When duplicates are present, pcre2_substring_copy_byname() and |
| pcre2_substring_get_byname() return the first substring corresponding |
| to the given name that is set. PCRE2_ERROR_UNSET is returned only if |
| none are set. The pcre2_substring_number_from_name() function returns |
| the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are duplicate |
| names. |
| |
| If you want to get full details of all captured substrings for a given |
| name, you must use the pcre2_substring_nametable_scan() function. The |
| first argument is the compiled pattern, and the second is the name. If |
| the third and fourth arguments are NULL, the function returns a group |
| number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise. |
| |
| When the third and fourth arguments are not NULL, they must be pointers |
| to variables that are updated by the function. After it has run, they |
| point to the first and last entries in the name-to-number table for the |
| given name, and the function returns the length of each entry in code |
| units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are |
| no entries for the given name. |
| |
| The format of the name table is described above in the section entitled |
| "Information about a pattern". Given all the relevant entries for the |
| name, you can extract each of their numbers, and hence the captured |
| data. |
| |
| |
| FINDING ALL POSSIBLE MATCHES AT ONE POSITION |
| |
| The traditional matching function uses a similar algorithm to Perl, |
| which stops when it finds the first match at a given point in the sub- |
| ject. If you want to find all possible matches, or the longest possible |
| match at a given position, consider using the alternative matching |
| function (see below) instead. If you cannot use the alternative func- |
| tion, you can kludge it up by making use of the callout facility, which |
| is described in the pcre2callout documentation. |
| |
| What you have to do is to insert a callout right at the end of the pat- |
| tern. When your callout function is called, extract and save the cur- |
| rent matched substring. Then return 1, which forces pcre2_match() to |
| backtrack and try other alternatives. Ultimately, when it runs out of |
| matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH. |
| |
| |
| MATCHING A PATTERN: THE ALTERNATIVE FUNCTION |
| |
| int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext, |
| int *workspace, PCRE2_SIZE wscount); |
| |
| The function pcre2_dfa_match() is called to match a subject string |
| against a compiled pattern, using a matching algorithm that scans the |
| subject string just once (not counting lookaround assertions), and does |
| not backtrack (except when processing lookaround assertions). This has |
| different characteristics to the normal algorithm, and is not compati- |
| ble with Perl. Some of the features of PCRE2 patterns are not sup- |
| ported. Nevertheless, there are times when this kind of matching can be |
| useful. For a discussion of the two matching algorithms, and a list of |
| features that pcre2_dfa_match() does not support, see the pcre2matching |
| documentation. |
| |
| The arguments for the pcre2_dfa_match() function are the same as for |
| pcre2_match(), plus two extras. The ovector within the match data block |
| is used in a different way, and this is described below. The other com- |
| mon arguments are used in the same way as for pcre2_match(), so their |
| description is not repeated here. |
| |
| The two additional arguments provide workspace for the function. The |
| workspace vector should contain at least 20 elements. It is used for |
| keeping track of multiple paths through the pattern tree. More |
| workspace is needed for patterns and subjects where there are a lot of |
| potential matches. |
| |
| Here is an example of a simple call to pcre2_dfa_match(): |
| |
| int wspace[20]; |
| pcre2_match_data *md = pcre2_match_data_create(4, NULL); |
| int rc = pcre2_dfa_match( |
| re, /* result of pcre2_compile() */ |
| "some string", /* the subject string */ |
| 11, /* the length of the subject string */ |
| 0, /* start at offset 0 in the subject */ |
| 0, /* default options */ |
| md, /* the match data block */ |
| NULL, /* a match context; NULL means use defaults */ |
| wspace, /* working space vector */ |
| 20); /* number of elements (NOT size in bytes) */ |
| |
| Option bits for pcre2_dfa_match() |
| |
| The unused bits of the options argument for pcre2_dfa_match() must be |
| zero. The only bits that may be set are PCRE2_ANCHORED, |
| PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO- |
| TEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, |
| PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and |
| PCRE2_DFA_RESTART. All but the last four of these are exactly the same |
| as for pcre2_match(), so their description is not repeated here. |
| |
| PCRE2_PARTIAL_HARD |
| PCRE2_PARTIAL_SOFT |
| |
| These have the same general effect as they do for pcre2_match(), but |
| the details are slightly different. When PCRE2_PARTIAL_HARD is set for |
| pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the |
| subject is reached and there is still at least one matching possibility |
| that requires additional characters. This happens even if some complete |
| matches have already been found. When PCRE2_PARTIAL_SOFT is set, the |
| return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL |
| if the end of the subject is reached, there have been no complete |
| matches, but there is still at least one matching possibility. The por- |
| tion of the string that was inspected when the longest partial match |
| was found is set as the first matching string in both cases. There is a |
| more detailed discussion of partial and multi-segment matching, with |
| examples, in the pcre2partial documentation. |
| |
| PCRE2_DFA_SHORTEST |
| |
| Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to |
| stop as soon as it has found one match. Because of the way the alterna- |
| tive algorithm works, this is necessarily the shortest possible match |
| at the first possible matching point in the subject string. |
| |
| PCRE2_DFA_RESTART |
| |
| When pcre2_dfa_match() returns a partial match, it is possible to call |
| it again, with additional subject characters, and have it continue with |
| the same match. The PCRE2_DFA_RESTART option requests this action; when |
| it is set, the workspace and wscount arguments must reference the same |
| vector as before because data about the match so far is left in them |
| after a partial match. There is more discussion of this facility in the |
| pcre2partial documentation. |
| |
| Successful returns from pcre2_dfa_match() |
| |
| When pcre2_dfa_match() succeeds, it may have matched more than one sub- |
| string in the subject. Note, however, that all the matches from one run |
| of the function start at the same point in the subject. The shorter |
| matches are all initial substrings of the longer matches. For example, |
| if the pattern |
| |
| <.*> |
| |
| is matched against the string |
| |
| This is <something> <something else> <something further> no more |
| |
| the three matched strings are |
| |
| <something> <something else> <something further> |
| <something> <something else> |
| <something> |
| |
| On success, the yield of the function is a number greater than zero, |
| which is the number of matched substrings. The offsets of the sub- |
| strings are returned in the ovector, and can be extracted by number in |
| the same way as for pcre2_match(), but the numbers bear no relation to |
| any capture groups that may exist in the pattern, because DFA matching |
| does not support capturing. |
| |
| Calls to the convenience functions that extract substrings by name re- |
| turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af- |
| ter a DFA match. The convenience functions that extract substrings by |
| number never return PCRE2_ERROR_NOSUBSTRING. |
| |
| The matched strings are stored in the ovector in reverse order of |
| length; that is, the longest matching string is first. If there were |
| too many matches to fit into the ovector, the yield of the function is |
| zero, and the vector is filled with the longest matches. |
| |
| NOTE: PCRE2's "auto-possessification" optimization usually applies to |
| character repeats at the end of a pattern (as well as internally). For |
| example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA |
| matching, this means that only one possible match is found. If you re- |
| ally do want multiple matches in such cases, either use an ungreedy re- |
| peat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when com- |
| piling. |
| |
| Error returns from pcre2_dfa_match() |
| |
| The pcre2_dfa_match() function returns a negative number when it fails. |
| Many of the errors are the same as for pcre2_match(), as described |
| above. There are in addition the following errors that are specific to |
| pcre2_dfa_match(): |
| |
| PCRE2_ERROR_DFA_UITEM |
| |
| This return is given if pcre2_dfa_match() encounters an item in the |
| pattern that it does not support, for instance, the use of \C in UTF |
| mode or a backreference. |
| |
| PCRE2_ERROR_DFA_UCOND |
| |
| This return is given if pcre2_dfa_match() encounters a condition item |
| that uses a backreference for the condition, or a test for recursion in |
| a specific capture group. These are not supported. |
| |
| PCRE2_ERROR_DFA_UINVALID_UTF |
| |
| This return is given if pcre2_dfa_match() is called for a pattern that |
| was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for |
| DFA matching. |
| |
| PCRE2_ERROR_DFA_WSSIZE |
| |
| This return is given if pcre2_dfa_match() runs out of space in the |
| workspace vector. |
| |
| PCRE2_ERROR_DFA_RECURSE |
| |
| When a recursion or subroutine call is processed, the matching function |
| calls itself recursively, using private memory for the ovector and |
| workspace. This error is given if the internal ovector is not large |
| enough. This should be extremely rare, as a vector of size 1000 is |
| used. |
| |
| PCRE2_ERROR_DFA_BADRESTART |
| |
| When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option, |
| some plausibility checks are made on the contents of the workspace, |
| which should contain data about the previous partial match. If any of |
| these checks fail, this error is given. |
| |
| |
| SEE ALSO |
| |
| pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3), |
| pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3). |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| Retired from University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 27 July 2022 |
| Copyright (c) 1997-2022 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| BUILDING PCRE2 |
| |
| PCRE2 is distributed with a configure script that can be used to build |
| the library in Unix-like environments using the applications known as |
| Autotools. Also in the distribution are files to support building using |
| CMake instead of configure. The text file README contains general in- |
| formation about building with Autotools (some of which is repeated be- |
| low), and also has some comments about building on various operating |
| systems. There is a lot more information about building PCRE2 without |
| using Autotools (including information about using CMake and building |
| "by hand") in the text file called NON-AUTOTOOLS-BUILD. You should |
| consult this file as well as the README file if you are building in a |
| non-Unix-like environment. |
| |
| |
| PCRE2 BUILD-TIME OPTIONS |
| |
| The rest of this document describes the optional features of PCRE2 that |
| can be selected when the library is compiled. It assumes use of the |
| configure script, where the optional features are selected or dese- |
| lected by providing options to configure before running the make com- |
| mand. However, the same options can be selected in both Unix-like and |
| non-Unix-like environments if you are using CMake instead of configure |
| to build PCRE2. |
| |
| If you are not using Autotools or CMake, option selection can be done |
| by editing the config.h file, or by passing parameter settings to the |
| compiler, as described in NON-AUTOTOOLS-BUILD. |
| |
| The complete list of options for configure (which includes the standard |
| ones such as the selection of the installation directory) can be ob- |
| tained by running |
| |
| ./configure --help |
| |
| The following sections include descriptions of "on/off" options whose |
| names begin with --enable or --disable. Because of the way that config- |
| ure works, --enable and --disable always come in pairs, so the comple- |
| mentary option always exists as well, but as it specifies the default, |
| it is not described. Options that specify values have names that start |
| with --with. At the end of a configure run, a summary of the configura- |
| tion is output. |
| |
| |
| BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES |
| |
| By default, a library called libpcre2-8 is built, containing functions |
| that take string arguments contained in arrays of bytes, interpreted |
| either as single-byte characters, or UTF-8 strings. You can also build |
| two other libraries, called libpcre2-16 and libpcre2-32, which process |
| strings that are contained in arrays of 16-bit and 32-bit code units, |
| respectively. These can be interpreted either as single-unit characters |
| or UTF-16/UTF-32 strings. To build these additional libraries, add one |
| or both of the following to the configure command: |
| |
| --enable-pcre2-16 |
| --enable-pcre2-32 |
| |
| If you do not want the 8-bit library, add |
| |
| --disable-pcre2-8 |
| |
| as well. At least one of the three libraries must be built. Note that |
| the POSIX wrapper is for the 8-bit library only, and that pcre2grep is |
| an 8-bit program. Neither of these is built if you select only the |
| 16-bit or 32-bit libraries. |
| |
| |
| BUILDING SHARED AND STATIC LIBRARIES |
| |
| The Autotools PCRE2 building process uses libtool to build both shared |
| and static libraries by default. You can suppress an unwanted library |
| by adding one of |
| |
| --disable-shared |
| --disable-static |
| |
| to the configure command. |
| |
| |
| UNICODE AND UTF SUPPORT |
| |
| By default, PCRE2 is built with support for Unicode and UTF character |
| strings. To build it without Unicode support, add |
| |
| --disable-unicode |
| |
| to the configure command. This setting applies to all three libraries. |
| It is not possible to build one library with Unicode support and an- |
| other without in the same configuration. |
| |
| Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, |
| UTF-16 or UTF-32. To do that, applications that use the library can set |
| the PCRE2_UTF option when they call pcre2_compile() to compile a pat- |
| tern. Alternatively, patterns may be started with (*UTF) unless the |
| application has locked this out by setting PCRE2_NEVER_UTF. |
| |
| UTF support allows the libraries to process character code points up to |
| 0x10ffff in the strings that they handle. Unicode support also gives |
| access to the Unicode properties of characters, using pattern escapes |
| such as \P, \p, and \X. Only the general category properties such as Lu |
| and Nd, script names, and some bi-directional properties are supported. |
| Details are given in the pcre2pattern documentation. |
| |
| Pattern escapes such as \d and \w do not by default make use of Unicode |
| properties. The application can request that they do by setting the |
| PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a |
| pattern may also request this by starting with (*UCP). |
| |
| |
| DISABLING THE USE OF \C |
| |
| The \C escape sequence, which matches a single code unit, even in a UTF |
| mode, can cause unpredictable behaviour because it may leave the cur- |
| rent matching point in the middle of a multi-code-unit character. The |
| application can lock it out by setting the PCRE2_NEVER_BACKSLASH_C op- |
| tion when calling pcre2_compile(). There is also a build-time option |
| |
| --enable-never-backslash-C |
| |
| (note the upper case C) which locks out the use of \C entirely. |
| |
| |
| JUST-IN-TIME COMPILER SUPPORT |
| |
| Just-in-time (JIT) compiler support is included in the build by speci- |
| fying |
| |
| --enable-jit |
| |
| This support is available only for certain hardware architectures. If |
| this option is set for an unsupported architecture, a build error |
| occurs. If in doubt, use |
| |
| --enable-jit=auto |
| |
| which enables JIT only if the current hardware is supported. You can |
| check if JIT is enabled in the configuration summary that is output at |
| the end of a configure run. If you are enabling JIT under SELinux you |
| may also want to add |
| |
| --enable-jit-sealloc |
| |
| which enables the use of an execmem allocator in JIT that is compatible |
| with SELinux. This has no effect if JIT is not enabled. See the |
| pcre2jit documentation for a discussion of JIT usage. When JIT support |
| is enabled, pcre2grep automatically makes use of it, unless you add |
| |
| --disable-pcre2grep-jit |
| |
| to the configure command. |
| |
| |
| NEWLINE RECOGNITION |
| |
| By default, PCRE2 interprets the linefeed (LF) character as indicating |
| the end of a line. This is the normal newline character on Unix-like |
| systems. You can compile PCRE2 to use carriage return (CR) instead, by |
| adding |
| |
| --enable-newline-is-cr |
| |
| to the configure command. There is also an --enable-newline-is-lf op- |
| tion, which explicitly specifies linefeed as the newline character. |
| |
| Alternatively, you can specify that line endings are to be indicated by |
| the two-character sequence CRLF (CR immediately followed by LF). If you |
| want this, add |
| |
| --enable-newline-is-crlf |
| |
| to the configure command. There is a fourth option, specified by |
| |
| --enable-newline-is-anycrlf |
| |
| which causes PCRE2 to recognize any of the three sequences CR, LF, or |
| CRLF as indicating a line ending. A fifth option, specified by |
| |
| --enable-newline-is-any |
| |
| causes PCRE2 to recognize any Unicode newline sequence. The Unicode |
| newline sequences are the three just mentioned, plus the single charac- |
| ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line, |
| U+0085), LS (line separator, U+2028), and PS (paragraph separator, |
| U+2029). The final option is |
| |
| --enable-newline-is-nul |
| |
| which causes NUL (binary zero) to be set as the default line-ending |
| character. |
| |
| Whatever default line ending convention is selected when PCRE2 is built |
| can be overridden by applications that use the library. At build time |
| it is recommended to use the standard for your operating system. |
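| |
| As an illustration only, the six conventions above can be modelled by |
| a small C function that reports how many code points form a line end- |
| ing at a given position. The names newline_kind and newline_length are |
| invented for this sketch and are not part of the PCRE2 API; the sub- |
| ject is assumed to be an array of already-decoded code points. |
| |

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the build-time newline conventions. */
typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY, NL_NUL } newline_kind;

/* Returns the number of code points (0, 1 or 2) that form a line ending
   at s[pos] under the given convention; 0 means "no newline here". */
static size_t newline_length(const uint32_t *s, size_t len, size_t pos,
                             newline_kind kind)
{
    assert(s != NULL && pos < len);
    uint32_t c = s[pos];
    int crlf = (c == 0x0D && pos + 1 < len && s[pos + 1] == 0x0A);

    switch (kind) {
    case NL_CR:      return c == 0x0D;
    case NL_LF:      return c == 0x0A;
    case NL_CRLF:    return crlf ? 2 : 0;
    case NL_ANYCRLF: return crlf ? 2 : (c == 0x0D || c == 0x0A);
    case NL_ANY:
        if (crlf) return 2;
        /* LF, VT, FF, CR, NEL, LS, PS */
        return c == 0x0A || c == 0x0B || c == 0x0C || c == 0x0D ||
               c == 0x85 || c == 0x2028 || c == 0x2029;
    case NL_NUL:     return c == 0;
    }
    return 0;
}
```

| |
| Note that under the CRLF convention a lone CR or a lone LF is not a |
| line ending in this model, whereas under ANYCRLF and ANY a CR that is |
| followed by LF is counted as one two-character ending. |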
| |
| |
| WHAT \R MATCHES |
| |
| By default, the sequence \R in a pattern matches any Unicode newline |
| sequence, independently of what has been selected as the line ending |
| sequence. If you specify |
| |
| --enable-bsr-anycrlf |
| |
| the default is changed so that \R matches only CR, LF, or CRLF. What- |
| ever is selected when PCRE2 is built can be overridden by applications |
| that use the library. |
| |
| |
| HANDLING VERY LARGE PATTERNS |
| |
| Within a compiled pattern, offset values are used to point from one |
| part to another (for example, from an opening parenthesis to an alter- |
| nation metacharacter). By default, in the 8-bit and 16-bit libraries, |
| two-byte values are used for these offsets, leading to a maximum size |
| for a compiled pattern of around 64 thousand code units. This is suffi- |
| cient to handle all but the most gigantic patterns. Nevertheless, some |
| people do want to process truly enormous patterns, so it is possible to |
| compile PCRE2 to use three-byte or four-byte offsets by adding a set- |
| ting such as |
| |
| --with-link-size=3 |
| |
| to the configure command. The value given must be 2, 3, or 4. For the |
| 16-bit library, a value of 3 is rounded up to 4. In these libraries, |
| using longer offsets slows down the operation of PCRE2 because it has |
| to load additional data when handling them. For the 32-bit library the |
| value is always 4 and cannot be overridden; the value of --with-link- |
| size is ignored. |
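| |
| The rounding rules just described can be summarized in a few lines of |
| C. This is a sketch of the rules as stated in the text, not code taken |
| from PCRE2; effective_link_size is a hypothetical helper name. |
| |

```c
#include <assert.h>

/* Hypothetical helper: returns the link size, in bytes, that a library
   with the given code unit width (8, 16 or 32) ends up using when
   --with-link-size=requested is given to configure. */
static int effective_link_size(int code_unit_width, int requested)
{
    assert(code_unit_width == 8 || code_unit_width == 16 ||
           code_unit_width == 32);
    assert(requested >= 2 && requested <= 4);

    if (code_unit_width == 32)
        return 4;                     /* the setting is ignored */
    if (code_unit_width == 16 && requested == 3)
        return 4;                     /* 3 is rounded up to 4 */
    return requested;
}
```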
| |
| |
| LIMITING PCRE2 RESOURCE USAGE |
| |
| The pcre2_match() function increments a counter each time it goes round |
| its main loop. Putting a limit on this counter controls the amount of |
| computing resource used by a single call to pcre2_match(). The limit |
| can be changed at run time, as described in the pcre2api documentation. |
| The default is 10 million, but this can be changed by adding a setting |
| such as |
| |
| --with-match-limit=500000 |
| |
| to the configure command. This setting also applies to the |
| pcre2_dfa_match() matching function, and to JIT matching (though the |
| counting is done differently). |
| |
| The pcre2_match() function uses heap memory to record backtracking |
| points. The more nested backtracking points there are (that is, the |
| deeper the search tree), the more memory is needed. There is an upper |
| limit, specified in kibibytes (units of 1024 bytes). This limit can be |
| changed at run time, as described in the pcre2api documentation. The |
| default limit (in effect unlimited) is 20 million. You can change this |
| by a setting such as |
| |
| --with-heap-limit=500 |
| |
| which limits the amount of heap to 500 KiB. This limit applies only to |
| interpretive matching in pcre2_match() and pcre2_dfa_match(), which may |
| also use the heap for internal workspace when processing complicated |
| patterns. This limit does not apply when JIT (which has its own memory |
| arrangements) is used. |
| |
| You can also explicitly limit the depth of nested backtracking in the |
| pcre2_match() interpreter. This limit defaults to the value that is set |
| for --with-match-limit. You can set a lower default limit by adding, |
| for example, |
| |
| --with-match-limit-depth=10000 |
| |
| to the configure command. This value can be overridden at run time. |
| This depth limit indirectly limits the amount of heap memory that is |
| used, but because the size of each backtracking "frame" depends on the |
| number of capturing parentheses in a pattern, the amount of heap that |
| is used before the limit is reached varies from pattern to pattern. |
| This limit was more useful in versions before 10.30, where function re- |
| cursion was used for backtracking. |
| |
| As well as applying to pcre2_match(), the depth limit also controls the |
| depth of recursive function calls in pcre2_dfa_match(). These are used |
| for lookaround assertions, atomic groups, and recursion within pat- |
| terns. The limit does not apply to JIT matching. |
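| |
| The way the three defaults relate to each other can be sketched in C |
| as follows. The type default_limits and the functions resolve_limits |
| and heap_limit_bytes are hypothetical names used only for this illus- |
| tration, and the use of zero to mean "option not given to configure" |
| is an assumption of the sketch, not PCRE2 behaviour. |
| |

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record of the three build-time defaults. */
typedef struct {
    uint32_t match_limit;      /* main-loop iterations */
    uint32_t heap_limit_kib;   /* kibibytes (units of 1024 bytes) */
    uint32_t depth_limit;      /* nested backtracking depth */
} default_limits;

/* In this sketch, 0 stands for "option not given to configure". */
static default_limits resolve_limits(uint32_t with_match_limit,
                                     uint32_t with_heap_limit,
                                     uint32_t with_match_limit_depth)
{
    default_limits d;
    d.match_limit    = with_match_limit ? with_match_limit : 10000000u;
    d.heap_limit_kib = with_heap_limit ? with_heap_limit : 20000000u;
    /* The depth limit follows the match limit unless set explicitly. */
    d.depth_limit    = with_match_limit_depth ? with_match_limit_depth
                                              : d.match_limit;
    assert(d.depth_limit > 0);
    return d;
}

/* Converts the heap limit from KiB to bytes. */
static uint64_t heap_limit_bytes(const default_limits *d)
{
    return (uint64_t)d->heap_limit_kib * 1024u;
}
```

| |
| With --with-match-limit=500000 and --with-heap-limit=500, for example, |
| this model gives a depth limit of 500000 and a heap limit of 512000 |
| bytes. |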
| |
| |
| CREATING CHARACTER TABLES AT BUILD TIME |
| |
| PCRE2 uses fixed tables for processing characters whose code points are |
| less than 256. By default, PCRE2 is built with a set of tables that are |
| distributed in the file src/pcre2_chartables.c.dist. These tables are |
| for ASCII codes only. If you add |
| |
| --enable-rebuild-chartables |
| |
| to the configure command, the distributed tables are no longer used. |
| Instead, a program called pcre2_dftables is compiled and run. This |
| outputs the source for a new set of tables, created in the default locale of |
| your C run-time system. This method of replacing the tables does not |
| work if you are cross compiling, because pcre2_dftables needs to be run |
| on the local host and therefore not compiled with the cross compiler. |
| |
| If you need to create alternative tables when cross compiling, you will |
| have to do so "by hand". There may also be other reasons for creating |
| tables manually. To cause pcre2_dftables to be built on the local |
| host, run a normal compiling command, and then run the program with the |
| output file as its argument, for example: |
| |
| cc src/pcre2_dftables.c -o pcre2_dftables |
| ./pcre2_dftables src/pcre2_chartables.c |
| |
| This builds the tables in the default locale of the local host. If you |
| want to specify a locale, you must use the -L option: |
| |
| LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c |
| |
| You can also specify -b (with or without -L). This causes the tables to |
| be written in binary instead of as source code. A set of binary tables |
| can be loaded into memory by an application and passed to pcre2_com- |
| pile() in the same way as tables created by calling pcre2_maketables(). |
| The tables are just a string of bytes, independent of hardware charac- |
| teristics such as endianness. This means they can be bundled with an |
| application that runs in different environments, to ensure consistent |
| behaviour. |
| |
| |
| USING EBCDIC CODE |
| |
| PCRE2 assumes by default that it will run in an environment where the |
| character code is ASCII or Unicode, which is a superset of ASCII. This |
| is the case for most computer operating systems. PCRE2 can, however, be |
| compiled to run in an 8-bit EBCDIC environment by adding |
| |
| --enable-ebcdic --disable-unicode |
| |
| to the configure command. This setting implies --enable-rebuild-charta- |
| bles. You should only use it if you know that you are in an EBCDIC en- |
| vironment (for example, an IBM mainframe operating system). |
| |
| It is not possible to support both EBCDIC and UTF-8 codes in the same |
| version of the library. Consequently, --enable-unicode and --enable- |
| ebcdic are mutually exclusive. |
| |
| The EBCDIC character that corresponds to an ASCII LF is assumed to have |
| the value 0x15 by default. However, in some EBCDIC environments, 0x25 |
| is used. In such an environment you should use |
| |
| --enable-ebcdic-nl25 |
| |
| as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR |
| has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and |
| 0x25 is not chosen as LF is made to correspond to the Unicode NEL char- |
| acter (which, in Unicode, is 0x85). |
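| |
| The assignment of the two candidate values can be written out as a |
| short C sketch. The type ebcdic_newlines and the function ebcdic_new- |
| line_codes are hypothetical names; the flag nl25 stands for whether |
| --enable-ebcdic-nl25 was given. |
| |

```c
#include <assert.h>

/* Hypothetical record of the EBCDIC values used for CR, LF and NEL. */
typedef struct {
    unsigned char cr;
    unsigned char lf;
    unsigned char nel;
} ebcdic_newlines;

/* nl25 is non-zero when --enable-ebcdic-nl25 is in effect. */
static ebcdic_newlines ebcdic_newline_codes(int nl25)
{
    ebcdic_newlines n;
    n.cr  = 0x0d;                  /* same value as in ASCII */
    n.lf  = nl25 ? 0x25 : 0x15;    /* the value chosen as LF */
    n.nel = nl25 ? 0x15 : 0x25;    /* the other one corresponds to NEL */
    assert(n.lf != n.nel);
    return n;
}
```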
| |
| The options that select newline behaviour, such as --enable-newline-is- |
| cr, and equivalent run-time options, refer to these character values in |
| an EBCDIC environment. |
| |
| |
| PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS |
| |
| By default pcre2grep supports the use of callouts with string arguments |
| within the patterns it is matching. There are two kinds: one that gen- |
| erates output using local code, and another that calls an external pro- |
| gram or script. If --disable-pcre2grep-callout-fork is added to the |
| configure command, only the first kind of callout is supported; if |
| --disable-pcre2grep-callout is used, all callouts are completely ig- |
| nored. For more details of pcre2grep callouts, see the pcre2grep docu- |
| mentation. |
| |
| |
| PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT |
| |
| By default, pcre2grep reads all files as plain text. You can build it |
| so that it recognizes files whose names end in .gz or .bz2, and reads |
| them with libz or libbz2, respectively, by adding one or both of |
| |
| --enable-pcre2grep-libz |
| --enable-pcre2grep-libbz2 |
| |
| to the configure command. These options naturally require that the rel- |
| evant libraries are installed on your system. Configuration will fail |
| if they are not. |
| |
| |
| PCRE2GREP BUFFER SIZE |
| |
| pcre2grep uses an internal buffer to hold a "window" on the file it is |
| scanning, in order to be able to output "before" and "after" lines when |
| it finds a match. The default starting size of the buffer is 20KiB. The |
| buffer itself is three times this size, but because of the way it is |
| used for holding "before" lines, the longest line that is guaranteed to |
| be processable is the notional buffer size. If a longer line is encoun- |
| tered, pcre2grep automatically expands the buffer, up to a specified |
| maximum size, whose default is 1MiB or the starting size, whichever is |
| the larger. You can change the default parameter values by adding, for |
| example, |
| |
| --with-pcre2grep-bufsize=51200 |
| --with-pcre2grep-max-bufsize=2097152 |
| |
| to the configure command. The caller of pcre2grep can override these |
| values by using --buffer-size and --max-buffer-size on the command |
| line. |
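| |
| The sizes involved can be worked through with a short C sketch. The |
| function names below are hypothetical and the code only restates the |
| arithmetic given above; it is not taken from pcre2grep. |
| |

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ONE_MIB ((size_t)1024 * 1024)

/* The buffer itself is three times the notional buffer size. */
static size_t allocated_buffer_bytes(size_t notional)
{
    assert(notional <= SIZE_MAX / 3);
    return 3 * notional;
}

/* The default maximum is 1MiB or the starting size, whichever is larger. */
static size_t default_max_bufsize(size_t starting)
{
    return starting > ONE_MIB ? starting : ONE_MIB;
}
```

| |
| With the default starting size of 20KiB (20480 bytes), this gives an |
| allocation of 61440 bytes and a default maximum of 1048576 bytes. |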
| |
| |
| PCRE2TEST OPTION FOR LIBREADLINE SUPPORT |
| |
| If you add one of |
| |
| --enable-pcre2test-libreadline |
| --enable-pcre2test-libedit |
| |
| to the configure command, pcre2test is linked with the libreadline or |
| libedit library, respectively, and when its input is from a terminal, |
| it reads it using the readline() function. This provides line-editing |
| and history facilities. Note that libreadline is GPL-licensed, so if |
| you distribute a binary of pcre2test linked in this way, there may be |
| licensing issues. These can be avoided by linking instead with libedit, |
| which has a BSD licence. |
| |
| Setting --enable-pcre2test-libreadline causes the -lreadline option to |
| be added to the pcre2test build. In many operating environments with a |
| system-installed readline library this is sufficient. However, in some |
| environments (e.g. if an unmodified distribution version of readline is |
| in use), some extra configuration may be necessary. The INSTALL file |
| for libreadline says this: |
| |
| "Readline uses the termcap functions, but does not link with |
| the termcap or curses library itself, allowing applications |
| which link with readline the to choose an appropriate library." |
| |
| If your environment has not been set up so that an appropriate library |
| is automatically included, you may need to add something like |
| |
| LIBS="-lncurses" |
| |
| immediately before the configure command. |
| |
| |
| INCLUDING DEBUGGING CODE |
| |
| If you add |
| |
| --enable-debug |
| |
| to the configure command, additional debugging code is included in the |
| build. This feature is intended for use by the PCRE2 maintainers. |
| |
| |
| DEBUGGING WITH VALGRIND SUPPORT |
| |
| If you add |
| |
| --enable-valgrind |
| |
| to the configure command, PCRE2 will use valgrind annotations to mark |
| certain memory regions as unaddressable. This allows it to detect in- |
| valid memory accesses, and is mostly useful for debugging PCRE2 itself. |
| |
| |
| CODE COVERAGE REPORTING |
| |
| If your C compiler is gcc, you can build a version of PCRE2 that can |
| generate a code coverage report for its test suite. To enable this, you |
| must install lcov version 1.6 or above. Then specify |
| |
| --enable-coverage |
| |
| to the configure command and build PCRE2 in the usual way. |
| |
| Note that using ccache (a caching C compiler) is incompatible with code |
| coverage reporting. If you have configured ccache to run automatically |
| on your system, you must set the environment variable |
| |
| CCACHE_DISABLE=1 |
| |
| before running make to build PCRE2, so that ccache is not used. |
| |
| When --enable-coverage is used, the following additional targets are |
| added to the Makefile: |
| |
| make coverage |
| |
| This creates a fresh coverage report for the PCRE2 test suite. It is |
| equivalent to running "make coverage-reset", "make coverage-baseline", |
| "make check", and then "make coverage-report". |
| |
| make coverage-reset |
| |
| This zeroes the coverage counters, but does nothing else. |
| |
| make coverage-baseline |
| |
| This captures baseline coverage information. |
| |
| make coverage-report |
| |
| This creates the coverage report. |
| |
| make coverage-clean-report |
| |
| This removes the generated coverage report without cleaning the cover- |
| age data itself. |
| |
| make coverage-clean-data |
| |
| This removes the captured coverage data without removing the coverage |
| files created at compile time (*.gcno). |
| |
| make coverage-clean |
| |
| This cleans all coverage data including the generated coverage report. |
| For more information about code coverage, see the gcov and lcov docu- |
| mentation. |
| |
| |
| DISABLING THE Z AND T FORMATTING MODIFIERS |
| |
| The C99 standard defines formatting modifiers z and t for size_t and |
| ptrdiff_t values, respectively. By default, PCRE2 uses these modifiers |
| in environments other than old versions of Microsoft Visual Studio when |
| __STDC_VERSION__ is defined and has a value greater than or equal to |
| 199901L (indicating support for C99). However, there is at least one |
| environment that claims to be C99 but does not support these modifiers. |
| If |
| |
| --disable-percent-zt |
| |
| is specified, no use is made of the z or t modifiers. Instead of %td or |
| %zu, a suitable format is used depending on the size of long for the |
| platform. |
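| |
| The difference between the two approaches can be sketched as follows. |
| This is an illustration only: format_size is a hypothetical function, |
| and the choice is shown as a run-time flag so that both paths can be |
| exercised, whereas in a real build it is made at compile time. |
| |

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Formats a size_t value into buf. When use_zt is non-zero the C99 %zu
   modifier is used; otherwise the value is cast to an unsigned type
   chosen according to the size of long on the platform. */
static int format_size(char *buf, size_t buflen, size_t value, int use_zt)
{
    assert(buf != NULL && buflen > 0);
    if (use_zt)
        return snprintf(buf, buflen, "%zu", value);
    if (sizeof(size_t) <= sizeof(unsigned long))
        return snprintf(buf, buflen, "%lu", (unsigned long)value);
    return snprintf(buf, buflen, "%llu", (unsigned long long)value);
}
```

| |
| Both paths produce the same text for any value that fits in the cho- |
| sen type; only the format string and the cast differ. |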
| |
| |
| SUPPORT FOR FUZZERS |
| |
| There is a special option for use by people who want to run fuzzing |
| tests on PCRE2: |
| |
| --enable-fuzz-support |
| |
| At present this applies only to the 8-bit library. If set, it causes an |
| extra library called libpcre2-fuzzsupport.a to be built, but not in- |
| stalled. This contains a single function called LLVMFuzzerTestOneIn- |
| put() whose arguments are a pointer to a string and the length of the |
| string. When called, this function tries to compile the string as a |
| pattern, and if that succeeds, to match it. This is done both with no |
| options and with some random option bits that are generated from the |
| string. |
| |
| Setting --enable-fuzz-support also causes a binary called pcre2fuz- |
| zcheck to be created. This is normally run under valgrind or used when |
| PCRE2 is compiled with address sanitizing enabled. It calls the fuzzing |
| function and outputs information about what it is doing. The input |
| strings are specified by arguments: if an argument starts with "=" the |
| rest of it is a literal input string. Otherwise, it is assumed to be a |
| file name, and the contents of the file are the test string. |
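| |
| The argument convention can be expressed as a tiny C helper. This is |
| a sketch of the rule stated above, not the pcre2fuzzcheck source, and |
| literal_input is a hypothetical name. |
| |

```c
#include <assert.h>
#include <stddef.h>

/* Returns a pointer to the literal input string when arg starts with
   "=", or NULL when arg should be treated as a file name. */
static const char *literal_input(const char *arg)
{
    assert(arg != NULL);
    return (arg[0] == '=') ? arg + 1 : NULL;
}
```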
| |
| |
| OBSOLETE OPTION |
| |
| In versions of PCRE2 prior to 10.30, there were two ways of handling |
| backtracking in the pcre2_match() function. The default was to use the |
| system stack, but if |
| |
| --disable-stack-for-recursion |
| |
| was set, memory on the heap was used. From release 10.30 onwards this |
| has changed (the stack is no longer used) and this option now does |
| nothing except give a warning. |
| |
| |
| SEE ALSO |
| |
| pcre2api(3), pcre2-config(3). |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| Retired from University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 27 July 2022 |
| Copyright (c) 1997-2022 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2CALLOUT(3) Library Functions Manual PCRE2CALLOUT(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| SYNOPSIS |
| |
| #include <pcre2.h> |
| |
| int (*pcre2_callout)(pcre2_callout_block *, void *); |
| |
| int pcre2_callout_enumerate(const pcre2_code *code, |
| int (*callback)(pcre2_callout_enumerate_block *, void *), |
| void *user_data); |
| |
| |
| DESCRIPTION |
| |
| PCRE2 provides a feature called "callout", which is a means of tempo- |
| rarily passing control to the caller of PCRE2 in the middle of pattern |
| matching. The caller of PCRE2 provides an external function by putting |
| its entry point in a match context (see pcre2_set_callout() in the |
| pcre2api documentation). |
| |
| When using the pcre2_substitute() function, an additional callout fea- |
| ture is available. This does a callout after each change to the subject |
| string and is described in the pcre2api documentation; the rest of this |
| document is concerned with callouts during pattern matching. |
| |
| Within a regular expression, (?C<arg>) indicates a point at which the |
| external function is to be called. Different callout points can be |
| identified by putting a number less than 256 after the letter C. The |
| default value is zero. Alternatively, the argument may be a delimited |
| string. The starting delimiter must be one of ` ' " ^ % # $ { and the |
| ending delimiter is the same as the start, except for {, where the end- |
| ing delimiter is }. If the ending delimiter is needed within the |
| string, it must be doubled. For example, this pattern has two callout |
| points: |
| |
| (?C1)abc(?C"some ""arbitrary"" text")def |
| |
| If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, |
| PCRE2 automatically inserts callouts, all with number 255, before each |
| item in the pattern except for immediately before or after an explicit |
| callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern |
| |
| A(?C3)B |
| |
| it is processed as if it were |
| |
| (?C255)A(?C3)B(?C255) |
| |
| Here is a more complicated example: |
| |
| A(\d{2}|--) |
| |
| With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were |
| |
| (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255) |
| |
| Notice that there is a callout before and after each parenthesis and |
| alternation bar. If the pattern contains a conditional group whose con- |
| dition is an assertion, an automatic callout is inserted immediately |
| before the condition. Such a callout may also be inserted explicitly, |
| for example: |
| |
| (?(?C9)(?=a)ab|de) (?(?C%text%)(?!=d)ab|de) |
| |
| This applies only to assertion conditions (because they are themselves |
| independent groups). |
| |
| Callouts can be useful for tracking the progress of pattern matching. |
| The pcre2test program has a pattern qualifier (/auto_callout) that sets |
| automatic callouts. When any callouts are present, the output from |
| pcre2test indicates how the pattern is being matched. This is useful |
| information when you are trying to optimize the performance of a par- |
| ticular pattern. |
| |
| |
| MISSING CALLOUTS |
| |
| You should be aware that, because of optimizations in the way PCRE2 |
| compiles and matches patterns, callouts sometimes do not happen exactly |
| as you might expect. |
| |
| Auto-possessification |
| |
| At compile time, PCRE2 "auto-possessifies" repeated items when it knows |
| that what follows cannot be part of the repeat. For example, a+[bc] is |
| compiled as if it were a++[bc]. The pcre2test output when this pattern |
| is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied |
| to the string "aaaa" is: |
| |
| --->aaaa |
| +0 ^ a+ |
| +2 ^ ^ [bc] |
| No match |
| |
| This indicates that when matching [bc] fails, there is no backtracking |
| into a+ (because it is being treated as a++) and therefore the callouts |
| that would be taken for the backtracks do not occur. You can disable |
| the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to |
| pcre2_compile(), or starting the pattern with (*NO_AUTO_POSSESS). In |
| this case, the output changes to this: |
| |
| --->aaaa |
| +0 ^ a+ |
| +2 ^ ^ [bc] |
| +2 ^ ^ [bc] |
| +2 ^ ^ [bc] |
| +2 ^^ [bc] |
| No match |
| |
| This time, when matching [bc] fails, the matcher backtracks into a+ and |
| tries again, repeatedly, until a+ itself fails. |
| |
| Automatic .* anchoring |
| |
| By default, an optimization is applied when .* is the first significant |
| item in a pattern. If PCRE2_DOTALL is set, so that the dot can match |
| any character, the pattern is automatically anchored. If PCRE2_DOTALL |
| is not set, a match can start only after an internal newline or at the |
| beginning of the subject, and pcre2_compile() remembers this. If a pat- |
| tern has more than one top-level branch, automatic anchoring occurs if |
| all branches are anchorable. |
| |
| This optimization is disabled, however, if .* is in an atomic group or |
| if there is a backreference to the capture group in which it appears. |
| It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How- |
| ever, the presence of callouts does not affect it. |
| |
| For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT |
| and applied to the string "aa", the pcre2test output is: |
| |
| --->aa |
| +0 ^ .* |
| +2 ^ ^ \d |
| +2 ^^ \d |
| +2 ^ \d |
| No match |
| |
| This shows that all match attempts start at the beginning of the sub- |
| ject. In other words, the pattern is anchored. You can disable this op- |
| timization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or |
| starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out- |
| put changes to: |
| |
| --->aa |
| +0 ^ .* |
| +2 ^ ^ \d |
| +2 ^^ \d |
| +2 ^ \d |
| +0 ^ .* |
| +2 ^^ \d |
| +2 ^ \d |
| No match |
| |
| This shows more match attempts, starting at the second subject charac- |
| ter. Another optimization, described in the next section, means that |
| there is no subsequent attempt to match with an empty subject. |
| |
| Other optimizations |
| |
| Other optimizations that provide fast "no match" results also affect |
| callouts. For example, if the pattern is |
| |
| ab(?C4)cd |
| |
| PCRE2 knows that any matching string must contain the letter "d". If |
| the subject string is "abyz", the lack of "d" means that matching |
| doesn't ever start, and the callout is never reached. However, with |
| "abyd", though the result is still no match, the callout is obeyed. |
| |
| For most patterns PCRE2 also knows the minimum length of a matching |
| string, and will immediately give a "no match" return without actually |
| running a match if the subject is not long enough, or, for unanchored |
| patterns, if it has been scanned far enough. |
| |
| You can disable these optimizations by passing the PCRE2_NO_START_OPTI- |
| MIZE option to pcre2_compile(), or by starting the pattern with |
| (*NO_START_OPT). This slows down the matching process, but does ensure |
| that callouts such as the example above are obeyed. |
| |
| |
| THE CALLOUT INTERFACE |
| |
| During matching, when PCRE2 reaches a callout point, if an external |
| function is provided in the match context, it is called. This applies |
| to normal, DFA, and JIT matching. The first argument to the callout |
| function is a pointer to a pcre2_callout block. The second argument |
| is the void * callout data that was supplied when the callout was set |
| up by calling pcre2_set_callout() (see the pcre2api documentation). The |
| callout block structure contains the following fields, not necessarily |
| in this order: |
| |
| uint32_t version; |
| uint32_t callout_number; |
| uint32_t capture_top; |
| uint32_t capture_last; |
| uint32_t callout_flags; |
| PCRE2_SIZE *offset_vector; |
| PCRE2_SPTR mark; |
| PCRE2_SPTR subject; |
| PCRE2_SIZE subject_length; |
| PCRE2_SIZE start_match; |
| PCRE2_SIZE current_position; |
| PCRE2_SIZE pattern_position; |
| PCRE2_SIZE next_item_length; |
| PCRE2_SIZE callout_string_offset; |
| PCRE2_SIZE callout_string_length; |
| PCRE2_SPTR callout_string; |
| |
| The version field contains the version number of the block format. The |
| current version is 2; the three callout string fields were added for |
| version 1, and the callout_flags field for version 2. If you are writ- |
| ing an application that might use an earlier release of PCRE2, you |
| should check the version number before accessing any of these fields. |
| The version number will increase in future if more fields are added, |
| but the intention is never to remove any of the existing fields. |
| |
| Fields for numerical callouts |
| |
| For a numerical callout, callout_string is NULL, and callout_number |
| contains the number of the callout, in the range 0-255. This is the |
| number that follows (?C for callouts that are part of the pattern; it is |
| 255 for automatically generated callouts. |
| |
| Fields for string callouts |
| |
| For callouts with string arguments, callout_number is always zero, and |
| callout_string points to the string that is contained within the com- |
| piled pattern. Its length is given by callout_string_length. Duplicated |
| ending delimiters that were present in the original pattern string have |
| been turned into single characters, but there is no other processing of |
| the callout string argument. An additional code unit containing binary |
| zero is present after the string, but is not included in the length. |
| The delimiter that was used to start the string is also stored within |
| the pattern, immediately before the string itself. You can access this |
| delimiter as callout_string[-1] if you need it. |
| |
| The callout_string_offset field is the code unit offset to the start of |
| the callout argument string within the original pattern string. This is |
| provided for the benefit of applications such as script languages that |
| might need to report errors in the callout string within the pattern. |
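| |
| The distinction between the two kinds of callout, and the version |
| check recommended earlier, can be sketched with a stand-in structure. |
| The type mock_callout_block below is NOT the real pcre2_callout_block; |
| it is a hypothetical cut-down copy containing only the fields dis- |
| cussed so far, so that the example compiles without pcre2.h. The func- |
| tion classify_callout is likewise invented for this illustration. |
| |

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a few fields of the callout block. */
typedef struct {
    uint32_t version;
    uint32_t callout_number;
    const unsigned char *callout_string;   /* NULL for numerical callouts */
    size_t callout_string_length;
} mock_callout_block;

typedef enum { CALLOUT_NUMERICAL, CALLOUT_STRING } callout_kind;

/* Reports the kind of callout. For a string callout, the opening
   delimiter (stored immediately before the string) is copied to
   *delimiter when that pointer is not NULL. */
static callout_kind classify_callout(const mock_callout_block *cb,
                                     unsigned char *delimiter)
{
    assert(cb != NULL);
    /* The string fields exist only from block version 1 onwards. */
    if (cb->version < 1 || cb->callout_string == NULL)
        return CALLOUT_NUMERICAL;
    if (delimiter != NULL)
        *delimiter = cb->callout_string[-1];
    return CALLOUT_STRING;
}
```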
| |
| Fields for all callouts |
| |
| The remaining fields in the callout block are the same for both kinds |
| of callout. |
| |
| The offset_vector field is a pointer to a vector of capturing offsets |
| (the "ovector"). You may read the elements in this vector, but you must |
| not change any of them. |
| |
| For calls to pcre2_match(), the offset_vector field is not (since re- |
| lease 10.30) a pointer to the actual ovector that was passed to the |
| matching function in the match data block. Instead it points to an in- |
| ternal ovector of a size large enough to hold all possible captured |
| substrings in the pattern. Note that whenever a recursion or subroutine |
| call within a pattern completes, the capturing state is reset to what |
| it was before. |
| |
| The capture_last field contains the number of the most recently cap- |
| tured substring, and the capture_top field contains one more than the |
| number of the highest numbered captured substring so far. If no sub- |
| strings have yet been captured, the value of capture_last is 0 and the |
| value of capture_top is 1. The values of these fields do not always |
| differ by one; for example, when the callout in the pattern |
| ((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4. |
| |
| The contents of ovector[2] to ovector[<capture_top>*2-1] can be in- |
| spected in order to extract substrings that have been matched so far, |
| in the same way as extracting substrings after a match has completed. |
| The values in ovector[0] and ovector[1] are always PCRE2_UNSET because |
| the match is by definition not complete. Substrings that have not been |
| captured but whose numbers are less than capture_top also have both of |
| their ovector slots set to PCRE2_UNSET. |
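| |
| As a rough illustration of the layout just described, the following |
| sketch returns the length of a group captured so far. It is plain C so |
| that it stands alone: size_t stands in for PCRE2_SIZE, the local UNSET |
| macro stands in for PCRE2_UNSET, and the function name is hypothetical. |
| In a real callout the arguments would come from the offset_vector and |
| capture_top fields of the callout block. |
| |

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for PCRE2_UNSET, which pcre2.h defines as ~(PCRE2_SIZE)0. */
#define UNSET (~(size_t)0)

/* Returns the length of group n as captured so far, or UNSET when n is
   outside 1..capture_top-1 or the group has not been captured.
   Hypothetical helper, not part of the PCRE2 API. */
size_t captured_length(const size_t *ovector, unsigned capture_top,
                       unsigned n)
{
    assert(ovector != NULL);

    /* Pair 0 is always unset during a callout; pairs at or beyond
       capture_top must not be inspected. */
    if (n == 0 || n >= capture_top)
        return UNSET;

    size_t start = ovector[2 * n];
    size_t end = ovector[2 * n + 1];
    if (start == UNSET || end == UNSET)
        return UNSET;               /* group n not set at this point */
    return end - start;
}
```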
| |
| For DFA matching, the offset_vector field points to the ovector that |
| was passed to the matching function in the match data block for call- |
| outs at the top level, but to an internal ovector during the processing |
| of pattern recursions, lookarounds, and atomic groups. However, these |
| ovectors hold no useful information because pcre2_dfa_match() does not |
| support substring capturing. The value of capture_top is always 1 and |
| the value of capture_last is always 0 for DFA matching. |
| |
| The subject and subject_length fields contain copies of the values that |
| were passed to the matching function. |
| |
| The start_match field normally contains the offset within the subject |
| at which the current match attempt started. However, if the escape se- |
| quence \K has been encountered, this value is changed to reflect the |
| modified starting point. If the pattern is not anchored, the callout |
| function may be called several times from the same point in the pattern |
| for different starting points in the subject. |
| |
| The current_position field contains the offset within the subject of |
| the current match pointer. |
| |
| The pattern_position field contains the offset in the pattern string to |
| the next item to be matched. |
| |
| The next_item_length field contains the length of the next item to be |
| processed in the pattern string. When the callout is at the end of the |
| pattern, the length is zero. When the callout precedes an opening |
| parenthesis, the length includes meta characters that follow the paren- |
| thesis. For example, in a callout before an assertion such as (?=ab) |
| the length is 3. For an alternation bar or a closing parenthesis, |
| the length is one, unless a closing parenthesis is followed by a quan- |
| tifier, in which case its length is included. (This changed in release |
| 10.23. In earlier releases, before an opening parenthesis the length |
| was that of the entire group, and before an alternation bar or a clos- |
| ing parenthesis the length was zero.) |
| |
| The pattern_position and next_item_length fields are intended to help |
| in distinguishing between different automatic callouts, which all have |
| the same callout number. However, they are set for all callouts, and |
| are used by pcre2test to show the next item to be matched when display- |
| ing callout information. |
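| |
| An application that wants to display the next item, much as pcre2test |
| does, might combine the two fields as in the sketch below. It is plain |
| C and assumes 8-bit code units; the function name is hypothetical, and |
| the pattern argument is assumed to be the original pattern string that |
| was passed to pcre2_compile(). |
| |

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the next pattern item into buf, which is always left
   NUL-terminated, and returns the number of characters copied. The
   copy is clipped to the pattern and to the buffer. Hypothetical
   helper, not part of the PCRE2 API. */
size_t next_item_text(const char *pattern, size_t pattern_position,
                      size_t next_item_length, char *buf, size_t bufsize)
{
    assert(pattern != NULL && buf != NULL && bufsize > 0);

    size_t patlen = strlen(pattern);
    if (pattern_position > patlen)
        pattern_position = patlen;

    size_t n = next_item_length;
    if (n > patlen - pattern_position)
        n = patlen - pattern_position;   /* do not run off the pattern */
    if (n > bufsize - 1)
        n = bufsize - 1;                 /* leave room for the NUL */

    memcpy(buf, pattern + pattern_position, n);
    buf[n] = '\0';
    return n;
}
```

| For the pattern (?C1)(?=ab)x, the description above implies a pat- |
| tern_position of 5 and a next_item_length of 3, for which this helper |
| yields the text "(?=". |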
| |
| In callouts from pcre2_match() the mark field contains a pointer to the |
| zero-terminated name of the most recently passed (*MARK), (*PRUNE), or |
| (*THEN) item in the match, or NULL if no such items have been passed. |
| Instances of (*PRUNE) or (*THEN) without a name do not obliterate a |
| previous (*MARK). In callouts from the DFA matching function this field |
| always contains NULL. |
| |
| The callout_flags field is always zero in callouts from |
| pcre2_dfa_match() or when JIT is being used. When pcre2_match() without |
| JIT is used, the following bits may be set: |
| |
| PCRE2_CALLOUT_STARTMATCH |
| |
| This is set for the first callout after the start of matching for each |
| new starting position in the subject. |
| |
| PCRE2_CALLOUT_BACKTRACK |
| |
| This is set if there has been a matching backtrack since the previous |
| callout, or since the start of matching if this is the first callout |
| from a pcre2_match() run. |
| |
| Both bits are set when a backtrack has caused a "bumpalong" to a new |
| starting position in the subject. Output from pcre2test does not indi- |
| cate the presence of these bits unless the callout_extra modifier is |
| set. |
| |
| The information in the callout_flags field is provided so that applica- |
| tions can track and tell their users how matching with backtracking is |
| done. This can be useful when trying to optimize patterns, or just to |
| understand how PCRE2 works. There is no support in pcre2_dfa_match() |
| because there is no backtracking in DFA matching, and there is no sup- |
| port in JIT because JIT is all about maximizing matching performance. |
| In both these cases the callout_flags field is always zero. |
| |
| |
| RETURN VALUES FROM CALLOUTS |
| |
| The external callout function returns an integer to PCRE2. If the value |
| is zero, matching proceeds as normal. If the value is greater than |
| zero, matching fails at the current point, but the testing of other |
| matching possibilities goes ahead, just as if a lookahead assertion had |
| failed. If the value is less than zero, the match is abandoned, and the |
| matching function returns the negative value. |
| |
| Negative values should normally be chosen from the set of PCRE2_ER- |
| ROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a standard |
| "no match" failure. The error number PCRE2_ERROR_CALLOUT is reserved |
| for use by callout functions; it will never be used by PCRE2 itself. |
| |
| |
| CALLOUT ENUMERATION |
| |
| int pcre2_callout_enumerate(const pcre2_code *code, |
| int (*callback)(pcre2_callout_enumerate_block *, void *), |
| void *user_data); |
| |
| A script language that supports the use of string arguments in callouts |
| might like to scan all the callouts in a pattern before running the |
| match. This can be done by calling pcre2_callout_enumerate(). The first |
| argument is a pointer to a compiled pattern, the second points to a |
| callback function, and the third is arbitrary user data. The callback |
| function is called for every callout in the pattern in the order in |
| which they appear. Its first argument is a pointer to a callout enumer- |
| ation block, and its second argument is the user_data value that was |
| passed to pcre2_callout_enumerate(). The data block contains the fol- |
| lowing fields: |
| |
| version Block version number |
| pattern_position Offset to next item in pattern |
| next_item_length Length of next item in pattern |
| callout_number Number for numbered callouts |
| callout_string_offset Offset to string within pattern |
| callout_string_length Length of callout string |
| callout_string Points to callout string or is NULL |
| |
| The version number is currently 0. It will increase if new fields are |
| ever added to the block. The remaining fields are the same as their |
| namesakes in the pcre2_callout block that is used for callouts during |
| matching, as described above. |
| |
| Note that the value of pattern_position is unique for each callout. |
| However, if a callout occurs inside a group that is quantified with a |
| non-zero minimum or a fixed maximum, the group is replicated inside the |
| compiled pattern. For example, a pattern such as /(a){2}/ is compiled |
| as if it were /(a)(a)/. This means that the callout will be enumerated |
| more than once, but with the same value for pattern_position in each |
| case. |
| |
| The callback function should normally return zero. If it returns a non- |
| zero value, scanning the pattern stops, and that value is returned from |
| pcre2_callout_enumerate(). |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 03 February 2019 |
| Copyright (c) 1997-2019 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2COMPAT(3) Library Functions Manual PCRE2COMPAT(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| DIFFERENCES BETWEEN PCRE2 AND PERL |
| |
| This document describes some of the differences in the ways that PCRE2 |
| and Perl handle regular expressions. The differences described here are |
| with respect to Perl version 5.34.0, but as both Perl and PCRE2 are |
| continually changing, the information may at times be out of date. |
| |
| 1. When PCRE2_DOTALL (equivalent to Perl's /s qualifier) is not set, |
| the behaviour of the '.' metacharacter differs from Perl. In PCRE2, '.' |
| matches the next character unless it is the start of a newline se- |
| quence. This means that, if the newline setting is CR, CRLF, or NUL, |
| '.' will match the code point LF (0x0A) in ASCII/Unicode environments, |
| and NL (either 0x15 or 0x25) when using EBCDIC. In Perl, '.' appears |
| never to match LF, even when 0x0A is not a newline indicator. |
| |
| 2. PCRE2 has only a subset of Perl's Unicode support. Details of what |
| it does have are given in the pcre2unicode page. |
| |
| 3. Like Perl, PCRE2 allows repeat quantifiers on parenthesized asser- |
| tions, but they do not mean what you might think. For example, (?!a){3} |
| does not assert that the next three characters are not "a". It just as- |
| serts that the next character is not "a" three times (in principle; |
| PCRE2 optimizes this to run the assertion just once). Perl allows some |
| repeat quantifiers on other assertions, for example, \b* , but these do |
| not seem to have any use. PCRE2 does not allow any kind of quantifier |
| on non-lookaround assertions. |
| |
| 4. Capture groups that occur inside negative lookaround assertions are |
| counted, but their entries in the offsets vector are set only when a |
| negative assertion is a condition that has a matching branch (that is, |
| the condition is false). Perl may set such capture groups in other |
| circumstances. |
| |
| 5. The following Perl escape sequences are not supported: \F, \l, \L, |
| \u, \U, and \N when followed by a character name. \N on its own, match- |
| ing a non-newline character, and \N{U+dd..}, matching a Unicode code |
| point, are supported. The escapes that modify the case of following |
| letters are implemented by Perl's general string-handling and are not |
| part of its pattern matching engine. If any of these are encountered by |
| PCRE2, an error is generated by default. However, if either of the |
| PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U and \u are |
| interpreted as ECMAScript interprets them. |
| |
| 6. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 |
| is built with Unicode support (the default). The properties that can be |
| tested with \p and \P are limited to the general category properties |
| such as Lu and Nd, script names such as Greek or Han, Bidi_Class, |
| Bidi_Control, and the derived properties Any and LC (synonym L&). Both |
| PCRE2 and Perl support the Cs (surrogate) property, but in PCRE2 its |
| use is limited. See the pcre2pattern documentation for details. The |
| long synonyms for property names that Perl supports (such as \p{Let- |
| ter}) are not supported by PCRE2, nor is it permitted to prefix any of |
| these properties with "Is". |
| |
| 7. PCRE2 supports the \Q...\E escape for quoting substrings. Characters |
| in between are treated as literals. However, this is slightly different |
| from Perl in that $ and @ are also handled as literals inside the |
| quotes. In Perl, they cause variable interpolation (PCRE2 does not have |
| variables). Also, Perl does "double-quotish backslash interpolation" on |
| any backslashes between \Q and \E which, its documentation says, "may |
| lead to confusing results". PCRE2 treats a backslash between \Q and \E |
| just like any other character. Note the following examples: |
| |
| Pattern PCRE2 matches Perl matches |
| |
| \Qabc$xyz\E abc$xyz abc followed by the |
| contents of $xyz |
| \Qabc\$xyz\E abc\$xyz abc\$xyz |
| \Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
| \QA\B\E A\B A\B |
| \Q\\E \ \\E |
| |
| The \Q...\E sequence is recognized both inside and outside character |
| classes by both PCRE2 and Perl. |
| |
| 8. Fairly obviously, PCRE2 does not support the (?{code}) and |
| (??{code}) constructions. However, PCRE2 does have a "callout" feature, |
| which allows an external function to be called during pattern matching. |
| See the pcre2callout documentation for details. |
| |
| 9. Subroutine calls (whether recursive or not) were treated as atomic |
| groups up to PCRE2 release 10.23, but from release 10.30 this changed, |
| and backtracking into subroutine calls is now supported, as in Perl. |
| |
| 10. In PCRE2, if any of the backtracking control verbs are used in a |
| group that is called as a subroutine (whether or not recursively), |
| their effect is confined to that group; it does not extend to the sur- |
| rounding pattern. This is not always the case in Perl. In particular, |
| if (*THEN) is present in a group that is called as a subroutine, its |
| action is limited to that group, even if the group does not contain any |
| | characters. Note that such groups are processed as anchored at the |
| point where they are tested. |
| |
| 11. If a pattern contains more than one backtracking control verb, the |
| first one that is backtracked onto acts. For example, in the pattern |
| A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure |
| in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases |
| it is the same as PCRE2, but there are cases where it differs. |
| |
| 12. There are some differences that are concerned with the settings of |
| captured strings when part of a pattern is repeated. For example, |
| matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 un- |
| set, but in PCRE2 it is set to "b". |
| |
| 13. PCRE2's handling of duplicate capture group numbers and names is |
| not as general as Perl's. This is a consequence of the fact that PCRE2 |
| works internally just with numbers, using an external table to trans- |
| late between numbers and names. In particular, a pattern such as |
| (?|(?<a>A)|(?<b>B)), where the two capture groups have the same number |
| but different names, is not supported, and causes an error at compile |
| time. If it were allowed, it would not be possible to distinguish which |
| group matched, because both names map to capture group number 1. To |
| avoid this confusing situation, an error is given at compile time. |
| |
| 14. Perl used to recognize comments in some places that PCRE2 does not, |
| for example, between the ( and ? at the start of a group. If the /x |
| modifier is set, Perl allowed white space between ( and ? though the |
| latest Perls give an error (for a while it was just deprecated). There |
| may still be some cases where Perl behaves differently. |
| |
| 15. Perl, when in warning mode, gives warnings for character classes |
| such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter- |
| als. PCRE2 has no warning features, so it gives an error in these cases |
| because they are almost certainly user mistakes. |
| |
| 16. In PCRE2, the upper/lower case character properties Lu and Ll are |
| not affected when case-independent matching is specified. For example, |
| \p{Lu} always matches an upper case letter. I think Perl has changed in |
| this respect; in the release at the time of writing (5.34), \p{Lu} and |
| \p{Ll} match all letters, regardless of case, when case independence is |
| specified. |
| |
| 17. From release 5.32.0, Perl locks out the use of \K in lookaround as- |
| sertions. From release 10.38 PCRE2 does the same by default. However, |
| there is an option for re-enabling the previous behaviour. When this |
| option is set, \K is acted on when it occurs in positive assertions, |
| but is ignored in negative assertions. |
| |
| 18. PCRE2 provides some extensions to the Perl regular expression fa- |
| cilities. Perl 5.10 included new features that were not in earlier |
| versions of Perl, some of which (such as named parentheses) were in |
| PCRE2 for some time before. This list is with respect to Perl 5.34: |
| |
| (a) Although lookbehind assertions in PCRE2 must match fixed length |
| strings, each alternative toplevel branch of a lookbehind assertion can |
| match a different length of string. Perl used to require them all to |
| have the same length, but the latest version has some variable length |
| support. |
| |
| (b) From PCRE2 10.23, backreferences to groups of fixed length are sup- |
| ported in lookbehinds, provided that there is no possibility of refer- |
| encing a non-unique number or name. Perl does not support backrefer- |
| ences in lookbehinds. |
| |
| (c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the |
| $ meta-character matches only at the very end of the string. |
| |
| (d) A backslash followed by a letter with no special meaning is |
| faulted. (Perl can be made to issue a warning.) |
| |
| (e) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti- |
| fiers is inverted, that is, by default they are not greedy, but if fol- |
| lowed by a question mark they are. |
| |
| (f) PCRE2_ANCHORED can be used at matching time to force a pattern to |
| be tried only at the first matching position in the subject string. |
| |
| (g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY and |
| PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents. |
| |
| (h) The \R escape sequence can be restricted to match only CR, LF, or |
| CRLF by the PCRE2_BSR_ANYCRLF option. |
| |
| (i) The callout facility is PCRE2-specific. Perl supports codeblocks |
| and variable interpolation, but not general hooks on every match. |
| |
| (j) The partial matching facility is PCRE2-specific. |
| |
| (k) The alternative matching function (pcre2_dfa_match()) matches in a |
| different way and is not Perl-compatible. |
| |
| (l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) |
| at the start of a pattern. These set overall options that cannot be |
| changed within the pattern. |
| |
| (m) PCRE2 supports non-atomic positive lookaround assertions. This is |
| an extension to the lookaround facilities. The default, Perl-compatible |
| lookarounds are atomic. |
| |
| 19. The Perl /a modifier restricts \d numbers to pure ASCII, and the |
| /aa modifier restricts /i case-insensitive matching to pure ASCII, ig- |
| noring Unicode rules. This separation cannot be represented with |
| PCRE2_UCP. |
| |
| 20. Perl has different limits than PCRE2. See the pcre2limits documen- |
| tation for details. In release 5.10, Perl switched from recursion to |
| iteration, keeping the intermediate matches on the heap, which is ~10% |
| slower but does not run into any stack-overflow limit. PCRE2 made a |
| similar change at release 10.30, and also has many build-time and run- |
| time customizable limits. |
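| |
| The \Q...\E behaviour described in item 7 suggests one way for an ap- |
| plication to turn arbitrary text into a literal pattern fragment. The |
| sketch below is plain C with a hypothetical function name. It relies on |
| PCRE2 treating $, @ and backslash as ordinary characters inside the |
| quotes, so it should not be assumed to work for Perl. An embedded \E is |
| handled by closing the quote, adding an escaped backslash and an E, and |
| reopening the quote. |
| |

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated pattern fragment that matches `literal`
   exactly under PCRE2's \Q...\E rules, or NULL if allocation fails.
   The caller frees the result. Hypothetical helper. */
char *quote_literal(const char *literal)
{
    static const char splice[] = "\\E\\\\E\\Q";  /* the text \E\\E\Q */
    const size_t splice_len = sizeof splice - 1;
    assert(literal != NULL);

    /* Count the embedded \E sequences to size the output buffer. */
    size_t hits = 0;
    for (const char *p = strstr(literal, "\\E"); p != NULL;
         p = strstr(p + 2, "\\E"))
        hits++;

    /* Each 2-character \E grows to splice_len characters; add 4 for
       the enclosing \Q and \E, and 1 for the terminating NUL. */
    char *out = malloc(strlen(literal) + hits * (splice_len - 2) + 5);
    if (out == NULL)
        return NULL;

    char *w = out;
    memcpy(w, "\\Q", 2);
    w += 2;
    for (const char *r = literal; *r != '\0'; ) {
        if (r[0] == '\\' && r[1] == 'E') {
            memcpy(w, splice, splice_len);  /* close, add \E, reopen */
            w += splice_len;
            r += 2;
        } else {
            *w++ = *r++;
        }
    }
    memcpy(w, "\\E", 2);
    w += 2;
    *w = '\0';
    return out;
}
```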
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| Retired from University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 08 December 2021 |
| Copyright (c) 1997-2021 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2JIT(3) Library Functions Manual PCRE2JIT(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| PCRE2 JUST-IN-TIME COMPILER SUPPORT |
| |
| Just-in-time compiling is a heavyweight optimization that can greatly |
| speed up pattern matching. However, it comes at the cost of extra pro- |
| cessing before the match is performed, so it is of most benefit when |
| the same pattern is going to be matched many times. This does not nec- |
| essarily mean many calls of a matching function; if the pattern is not |
| anchored, matching attempts may take place many times at various posi- |
| tions in the subject, even for a single call. Therefore, if the subject |
| string is very long, it may still pay to use JIT even for one-off |
| matches. JIT support is available for all of the 8-bit, 16-bit and |
| 32-bit PCRE2 libraries. |
| |
| JIT support applies only to the traditional Perl-compatible matching |
| function. It does not apply when the DFA matching function is being |
| used. The code for this support was written by Zoltan Herczeg. |
| |
| |
| AVAILABILITY OF JIT SUPPORT |
| |
| JIT support is an optional feature of PCRE2. The "configure" option |
| --enable-jit (or equivalent CMake option) must be set when PCRE2 is |
| built if you want to use JIT. The support is limited to the following |
| hardware platforms: |
| |
| ARM 32-bit (v5, v7, and Thumb2) |
| ARM 64-bit |
| IBM s390x 64 bit |
| Intel x86 32-bit and 64-bit |
| MIPS 32-bit and 64-bit |
| Power PC 32-bit and 64-bit |
| SPARC 32-bit |
| |
| If --enable-jit is set on an unsupported platform, compilation fails. |
| |
| A program can tell if JIT support is available by calling pcre2_con- |
| fig() with the PCRE2_CONFIG_JIT option. The result is 1 when JIT is |
| available, and 0 otherwise. However, a simple program does not need to |
| check this in order to use JIT. The API is implemented in a way that |
| falls back to the interpretive code if JIT is not available. For pro- |
| grams that need the best possible performance, there is also a "fast |
| path" API that is JIT-specific. |
| |
| |
| SIMPLE USE OF JIT |
| |
| To make use of the JIT support in the simplest way, all you have to do |
| is to call pcre2_jit_compile() after successfully compiling a pattern |
| with pcre2_compile(). This function has two arguments: the first is the |
| compiled pattern pointer that was returned by pcre2_compile(), and the |
| second is zero or more of the following option bits: PCRE2_JIT_COM- |
| PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT. |
| |
| If JIT support is not available, a call to pcre2_jit_compile() does |
| nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled |
| pattern is passed to the JIT compiler, which turns it into machine code |
| that executes much faster than the normal interpretive code, but yields |
| exactly the same results. The returned value from pcre2_jit_compile() |
| is zero on success, or a negative error code. |
| |
| There is a limit to the size of pattern that JIT supports, imposed by |
| the size of machine stack that it uses. The exact rules are not docu- |
| mented because they may change at any time, in particular, when new op- |
| timizations are introduced. If a pattern is too big, a call to |
| pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY. |
| |
| PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for com- |
| plete matches. If you want to run partial matches using the PCRE2_PAR- |
| TIAL_HARD or PCRE2_PARTIAL_SOFT options of pcre2_match(), you should |
| set one or both of the other options as well as, or instead of, |
| PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code |
| for each of the three modes (normal, soft partial, hard partial). When |
| pcre2_match() is called, the appropriate code is run if it is avail- |
| able. Otherwise, the pattern is matched using interpretive code. |
| |
| You can call pcre2_jit_compile() multiple times for the same compiled |
| pattern. It does nothing if it has previously compiled code for any of |
| the option bits. For example, you can call it once with PCRE2_JIT_COM- |
| PLETE and (perhaps later, when you find you need partial matching) |
| again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it |
| will ignore PCRE2_JIT_COMPLETE and just compile code for partial match- |
| ing. If pcre2_jit_compile() is called with no option bits set, it imme- |
| diately returns zero. This is an alternative way of testing whether JIT |
| is available. |
| |
| At present, it is not possible to free JIT compiled code except when |
| the entire compiled pattern is freed by calling pcre2_code_free(). |
| |
| In some circumstances you may need to call additional functions. These |
| are described in the section entitled "Controlling the JIT stack" be- |
| low. |
| |
| There are some pcre2_match() options that are not supported by JIT, and |
| there are also some pattern items that JIT cannot handle. Details are |
| given below. In both cases, matching automatically falls back to the |
| interpretive code. If you want to know whether JIT was actually used |
| for a particular match, you should arrange for a JIT callback function |
| to be set up as described in the section entitled "Controlling the JIT |
| stack" below, even if you do not need to supply a non-default JIT |
| stack. Such a callback function is called whenever JIT code is about to |
| be obeyed. If the match-time options are not right for JIT execution, |
| the callback function is not obeyed. |
| |
| If the JIT compiler finds an unsupported item, no JIT data is gener- |
| ated. You can find out if JIT matching is available after compiling a |
| pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE op- |
| tion. A non-zero result means that JIT compilation was successful. A |
| result of 0 means that JIT support is not available, or the pattern was |
| not processed by pcre2_jit_compile(), or the JIT compiler was not able |
| to handle the pattern. |
| |
| |
| MATCHING SUBJECTS CONTAINING INVALID UTF |
| |
| When a pattern is compiled with the PCRE2_UTF option, subject strings |
| are normally expected to be a valid sequence of UTF code units. By de- |
| fault, this is checked at the start of matching and an error is gener- |
| ated if invalid UTF is detected. The PCRE2_NO_UTF_CHECK option can be |
| passed to pcre2_match() to skip the check (for improved performance) if |
| you are sure that a subject string is valid. If this option is used |
| with an invalid string, the result is undefined. |
| |
| However, a way of running matches on strings that may contain invalid |
| UTF sequences is available. Calling pcre2_compile() with the |
| PCRE2_MATCH_INVALID_UTF option has two effects: it tells the inter- |
| preter in pcre2_match() to support invalid UTF, and, if pcre2_jit_com- |
| pile() is called, the compiled JIT code also supports invalid UTF. De- |
| tails of how this support works, in both the JIT and the interpretive |
| cases, are given in the pcre2unicode documentation. |
| |
| There is also an obsolete option for pcre2_jit_compile() called |
| PCRE2_JIT_INVALID_UTF, which currently exists only for backward compat- |
| ibility. It is superseded by the pcre2_compile() option |
| PCRE2_MATCH_INVALID_UTF and should no longer be used. It may be removed |
| in future. |
| |
| |
| UNSUPPORTED OPTIONS AND PATTERN ITEMS |
| |
| The pcre2_match() options that are supported for JIT matching are |
| PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, |
| PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and |
| PCRE2_PARTIAL_SOFT. The PCRE2_ANCHORED and PCRE2_ENDANCHORED options |
| are not supported at match time. |
| |
| If the PCRE2_NO_JIT option is passed to pcre2_match() it disables the |
| use of JIT, forcing matching by the interpreter code. |
| |
| The only unsupported pattern items are \C (match a single data unit) |
| when running in a UTF mode, and a callout immediately before an asser- |
| tion condition in a conditional group. |
| |
| |
| RETURN VALUES FROM JIT MATCHING |
| |
| When a pattern is matched using JIT matching, the return values are the |
| same as those given by the interpretive pcre2_match() code, with the |
| addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means |
| that the memory used for the JIT stack was insufficient. See "Control- |
| ling the JIT stack" below for a discussion of JIT stack usage. |
| |
| The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if |
| searching a very large pattern tree goes on for too long, as it is in |
| the same circumstance when JIT is not used, but the details of exactly |
| what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code |
| is never returned when JIT matching is used. |
| |
| |
| CONTROLLING THE JIT STACK |
| |
| When the compiled JIT code runs, it needs a block of memory to use as a |
| stack. By default, it uses 32KiB on the machine stack. However, some |
| large or complicated patterns need more than this. The error PCRE2_ER- |
| ROR_JIT_STACKLIMIT is given when there is not enough stack. Three func- |
| tions are provided for managing blocks of memory for use as JIT stacks. |
| There is further discussion about the use of JIT stacks in the section |
| entitled "JIT stack FAQ" below. |
| |
| The pcre2_jit_stack_create() function creates a JIT stack. Its argu- |
| ments are a starting size, a maximum size, and a general context (for |
| memory allocation functions, or NULL for standard memory allocation). |
| It returns a pointer to an opaque structure of type pcre2_jit_stack, or |
| NULL if there is an error. The pcre2_jit_stack_free() function is used |
| to free a stack that is no longer needed. If its argument is NULL, this |
| function returns immediately, without doing anything. (For the techni- |
| cally minded: the address space is allocated by mmap or VirtualAlloc.) |
| A maximum stack size of 512KiB to 1MiB should be more than enough for |
| any pattern. |
| |
| The pcre2_jit_stack_assign() function specifies which stack JIT code |
| should use. Its arguments are as follows: |
| |
| pcre2_match_context *mcontext |
| pcre2_jit_callback callback |
| void *data |
| |
| The first argument is a pointer to a match context. When this is subse- |
| quently passed to a matching function, its information determines which |
| JIT stack is used. If this argument is NULL, the function returns imme- |
| diately, without doing anything. There are three cases for the values |
| of the other two arguments: |
| |
| (1) If callback is NULL and data is NULL, an internal 32KiB block |
| on the machine stack is used. This is the default when a match |
| context is created. |
| |
| (2) If callback is NULL and data is not NULL, data must be |
| a pointer to a valid JIT stack, the result of calling |
| pcre2_jit_stack_create(). |
| |
| (3) If callback is not NULL, it must point to a function that is |
| called with data as an argument at the start of matching, in |
| order to set up a JIT stack. If the return from the callback |
| function is NULL, the internal 32KiB stack is used; otherwise the |
| return value must be a valid JIT stack, the result of calling |
| pcre2_jit_stack_create(). |
| |
| A callback function is obeyed whenever JIT code is about to be run; it |
| is not obeyed when pcre2_match() is called with options that are incom- |
| patible for JIT matching. A callback function can therefore be used to |
| determine whether a match operation was executed by JIT or by the in- |
| terpreter. |
| |
| You may safely use the same JIT stack for more than one pattern (either |
| by assigning directly or by callback), as long as the patterns are |
| matched sequentially in the same thread. Currently, the only way to set |
| up non-sequential matches in one thread is to use callouts: if a call- |
| out function starts another match, that match must use a different JIT |
| stack to the one used for currently suspended match(es). |
| |
| In a multithread application, if you do not specify a JIT stack, or if |
| you assign or pass back NULL from a callback, that is thread-safe, be- |
| cause each thread has its own machine stack. However, if you assign or |
| pass back a non-NULL JIT stack, this must be a different stack for each |
| thread so that the application is thread-safe. |
| |
| Strictly speaking, even more is allowed. You can assign the same non- |
| NULL stack to a match context that is used by any number of patterns, |
| as long as they are not used for matching by multiple threads at the |
| same time. For example, you could use the same stack in all compiled |
| patterns, with a global mutex in the callback to wait until the stack |
| is available for use. However, this is an inefficient solution, and not |
| recommended. |
| |
| This is a suggestion for how a multithreaded program that needs to set |
| up non-default JIT stacks might operate: |
| |
| During thread initialization |
| thread_local_var = pcre2_jit_stack_create(...) |
| |
| During thread exit |
| pcre2_jit_stack_free(thread_local_var) |
| |
| Use a one-line callback function |
| return thread_local_var |
| |
| All the functions described in this section do nothing if JIT is not |
| available. |
| |
| |
| JIT STACK FAQ |
| |
| (1) Why do we need JIT stacks? |
| |
| PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack |
| on which the local data of the current node is pushed before its child |
| nodes are checked. Allocating real machine stack is difficult on some |
| platforms. For example, on PowerPC the stack chain needs to be updated |
| every time the stack is extended. Although this is possible, the |
| overhead of the updates decreases performance. So we do the recursion |
| in memory. |
| |
| (2) Why don't we simply allocate blocks of memory with malloc()? |
| |
| Modern operating systems have a nice feature: they can reserve an ad- |
| dress space instead of allocating memory. We can safely allocate memory |
| pages inside this address space, so the stack could grow without moving |
| memory data (this is important because of pointers). Thus we can allo- |
| cate 1MiB address space, and use only a single memory page (usually |
| 4KiB) if that is enough. However, we can still grow up to 1MiB anytime |
| if needed. |
| |
| (3) Who "owns" a JIT stack? |
| |
| The owner of the stack is the user program, not the JIT-compiled pattern |
| or anything else. The user program must ensure that while a stack is |
| being used by pcre2_match() (that is, while it is assigned to a match |
| context that is passed to a match currently running), it is not used by |
| any other thread (to avoid overwriting the same memory area). |
| The best practice for multithreaded programs is to allocate a stack for |
| each thread, and return this stack through the JIT callback function. |
| |
| (4) When should a JIT stack be freed? |
| |
| You can free a JIT stack at any time, as long as it will not be used by |
| pcre2_match() again. When you assign the stack to a match context, only |
| a pointer is set. There is no reference counting or any other magic. |
| You can free compiled patterns, contexts, and stacks in any order, any- |
| time. Just do not call pcre2_match() with a match context pointing to |
| an already freed stack, as that will cause a SEGFAULT. (Also, do not free |
| a stack currently used by pcre2_match() in another thread). You can |
| also replace the stack in a context at any time when it is not in use. |
| You should free the previous stack before assigning a replacement. |
| |
| (5) Should I allocate/free a stack every time before/after calling |
| pcre2_match()? |
| |
| No, because this is too costly in terms of resources. However, you |
| could implement some clever scheme that releases the stack if it has |
| not been used for, let's say, two minutes. The JIT callback can help to |
| achieve this without keeping a list of patterns. |
| |
| (6) OK, the stack is for long term memory allocation. But what happens |
| if a pattern causes stack overflow with a stack of 1MiB? Is that 1MiB |
| kept until the stack is freed? |
| |
| Especially on embedded systems, it might be a good idea to release |
| memory sometimes without freeing the stack. There is no API for this at |
| the moment. Probably a function call which returns the currently |
| allocated memory for any stack and another which allows releasing |
| memory (shrinking the stack) would be a good idea if someone needs this. |
| |
| (7) This is too much of a headache. Isn't there any better solution for |
| JIT stack handling? |
| |
| No, thanks to Windows. If POSIX threads were used everywhere, we could |
| throw out this complicated API. |
| |
| |
| FREEING JIT SPECULATIVE MEMORY |
| |
| void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext); |
| |
| The JIT executable allocator does not free all memory when it is |
| possible to do so. It expects new allocations, and keeps some free |
| memory around to improve allocation speed. However, in low memory |
| conditions, it might be better to free all possible memory. You can |
| cause this to happen by calling pcre2_jit_free_unused_memory(). Its |
| argument is a general context, for custom memory management, or NULL |
| for standard memory management. |
| |
| |
| EXAMPLE CODE |
| |
| This is a single-threaded example that specifies a JIT stack without |
| using a callback. A real program should include error checking after |
| all the function calls. |
| |
| int rc; |
| int errornumber; |
| PCRE2_SIZE erroffset; |
| pcre2_code *re; |
| pcre2_match_data *match_data; |
| pcre2_match_context *mcontext; |
| pcre2_jit_stack *jit_stack; |
| |
| re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0, |
| &errornumber, &erroffset, NULL); |
| rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE); |
| mcontext = pcre2_match_context_create(NULL); |
| jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL); |
| pcre2_jit_stack_assign(mcontext, NULL, jit_stack); |
| match_data = pcre2_match_data_create(10, NULL); |
| rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext); |
| /* Process result */ |
| |
| pcre2_code_free(re); |
| pcre2_match_data_free(match_data); |
| pcre2_match_context_free(mcontext); |
| pcre2_jit_stack_free(jit_stack); |
| |
| |
| JIT FAST PATH API |
| |
| Because the API described above falls back to interpreted matching when |
| JIT is not available, it is convenient for programs that are written |
| for general use in many environments. However, calling JIT via |
| pcre2_match() does have a performance impact. Programs that are written |
| for use where JIT is known to be available, and which need the best |
| possible performance, can instead use a "fast path" API to call JIT |
| matching directly instead of calling pcre2_match() (obviously only for |
| patterns that have been successfully processed by pcre2_jit_compile()). |
| |
| The fast path function is called pcre2_jit_match(), and it takes ex- |
| actly the same arguments as pcre2_match(). However, the subject string |
| must be specified with a length; PCRE2_ZERO_TERMINATED is not sup- |
| ported. Unsupported option bits (for example, PCRE2_ANCHORED, PCRE2_EN- |
| DANCHORED and PCRE2_COPY_MATCHED_SUBJECT) are ignored, as is the |
| PCRE2_NO_JIT option. The return values are also the same as for |
| pcre2_match(), plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (par- |
| tial or complete) is requested that was not compiled. |
| |
| When you call pcre2_match(), as well as testing for invalid options, a |
| number of other sanity checks are performed on the arguments. For exam- |
| ple, if the subject pointer is NULL but the length is non-zero, an im- |
| mediate error is given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF |
| subject string is tested for validity. In the interests of speed, these |
| checks do not happen on the JIT fast path, and if invalid data is |
| passed, the result is undefined. |
| |
| Bypassing the sanity checks and the pcre2_match() wrapping can give |
| speedups of more than 10%. |
| |
| |
| SEE ALSO |
| |
| pcre2api(3) |
| |
| |
| AUTHOR |
| |
| Philip Hazel (FAQ by Zoltan Herczeg) |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 30 November 2021 |
| Copyright (c) 1997-2021 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2LIMITS(3) Library Functions Manual PCRE2LIMITS(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| SIZE AND OTHER LIMITATIONS |
| |
| There are some size limitations in PCRE2 but it is hoped that they will |
| never in practice be relevant. |
| |
| The maximum size of a compiled pattern is approximately 64 thousand |
| code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with |
| the default internal linkage size, which is 2 bytes for these li- |
| braries. If you want to process regular expressions that are truly |
| enormous, you can compile PCRE2 with an internal linkage size of 3 or 4 |
| (when building the 16-bit library, 3 is rounded up to 4). See the |
| README file in the source distribution and the pcre2build documentation |
| for details. In these cases the limit is substantially larger. How- |
| ever, the speed of execution is slower. In the 32-bit library, the in- |
| ternal linkage size is always 4. |
| |
| The maximum length of a source pattern string is essentially unlimited; |
| it is the largest number a PCRE2_SIZE variable can hold. However, the |
| program that calls pcre2_compile() can specify a smaller limit. |
| |
| The maximum length (in code units) of a subject string is one less than |
| the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an un- |
| signed integer type, usually defined as size_t. Its maximum value (that |
| is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-termi- |
| nated strings and unset offsets. |
| |
| All values in repeating quantifiers must be less than 65536. |
| |
| The maximum length of a lookbehind assertion is 65535 characters. |
| |
| There is no limit to the number of parenthesized groups, but there can |
| be no more than 65535 capture groups, and there is a limit to the depth |
| of nesting of parenthesized subpatterns of all kinds. This is imposed |
| in order to limit the amount of system stack used at compile time. The |
| default limit can be specified when PCRE2 is built; if not, the default |
| is set to 250. An application can change this limit by calling |
| pcre2_set_parens_nest_limit() to set the limit in a compile context. |
| |
| The maximum length of a name for a named capture group is 32 code units, |
| and the maximum number of such groups is 10000. |
| |
| The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or |
| (*THEN) verb is 255 code units for the 8-bit library and 65535 code |
| units for the 16-bit and 32-bit libraries. |
| |
| The maximum length of a string argument to a callout is the largest |
| number a 32-bit unsigned integer can hold. |
| |
| The maximum amount of heap memory used for matching is controlled by |
| the heap limit, which can be set in a pattern or in a match context. |
| The default is a very large number, effectively unlimited. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| Retired from University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 26 July 2022 |
| Copyright (c) 1997-2022 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2MATCHING(3) Library Functions Manual PCRE2MATCHING(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| PCRE2 MATCHING ALGORITHMS |
| |
| This document describes the two different algorithms that are available |
| in PCRE2 for matching a compiled regular expression against a given |
| subject string. The "standard" algorithm is the one provided by the |
| pcre2_match() function. This works in the same way as Perl's matching |
| function, and provides a Perl-compatible matching operation. The just- |
| in-time (JIT) optimization that is described in the pcre2jit documenta- |
| tion is compatible with this function. |
| |
| An alternative algorithm is provided by the pcre2_dfa_match() function; |
| it operates in a different way, and is not Perl-compatible. This alter- |
| native has advantages and disadvantages compared with the standard al- |
| gorithm, and these are described below. |
| |
| When there is only one possible way in which a given subject string can |
| match a pattern, the two algorithms give the same answer. A difference |
| arises, however, when there are multiple possibilities. For example, if |
| the pattern |
| |
| ^<.*> |
| |
| is matched against the string |
| |
| <something> <something else> <something further> |
| |
| there are three possible answers. The standard algorithm finds only one |
| of them, whereas the alternative algorithm finds all three. |
| |
| |
| REGULAR EXPRESSIONS AS TREES |
| |
| The set of strings that are matched by a regular expression can be rep- |
| resented as a tree structure. An unlimited repetition in the pattern |
| makes the tree of infinite size, but it is still a tree. Matching the |
| pattern to a given subject string (from a given starting point) can be |
| thought of as a search of the tree. There are two ways to search a |
| tree: depth-first and breadth-first, and these correspond to the two |
| matching algorithms provided by PCRE2. |
| |
| |
| THE STANDARD MATCHING ALGORITHM |
| |
| In the terminology of Jeffrey Friedl's book "Mastering Regular Expres- |
| sions", the standard algorithm is an "NFA algorithm". It conducts a |
| depth-first search of the pattern tree. That is, it proceeds along a |
| single path through the tree, checking that the subject matches what is |
| required. When there is a mismatch, the algorithm tries any alterna- |
| tives at the current point, and if they all fail, it backs up to the |
| previous branch point in the tree, and tries the next alternative |
| branch at that level. This often involves backing up (moving to the |
| left) in the subject string as well. The order in which repetition |
| branches are tried is controlled by the greedy or ungreedy nature of |
| the quantifier. |
| |
| If a leaf node is reached, a matching string has been found, and at |
| that point the algorithm stops. Thus, if there is more than one possi- |
| ble match, this algorithm returns the first one that it finds. Whether |
| this is the shortest, the longest, or some intermediate length depends |
| on the way the alternations and the greedy or ungreedy repetition quan- |
| tifiers are specified in the pattern. |
| |
| Because it ends up with a single path through the tree, it is rela- |
| tively straightforward for this algorithm to keep track of the sub- |
| strings that are matched by portions of the pattern in parentheses. |
| This provides support for capturing parentheses and backreferences. |
| |
| |
| THE ALTERNATIVE MATCHING ALGORITHM |
| |
| This algorithm conducts a breadth-first search of the tree. Starting |
| from the first matching point in the subject, it scans the subject |
| string from left to right, once, character by character, and as it does |
| this, it remembers all the paths through the tree that represent valid |
| matches. In Friedl's terminology, this is a kind of "DFA algorithm", |
| though it is not implemented as a traditional finite state machine (it |
| keeps multiple states active simultaneously). |
| |
| Although the general principle of this matching algorithm is that it |
| scans the subject string only once, without backtracking, there is one |
| exception: when a lookaround assertion is encountered, the characters |
| following or preceding the current point have to be independently in- |
| spected. |
| |
| The scan continues until either the end of the subject is reached, or |
| there are no more unterminated paths. At this point, terminated paths |
| represent the different matching possibilities (if there are none, the |
| match has failed). Thus, if there is more than one possible match, |
| this algorithm finds all of them, and in particular, it finds the long- |
| est. The matches are returned in the output vector in decreasing order |
| of length. There is an option to stop the algorithm after the first |
| match (which is necessarily the shortest) is found. |
| |
| Note that the size of vector needed to contain all the results depends |
| on the number of simultaneous matches, not on the number of parentheses |
| in the pattern. Using pcre2_match_data_create_from_pattern() to create |
| the match data block is therefore not advisable when doing DFA match- |
| ing. |
| |
| Note also that all the matches that are found start at the same point |
| in the subject. If the pattern |
| |
| cat(er(pillar)?)? |
| |
| is matched against the string "the caterpillar catchment", the result |
| is the three strings "caterpillar", "cater", and "cat" that start at |
| the fifth character of the subject. The algorithm does not automati- |
| cally move on to find matches that start at later positions. |
| |
| PCRE2's "auto-possessification" optimization usually applies to charac- |
| ter repeats at the end of a pattern (as well as internally). For exam- |
| ple, the pattern "a\d+" is compiled as if it were "a\d++" because there |
| is no point even considering the possibility of backtracking into the |
| repeated digits. For DFA matching, this means that only one possible |
| match is found. If you really do want multiple matches in such cases, |
| either use an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS- |
| SESS option when compiling. |
| |
| There are a number of features of PCRE2 regular expressions that are |
| not supported or behave differently in the alternative matching func- |
| tion. Those that are not supported cause an error if encountered. |
| |
| 1. Because the algorithm finds all possible matches, the greedy or un- |
| greedy nature of repetition quantifiers is not relevant (though it may |
| affect auto-possessification, as just described). During matching, |
| greedy and ungreedy quantifiers are treated in exactly the same way. |
| However, possessive quantifiers can make a difference when what follows |
| could also match what is quantified, for example in a pattern like |
| this: |
| |
| ^a++\w! |
| |
| This pattern matches "aaab!" but not "aaa!", which would be matched by |
| a non-possessive quantifier. Similarly, if an atomic group is present, |
| it is matched as if it were a standalone pattern at the current point, |
| and the longest match is then "locked in" for the rest of the overall |
| pattern. |
| |
| 2. When dealing with multiple paths through the tree simultaneously, it |
| is not straightforward to keep track of captured substrings for the |
| different matching possibilities, and PCRE2's implementation of this |
| algorithm does not attempt to do this. This means that no captured sub- |
| strings are available. |
| |
| 3. Because no substrings are captured, backreferences within the pat- |
| tern are not supported. |
| |
| 4. For the same reason, conditional expressions that use a backrefer- |
| ence as the condition or test for a specific group recursion are not |
| supported. |
| |
| 5. Again for the same reason, script runs are not supported. |
| |
| 6. Because many paths through the tree may be active, the \K escape se- |
| quence, which resets the start of the match when encountered (but may |
| be on some paths and not on others), is not supported. |
| |
| 7. Callouts are supported, but the value of the capture_top field is |
| always 1, and the value of the capture_last field is always 0. |
| |
| 8. The \C escape sequence, which (in the standard algorithm) always |
| matches a single code unit, even in a UTF mode, is not supported in |
| these modes, because the alternative algorithm moves through the sub- |
| ject string one character (not code unit) at a time, for all active |
| paths through the tree. |
| |
| 9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) |
| are not supported. (*FAIL) is supported, and behaves like a failing |
| negative assertion. |
| |
| 10. The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not sup- |
| ported by pcre2_dfa_match(). |
| |
| |
| ADVANTAGES OF THE ALTERNATIVE ALGORITHM |
| |
| The main advantage of the alternative algorithm is that all possible |
| matches (at a single point in the subject) are automatically found, and |
| in particular, the longest match is found. To find more than one match |
| at the same point using the standard algorithm, you have to do kludgy |
| things with callouts. |
| |
| Partial matching is possible with this algorithm, though it has some |
| limitations. The pcre2partial documentation gives details of partial |
| matching and discusses multi-segment matching. |
| |
| |
| DISADVANTAGES OF THE ALTERNATIVE ALGORITHM |
| |
| The alternative algorithm suffers from a number of disadvantages: |
| |
| 1. It is substantially slower than the standard algorithm. This is |
| partly because it has to search for all possible matches, but is also |
| because it is less susceptible to optimization. |
| |
| 2. Capturing parentheses, backreferences, script runs, and matching |
| within invalid UTF strings are not supported. |
| |
| 3. Although atomic groups are supported, their use does not provide the |
| performance advantage that it does for the standard algorithm. |
| |
| 4. JIT optimization is not supported. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| Retired from University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 28 August 2021 |
| Copyright (c) 1997-2021 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2PARTIAL(3) Library Functions Manual PCRE2PARTIAL(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions |
| |
| PARTIAL MATCHING IN PCRE2 |
| |
| In normal use of PCRE2, if there is a match up to the end of a subject |
| string, but more characters are needed to match the entire pattern, |
| PCRE2_ERROR_NOMATCH is returned, just like any other failing match. |
| There are circumstances where it might be helpful to distinguish this |
| "partial match" case. |
| |
| One example is an application where the subject string is very long, |
| and not all available at once. The requirement here is to be able to do |
| the matching segment by segment, but special action is needed when a |
| matched substring spans the boundary between two segments. |
| |
| Another example is checking a user input string as it is typed, to en- |
| sure that it conforms to a required format. Invalid characters can be |
| immediately diagnosed and rejected, giving instant feedback. |
| |
| Partial matching is a PCRE2-specific feature; it is not Perl-compati- |
| ble. It is requested by setting one of the PCRE2_PARTIAL_HARD or |
| PCRE2_PARTIAL_SOFT options when calling a matching function. The dif- |
| ference between the two options is whether or not a partial match is |
| preferred to an alternative complete match, though the details differ |
| between the two types of matching function. If both options are set, |
| PCRE2_PARTIAL_HARD takes precedence. |
| |
| If you want to use partial matching with just-in-time optimized code, |
| as well as setting a partial match option for the matching function, |
| you must also call pcre2_jit_compile() with one or both of these op- |
| tions: |
| |
| PCRE2_JIT_PARTIAL_HARD |
| PCRE2_JIT_PARTIAL_SOFT |
| |
| PCRE2_JIT_COMPLETE should also be set if you are going to run non-par- |
| tial matches on the same pattern. Separate code is compiled for each |
| mode. If the appropriate JIT mode has not been compiled, interpretive |
| matching code is used. |
| |
| Setting a partial matching option disables two of PCRE2's standard op- |
| timization hints. PCRE2 remembers the last literal code unit in a pat- |
| tern, and abandons matching immediately if it is not present in the |
| subject string. This optimization cannot be used for a subject string |
| that might match only partially. PCRE2 also remembers a minimum length |
| of a matching string, and does not bother to run the matching function |
| on shorter strings. This optimization is also disabled for partial |
| matching. |
| |
| |
| REQUIREMENTS FOR A PARTIAL MATCH |
| |
| A possible partial match occurs during matching when the end of the |
| subject string is reached successfully, but either more characters are |
| needed to complete the match, or the addition of more characters might |
| change what is matched. |
| |
| Example 1: if the pattern is /abc/ and the subject is "ab", more char- |
| acters are definitely needed to complete a match. In this case both |
| hard and soft matching options yield a partial match. |
| |
| Example 2: if the pattern is /ab+/ and the subject is "ab", a complete |
| match can be found, but the addition of more characters might change |
| what is matched. In this case, only PCRE2_PARTIAL_HARD returns a par- |
| tial match; PCRE2_PARTIAL_SOFT returns the complete match. |
| |
| On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if |
| the next pattern item is \z, \Z, \b, \B, or $ there is always a partial |
| match. Otherwise, for both options, the next pattern item must be one |
| that inspects a character, and at least one of the following must be |
| true: |
| |
| (1) At least one character has already been inspected. An inspected |
| character need not form part of the final matched string; lookbehind |
| assertions and the \K escape sequence provide ways of inspecting char- |
| acters before the start of a matched string. |
| |
| (2) The pattern contains one or more lookbehind assertions. This condi- |
| tion exists in case there is a lookbehind that inspects characters be- |
| fore the start of the match. |
| |
| (3) There is a special case when the whole pattern can match an empty |
| string. When the starting point is at the end of the subject, the |
| empty string match is a possibility, and if PCRE2_PARTIAL_SOFT is set |
| and neither of the above conditions is true, it is returned. However, |
| because adding more characters might result in a non-empty match, |
| PCRE2_PARTIAL_HARD returns a partial match, which in this case means |
| "there is going to be a match at this point, but until some more char- |
| acters are added, we do not know if it will be an empty string or some- |
| thing longer". |
| |
| |
| PARTIAL MATCHING USING pcre2_match() |
| |
| When a partial matching option is set, the result of calling |
| pcre2_match() can be one of the following: |
| |
| A successful match |
| A complete match has been found, starting and ending within this sub- |
| ject. |
| |
| PCRE2_ERROR_NOMATCH |
| No match can start anywhere in this subject. |
| |
| PCRE2_ERROR_PARTIAL |
| Adding more characters may result in a complete match that uses one |
| or more characters from the end of this subject. |
| |
| When a partial match is returned, the first two elements in the ovector |
| point to the portion of the subject that was matched, but the values in |
| the rest of the ovector are undefined. The appearance of \K in the pat- |
| tern has no effect for a partial match. Consider this pattern: |
| |
| /abc\K123/ |
| |
| If it is matched against "456abc123xyz" the result is a complete match, |
| and the ovector defines the matched string as "123", because \K resets |
| the "start of match" point. However, if a partial match is requested |
| and the subject string is "456abc12", a partial match is found for the |
| string "abc12", because all these characters are needed for a subse- |
| quent re-match with additional characters. |
| |
| If there is more than one partial match, the first one that was found |
| provides the data that is returned. Consider this pattern: |
| |
| /123\w+X|dogY/ |
| |
| If this is matched against the subject string "abc123dog", both alter- |
| natives fail to match, but the end of the subject is reached during |
| matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 |
| and 9, identifying "123dog" as the first partial match. (In this exam- |
| ple, there are two partial matches, because "dog" on its own partially |
| matches the second alternative.) |
| |
| How a partial match is processed by pcre2_match() |
| |
| What happens when a partial match is identified depends on which of the |
| two partial matching options is set. |
| |
| If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon |
| as a partial match is found, without continuing to search for possible |
| complete matches. This option is "hard" because it prefers an earlier |
| partial match over a later complete match. For this reason, the assump- |
| tion is made that the end of the supplied subject string is not the |
| true end of the available data, which is why \z, \Z, \b, \B, and $ al- |
| ways give a partial match. |
| |
| If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but |
| matching continues as normal, and other alternatives in the pattern are |
| tried. If no complete match can be found, PCRE2_ERROR_PARTIAL is re- |
| turned instead of PCRE2_ERROR_NOMATCH. This option is "soft" because it |
| prefers a complete match over a partial match. All the various matching |
| items in a pattern behave as if the subject string is potentially com- |
| plete; \z, \Z, and $ match at the end of the subject, as normal, and |
| for \b and \B the end of the subject is treated as a non-alphanumeric. |
| |
| The difference between the two partial matching options can be illus- |
| trated by a pattern such as: |
| |
| /dog(sbody)?/ |
| |
| This matches either "dog" or "dogsbody", greedily (that is, it prefers |
| the longer string if possible). If it is matched against the string |
| "dog" with PCRE2_PARTIAL_SOFT, it yields a complete match for "dog". |
| However, if PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR- |
| TIAL. On the other hand, if the pattern is made ungreedy the result is |
| different: |
| |
| /dog(sbody)??/ |
| |
| In this case the result is always a complete match because that is |
| found first, and matching never continues after finding a complete |
| match. It might be easier to follow this explanation by thinking of the |
| two patterns like this: |
| |
| /dog(sbody)?/ is the same as /dogsbody|dog/ |
| /dog(sbody)??/ is the same as /dog|dogsbody/ |
| |
| The second pattern will never match "dogsbody", because it will always |
| find the shorter match first. |
| |
| Example of partial matching using pcre2test |
| |
| The pcre2test data modifiers partial_hard (or ph) and partial_soft (or |
| ps) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT, respectively, when |
| calling pcre2_match(). Here is a run of pcre2test using a pattern that |
| matches the whole subject in the form of a date: |
| |
| re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
| data> 25dec3\=ph |
| Partial match: 25dec3 |
| data> 3ju\=ph |
| Partial match: 3ju |
| data> 3juj\=ph |
| No match |
| |
| This example gives the same results for both hard and soft partial |
| matching options. Here is an example where there is a difference: |
| |
| re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
| data> 25jun04\=ps |
| 0: 25jun04 |
| 1: jun |
| data> 25jun04\=ph |
| Partial match: 25jun04 |
| |
| With PCRE2_PARTIAL_SOFT, the subject is matched completely. For |
| PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, |
| so there is only a partial match. |
| |
| |
| MULTI-SEGMENT MATCHING WITH pcre2_match() |
| |
| PCRE was not originally designed with multi-segment matching in mind. |
| However, over time, features (including partial matching) that make |
| multi-segment matching possible have been added. A very long string can |
| be searched segment by segment by calling pcre2_match() repeatedly, |
| with the aim of achieving the same results that would happen if the en- |
| tire string was available for searching all the time. Normally, the |
| strings that are being sought are much shorter than each individual |
| segment, and are in the middle of very long strings, so the pattern is |
| normally not anchored. |
| |
| Special logic must be implemented to handle a matched substring that |
| spans a segment boundary. PCRE2_PARTIAL_HARD should be used, because it |
| returns a partial match at the end of a segment whenever there is the |
| possibility of changing the match by adding more characters. The |
| PCRE2_NOTBOL option should also be set for all but the first segment. |
| |
| When a partial match occurs, the next segment must be added to the cur- |
| rent subject and the match re-run, using the startoffset argument of |
| pcre2_match() to begin at the point where the partial match started. |
| For example: |
| |
| re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/ |
| data> ...the date is 23ja\=ph |
| Partial match: 23ja |
| data> ...the date is 23jan19 and on that day...\=offset=15 |
| 0: 23jan19 |
| 1: jan |
| |
| Note the use of the offset modifier to start the new match where the |
| partial match was found. In this example, the next segment was added to |
| the one in which the partial match was found. This is the most |
| straightforward approach, typically using a memory buffer that is twice |
| the size of each segment. After a partial match, the first half of the |
| buffer is discarded, the second half is moved to the start of the buf- |
| fer, and a new segment is added before repeating the match as in the |
| example above. After a "no match" result, the entire buffer can be discarded. |
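| |
| The buffer handling described above can be sketched in C as follows. |
| This is only an illustration under stated assumptions: the seg_buffer |
| type and the slide_and_append() function are hypothetical names that |
| are not part of the PCRE2 API, and the sketch assumes that a partial |
| match never starts in the half of the buffer that is about to be |
| discarded. The caller would then re-run pcre2_match() on the buffer, |
| passing the adjusted offset as the startoffset argument. |
| |

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical double-size buffer; not a PCRE2 type. */
typedef struct {
    char  *data;      /* storage with room for 2 * seg_size bytes */
    size_t seg_size;  /* size of one segment */
    size_t len;       /* number of valid bytes currently held */
} seg_buffer;

/* Make room for the next segment and append it. If the buffer already
   holds more than one segment, the first half is discarded and the
   rest is moved to the start. *match_start (the offset at which the
   partial match began) is adjusted to stay valid. Returns 0 on
   success, or -1 if the partial match starts in the half that would be
   discarded or if the new segment does not fit. */
int slide_and_append(seg_buffer *b, const char *seg, size_t seg_len,
                     size_t *match_start)
{
    if (b->len > b->seg_size) {
        size_t drop = b->seg_size;
        if (*match_start < drop)
            return -1;                 /* assumption violated */
        memmove(b->data, b->data + drop, b->len - drop);
        b->len -= drop;
        *match_start -= drop;
    }
    if (seg_len > 2 * b->seg_size - b->len)
        return -1;                     /* segment too large for the buffer */
    memcpy(b->data + b->len, seg, seg_len);
    b->len += seg_len;
    return 0;
}
```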
| |
| If there are memory constraints, you may want to discard text that pre- |
| cedes a partial match before adding the next segment. Unfortunately, |
| this is not at present straightforward. In cases such as the above, |
| where the pattern does not contain any lookbehinds, it is sufficient to |
| retain only the partially matched substring. However, if the pattern |
| contains a lookbehind assertion, characters that precede the start of |
| the partial match may have been inspected during the matching process. |
| When pcre2test displays a partial match, it indicates these characters |
| with '<' if the allusedtext modifier is set: |
| |
| re> "(?<=123)abc" |
| data> xx123ab\=ph,allusedtext |
| Partial match: 123ab |
|                <<< |
| |
| However, the allusedtext modifier is not available for JIT matching, |
| because JIT matching does not record the first (or last) consulted |
| characters. For this reason, this information is not available via the |
| API. It is therefore not possible in general to obtain the exact number |
| of characters that must be retained in order to get the right match re- |
| sult. If you cannot retain the entire segment, you must find some |
| heuristic way of choosing. |
| |
| If you know the approximate length of the matching substrings, you can |
| use that to decide how much text to retain. The only lookbehind infor- |
| mation that is currently available via the API is the length of the |
| longest individual lookbehind in a pattern, but this can be misleading |
| if there are nested lookbehinds. The value returned by calling |
| pcre2_pattern_info() with the PCRE2_INFO_MAXLOOKBEHIND option is the |
| maximum number of characters (not code units) that any individual look- |
| behind moves back when it is processed. A pattern such as |
| "(?<=(?<!b)a)" has a maximum lookbehind value of one, but inspects two |
| characters before its starting point. |
| |
| In a non-UTF or a 32-bit case, moving back is just a subtraction, but |
| in UTF-8 or UTF-16 you have to count characters while moving back |
| through the code units. |
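| |
| As a rough illustration of the UTF-8 case, the following C sketch |
| moves an offset back by a given number of characters. The function |
| name utf8_move_back() is hypothetical (it is not a PCRE2 function), |
| and the code assumes that the subject is valid UTF-8 and that the |
| starting offset lies on a character boundary. |
| |

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Moves "offset" back by up to
   "nchars" characters in a valid UTF-8 subject and returns the new
   offset in code units. It stops early at the start of the subject. */
size_t utf8_move_back(const uint8_t *subject, size_t offset, size_t nchars)
{
    while (nchars > 0 && offset > 0) {
        offset--;                      /* step onto the previous code unit */
        /* Continuation bytes have the form 10xxxxxx; keep moving back
           until the first code unit of the character is reached. */
        while (offset > 0 && (subject[offset] & 0xC0) == 0x80)
            offset--;
        nchars--;
    }
    return offset;
}
```

| In a non-UTF or 32-bit case the same result is simply the offset minus |
| the number of characters, limited to zero. |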
| |
| |
| PARTIAL MATCHING USING pcre2_dfa_match() |
| |
| The DFA function moves along the subject string character by character, |
| without backtracking, searching for all possible matches simultane- |
| ously. If the end of the subject is reached before the end of the pat- |
| tern, there is the possibility of a partial match. |
| |
| When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if |
| there have been no complete matches. Otherwise, the complete matches |
| are returned. If PCRE2_PARTIAL_HARD is set, a partial match takes |
| precedence over any complete matches. The portion of the string that |
| was matched when the longest partial match was found is set as the |
| first matching string. |
| |
| Because the DFA function always searches for all possible matches, and |
| there is no difference between greedy and ungreedy repetition, its |
| behaviour is different from that of pcre2_match(). Consider the string |
| "dog" matched against this ungreedy pattern: |
| |
| /dog(sbody)??/ |
| |
| Whereas the standard function stops as soon as it finds the complete |
| match for "dog", the DFA function also finds the partial match for |
| "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set. |
| |
| |
| MULTI-SEGMENT MATCHING WITH pcre2_dfa_match() |
| |
| When a partial match has been found using the DFA matching function, it |
| is possible to continue the match by providing additional subject data |
| and calling the function again with the same compiled regular expres- |
| sion, this time setting the PCRE2_DFA_RESTART option. You must pass the |
| same working space as before, because this is where details of the pre- |
| vious partial match are stored. You can set the PCRE2_PARTIAL_SOFT or |
| PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART to continue partial |
| matching over multiple segments. Here is an example using pcre2test: |
| |
| re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
| data> 23ja\=dfa,ps |
| Partial match: 23ja |
| data> n05\=dfa,dfa_restart |
| 0: n05 |
| |
| The first call has "23ja" as the subject, and requests partial match- |
| ing; the second call has "n05" as the subject for the continued |
| (restarted) match. Notice that when the match is complete, only the |
| last part is shown; PCRE2 does not retain the previously partially- |
| matched string. It is up to the calling program to do that if it needs |
| to. This means that, for an unanchored pattern, if a continued match |
| fails, it is not possible to try again at a new starting point. All |
| this facility is capable of doing is continuing with the previous match |
| attempt. For example, consider this pattern: |
| |
| 1234|3789 |
| |
| If the first part of the subject is "ABC123", a partial match of the |
| first alternative is found at offset 3. There is no partial match for |
| the second alternative, because such a match does not start at the same |
| point in the subject string. Attempting to continue with the string |
| "7890" does not yield a match because only those alternatives that |
| match at one point in the subject are remembered. Depending on the ap- |
| plication, this may or may not be what you want. |
| |
| If you do want to allow for starting again at the next character, one |
| way of doing it is to retain some or all of the segment and try a new |
| complete match, as described for pcre2_match() above. Another possibil- |
| ity is to work with two buffers. If a partial match at offset n in the |
| first buffer is followed by "no match" when PCRE2_DFA_RESTART is used |
| on the second buffer, you can then try a new match starting at offset |
| n+1 in the first buffer. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 04 September 2019 |
| Copyright (c) 1997-2019 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| PCRE2 REGULAR EXPRESSION DETAILS |
| |
| The syntax and semantics of the regular expressions that are supported |
| by PCRE2 are described in detail below. There is a quick-reference syn- |
| tax summary in the pcre2syntax page. PCRE2 tries to match Perl syntax |
| and semantics as closely as it can. PCRE2 also supports some alterna- |
| tive regular expression syntax (which does not conflict with the Perl |
| syntax) in order to provide some compatibility with regular expressions |
| in Python, .NET, and Oniguruma. |
| |
| Perl's regular expressions are described in its own documentation, and |
| regular expressions in general are covered in a number of books, some |
| of which have copious examples. Jeffrey Friedl's "Mastering Regular Ex- |
| pressions", published by O'Reilly, covers regular expressions in great |
| detail. This description of PCRE2's regular expressions is intended as |
| reference material. |
| |
| This document discusses the regular expression patterns that are sup- |
| ported by PCRE2 when its main matching function, pcre2_match(), is |
| used. PCRE2 also has an alternative matching function, |
| pcre2_dfa_match(), which matches using a different algorithm that is |
| not Perl-compatible. Some of the features discussed below are not |
| available when DFA matching is used. The advantages and disadvantages |
| of the alternative function, and how it differs from the normal func- |
| tion, are discussed in the pcre2matching page. |
| |
| |
| SPECIAL START-OF-PATTERN ITEMS |
| |
| A number of options that can be passed to pcre2_compile() can also be |
| set by special items at the start of a pattern. These are not Perl-com- |
| patible, but are provided to make these options accessible to pattern |
| writers who are not able to change the program that processes the pat- |
| tern. Any number of these items may appear, but they must all be to- |
| gether right at the start of the pattern string, and the letters must |
| be in upper case. |
| |
| UTF support |
| |
| In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either |
| as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 |
| can be specified for the 32-bit library, in which case it constrains |
| the character values to valid Unicode code points. To process UTF |
| strings, PCRE2 must be built to include Unicode support (which is the |
| default). When using UTF strings you must either call the compiling |
| function with one or both of the PCRE2_UTF or PCRE2_MATCH_INVALID_UTF |
| options, or the pattern must start with the special sequence (*UTF), |
| which is equivalent to setting the PCRE2_UTF option. How setting a |
| UTF mode affects pattern matching is mentioned in several places below. |
| There is also a summary of features in the pcre2unicode page. |
| |
| Some applications that allow their users to supply patterns may wish to |
| restrict them to non-UTF data for security reasons. If the |
| PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not al- |
| lowed, and its appearance in a pattern causes an error. |
| |
| Unicode property support |
| |
| Another special sequence that may appear at the start of a pattern is |
| (*UCP). This has the same effect as setting the PCRE2_UCP option: it |
| causes sequences such as \d and \w to use Unicode properties to deter- |
| mine character types, instead of recognizing only characters with codes |
| less than 256 via a lookup table. It also causes upper/lower casing op- |
| erations to use Unicode properties for characters with code points |
| greater than 127, even when UTF is not set. |
| |
| Some applications that allow their users to supply patterns may wish to |
| restrict them for security reasons. If the PCRE2_NEVER_UCP option is |
| passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in |
| a pattern causes an error. |
| |
| Locking out empty string matching |
| |
| Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same |
| effect as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option |
| to whichever matching function is subsequently called to match the pat- |
| tern. These options lock out the matching of empty strings, either en- |
| tirely, or only at the start of the subject. |
| |
| Disabling auto-possessification |
| |
| If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as |
| setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making |
| quantifiers possessive when what follows cannot match the repeated |
| item. For example, by default a+b is treated as a++b. For more details, |
| see the pcre2api documentation. |
| |
| Disabling start-up optimizations |
| |
| If a pattern starts with (*NO_START_OPT), it has the same effect as |
| setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti- |
| mizations for quickly reaching "no match" results. For more details, |
| see the pcre2api documentation. |
| |
| Disabling automatic anchoring |
| |
| If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect |
| as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza- |
| tions that apply to patterns whose top-level branches all start with .* |
| (match any number of arbitrary characters). For more details, see the |
| pcre2api documentation. |
| |
| Disabling JIT compilation |
| |
| If a pattern that starts with (*NO_JIT) is successfully compiled, an |
| attempt by the application to apply the JIT optimization by calling |
| pcre2_jit_compile() is ignored. |
| |
| Setting match resource limits |
| |
| The pcre2_match() function contains a counter that is incremented every |
| time it goes round its main loop. The caller of pcre2_match() can set a |
| limit on this counter, which therefore limits the amount of computing |
| resource used for a match. The maximum depth of nested backtracking can |
| also be limited; this indirectly restricts the amount of heap memory |
| that is used, but there is also an explicit memory limit that can be |
| set. |
| |
| These facilities are provided to catch runaway matches that are pro- |
| voked by patterns with huge matching trees. A common example is a pat- |
| tern with nested unlimited repeats applied to a long string that does |
| not match. When one of these limits is reached, pcre2_match() gives an |
| error return. The limits can also be set by items at the start of the |
| pattern of the form |
| |
| (*LIMIT_HEAP=d) |
| (*LIMIT_MATCH=d) |
| (*LIMIT_DEPTH=d) |
| |
| where d is any number of decimal digits. However, the value of the set- |
| ting must be less than the value set (or defaulted) by the caller of |
| pcre2_match() for it to have any effect. In other words, the pattern |
| writer can lower the limits set by the programmer, but not raise them. |
| If there is more than one setting of one of these limits, the lower |
| value is used. The heap limit is specified in kibibytes (units of 1024 |
| bytes). |
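| |
| The rule for combining limits can be pictured with this small C |
| sketch. The function effective_limit() is a hypothetical illustration |
| and not part of PCRE2: it takes the limit set (or defaulted) by the |
| caller and the values given by items at the start of the pattern, and |
| yields the value that the description above implies. |
| |

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration, not a PCRE2 function. A pattern setting
   has an effect only if it is lower than the caller's limit, and when
   there are several settings the lowest one wins. */
uint32_t effective_limit(uint32_t caller_limit,
                         const uint32_t *pattern_settings, size_t count)
{
    uint32_t limit = caller_limit;
    for (size_t i = 0; i < count; i++) {
        if (pattern_settings[i] < limit)
            limit = pattern_settings[i];   /* can lower, never raise */
    }
    return limit;
}
```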
| |
| Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This |
| name is still recognized for backwards compatibility. |
| |
| The heap limit applies only when the pcre2_match() or pcre2_dfa_match() |
| interpreters are used for matching. It does not apply to JIT. The match |
| limit is used (but in a different way) when JIT is being used, or when |
| pcre2_dfa_match() is called, to limit computing resource usage by those |
| matching functions. The depth limit is ignored by JIT but is relevant |
| for DFA matching, which uses function recursion for recursions within |
| the pattern and for lookaround assertions and atomic groups. In this |
| case, the depth limit controls the depth of such recursion. |
| |
| Newline conventions |
| |
| PCRE2 supports six different conventions for indicating line breaks in |
| strings: a single CR (carriage return) character, a single LF (line- |
| feed) character, the two-character sequence CRLF, any of the three pre- |
| ceding, any Unicode newline sequence, or the NUL character (binary |
| zero). The pcre2api page has further discussion about newlines, and |
| shows how to set the newline convention when calling pcre2_compile(). |
| |
| It is also possible to specify a newline convention by starting a pat- |
| tern string with one of the following sequences: |
| |
| (*CR) carriage return |
| (*LF) linefeed |
| (*CRLF) carriage return, followed by linefeed |
| (*ANYCRLF) any of the three above |
| (*ANY) all Unicode newline sequences |
| (*NUL) the NUL character (binary zero) |
| |
| These override the default and the options given to the compiling func- |
| tion. For example, on a Unix system where LF is the default newline se- |
| quence, the pattern |
| |
| (*CR)a.b |
| |
| changes the convention to CR. That pattern matches "a\nb" because LF is |
| no longer a newline. If more than one of these settings is present, the |
| last one is used. |
| |
| The newline convention affects where the circumflex and dollar asser- |
| tions are true. It also affects the interpretation of the dot metachar- |
| acter when PCRE2_DOTALL is not set, and the behaviour of \N when not |
| followed by an opening brace. However, it does not affect what the \R |
| escape sequence matches. By default, this is any Unicode newline se- |
| quence, for Perl compatibility. However, this can be changed; see the |
| next section and the description of \R in the section entitled "Newline |
| sequences" below. A change of \R setting can be combined with a change |
| of newline convention. |
| |
| Specifying what \R matches |
| |
| It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
| the complete set of Unicode line endings) by setting the option |
| PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by |
| starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI- |
| CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE. |
| |
| |
| EBCDIC CHARACTER CODES |
| |
| PCRE2 can be compiled to run in an environment that uses EBCDIC as its |
| character code instead of ASCII or Unicode (typically a mainframe sys- |
| tem). In the sections below, character code values are ASCII or Uni- |
| code; in an EBCDIC environment these characters may have different code |
| values, and there are no code points greater than 255. |
| |
| |
| CHARACTERS AND METACHARACTERS |
| |
| A regular expression is a pattern that is matched against a subject |
| string from left to right. Most characters stand for themselves in a |
| pattern, and match the corresponding characters in the subject. As a |
| trivial example, the pattern |
| |
| The quick brown fox |
| |
| matches a portion of a subject string that is identical to itself. When |
| caseless matching is specified (the PCRE2_CASELESS option or (?i) |
| within the pattern), letters are matched independently of case. Note |
| that there are two ASCII characters, K and S, that, in addition to |
| their lower case ASCII equivalents, are case-equivalent with Unicode |
| U+212A (Kelvin sign) and U+017F (long S) respectively when either |
| PCRE2_UTF or PCRE2_UCP is set. |
| |
| The power of regular expressions comes from the ability to include wild |
| cards, character classes, alternatives, and repetitions in the pattern. |
| These are encoded in the pattern by the use of metacharacters, which do |
| not stand for themselves but instead are interpreted in some special |
| way. |
| |
| There are two different sets of metacharacters: those that are recog- |
| nized anywhere in the pattern except within square brackets, and those |
| that are recognized within square brackets. Outside square brackets, |
| the metacharacters are as follows: |
| |
| \ general escape character with several uses |
| ^ assert start of string (or line, in multiline mode) |
| $ assert end of string (or line, in multiline mode) |
| . match any character except newline (by default) |
| [ start character class definition |
| | start of alternative branch |
| ( start group or control verb |
| ) end group or control verb |
| * 0 or more quantifier |
| + 1 or more quantifier; also "possessive quantifier" |
| ? 0 or 1 quantifier; also quantifier minimizer |
| { start min/max quantifier |
| |
| Part of a pattern that is in square brackets is called a "character |
| class". In a character class the only metacharacters are: |
| |
| \ general escape character |
| ^ negate the class, but only if the first character |
| - indicates character range |
| [ POSIX character class (if followed by POSIX syntax) |
| ] terminates the character class |
| |
| If a pattern is compiled with the PCRE2_EXTENDED option, most white |
| space in the pattern, other than in a character class, and characters |
| between a # outside a character class and the next newline, inclusive, |
| are ignored. An escaping backslash can be used to include a white space |
| or a # character as part of the pattern. If the PCRE2_EXTENDED_MORE op- |
| tion is set, the same applies, but in addition unescaped space and hor- |
| izontal tab characters are ignored inside a character class. Note: only |
| these two characters are ignored, not the full set of pattern white |
| space characters that are ignored outside a character class. Option |
| settings can be changed within a pattern; see the section entitled "In- |
| ternal Option Setting" below. |
| |
| The following sections describe the use of each of the metacharacters. |
| |
| |
| BACKSLASH |
| |
| The backslash character has several uses. Firstly, if it is followed by |
| a character that is not a digit or a letter, it takes away any special |
| meaning that character may have. This use of backslash as an escape |
| character applies both inside and outside character classes. |
| |
| For example, if you want to match a * character, you must write \* in |
| the pattern. This escaping action applies whether or not the following |
| character would otherwise be interpreted as a metacharacter, so it is |
| always safe to precede a non-alphanumeric with backslash to specify |
| that it stands for itself. In particular, if you want to match a back- |
| slash, you write \\. |
| |
| Only ASCII digits and letters have any special meaning after a back- |
| slash. All other characters (in particular, those whose code points are |
| greater than 127) are treated as literals. |
| |
| If you want to treat all characters in a sequence as literals, you can |
| do so by putting them between \Q and \E. This is different from Perl in |
| that $ and @ are handled as literals in \Q...\E sequences in PCRE2, |
| whereas in Perl, $ and @ cause variable interpolation. Also, Perl does |
| "double-quotish backslash interpolation" on any backslashes between \Q |
| and \E which, its documentation says, "may lead to confusing results". |
| PCRE2 treats a backslash between \Q and \E just like any other charac- |
| ter. Note the following examples: |
| |
| Pattern             PCRE2 matches   Perl matches |
| |
| \Qabc$xyz\E         abc$xyz         abc followed by the |
|                                     contents of $xyz |
| \Qabc\$xyz\E        abc\$xyz        abc\$xyz |
| \Qabc\E\$\Qxyz\E    abc$xyz         abc$xyz |
| \QA\B\E             A\B             A\B |
| \Q\\E               \               \\E |
| |
| The \Q...\E sequence is recognized both inside and outside character |
| classes. An isolated \E that is not preceded by \Q is ignored. If \Q |
| is not followed by \E later in the pattern, the literal interpretation |
| continues to the end of the pattern (that is, \E is assumed at the |
| end). If the isolated \Q is inside a character class, this causes an |
| error, because the character class is not terminated by a closing |
| square bracket. |
| |
| Non-printing characters |
| |
| A second use of backslash provides a way of encoding non-printing char- |
| acters in patterns in a visible manner. There is no restriction on the |
| appearance of non-printing characters in a pattern, but when a pattern |
| is being prepared by text editing, it is often easier to use one of the |
| following escape sequences instead of the binary character it repre- |
| sents. In an ASCII or Unicode environment, these escapes are as fol- |
| lows: |
| |
| \a alarm, that is, the BEL character (hex 07) |
| \cx "control-x", where x is any printable ASCII character |
| \e escape (hex 1B) |
| \f form feed (hex 0C) |
| \n linefeed (hex 0A) |
| \r carriage return (hex 0D) (but see below) |
| \t tab (hex 09) |
| \0dd character with octal code 0dd |
| \ddd character with octal code ddd, or backreference |
| \o{ddd..} character with octal code ddd.. |
| \xhh character with hex code hh |
| \x{hhh..} character with hex code hhh.. |
| \N{U+hhh..} character with Unicode hex code point hhh.. |
| |
| By default, after \x that is not followed by {, from zero to two hexa- |
| decimal digits are read (letters can be in upper or lower case). Any |
| number of hexadecimal digits may appear between \x{ and }. If a charac- |
| ter other than a hexadecimal digit appears between \x{ and }, or if |
| there is no terminating }, an error occurs. |
| |
| Characters whose code points are less than 256 can be defined by either |
| of the two syntaxes for \x or by an octal sequence. There is no differ- |
| ence in the way they are handled. For example, \xdc is exactly the same |
| as \x{dc} or \334. However, using the braced versions does make such |
| sequences easier to read. |
| |
| Support is available for some ECMAScript (aka JavaScript) escape se- |
| quences via two compile-time options. If PCRE2_ALT_BSUX is set, the se- |
| quence \x followed by { is not recognized. Only if \x is followed by |
| two hexadecimal digits is it recognized as a character escape. Other- |
| wise it is interpreted as a literal "x" character. In this mode, sup- |
| port for code points greater than 255 is provided by \u, which must be |
| followed by four hexadecimal digits; otherwise it is interpreted as a |
| literal "u" character. |
| |
| PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in ad- |
| dition, \u{hhh..} is recognized as the character specified by hexadeci- |
| mal code point. There may be any number of hexadecimal digits. This |
| syntax is from ECMAScript 6. |
| |
| The \N{U+hhh..} escape sequence is recognized only when PCRE2 is oper- |
| ating in UTF mode. Perl also uses \N{name} to specify characters by |
| Unicode name; PCRE2 does not support this. Note that when \N is not |
| followed by an opening brace (curly bracket) it has an entirely differ- |
| ent meaning, matching any character that is not a newline. |
| |
| There are some legacy applications where the escape sequence \r is ex- |
| pected to match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option |
| is set, \r in a pattern is converted to \n so that it matches a LF |
| (linefeed) instead of a CR (carriage return) character. |
| |
| The precise effect of \cx on ASCII characters is as follows: if x is a |
| lower case letter, it is converted to upper case. Then bit 6 of the |
| character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A |
| (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes |
| hex 7B (; is 3B). If the code unit following \c has a value less than |
| 32 or greater than 126, a compile-time error occurs. |
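| |
| The ASCII rule for \cx amounts to the following C sketch. The function |
| name control_escape() is an assumption made for illustration; it re- |
| turns -1 in the cases where a compile-time error would occur. |
| |

```c
/* Hypothetical illustration of the ASCII \cx rule; not PCRE2 code.
   "x" is the code unit that follows \c. */
int control_escape(int x)
{
    if (x < 32 || x > 126)
        return -1;              /* compile-time error case */
    if (x >= 'a' && x <= 'z')
        x -= 32;                /* lower case letter: convert to upper case */
    return x ^ 0x40;            /* invert bit 6 (hex 40) */
}
```

| For example, control_escape('A') gives hex 01, control_escape('{') |
| gives hex 3B, and control_escape(';') gives hex 7B. |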
| |
| When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported. |
| \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values. |
| The \c escape is processed as specified for Perl in the perlebcdic doc- |
| ument. The only characters that are allowed after \c are A-Z, a-z, or |
| one of @, [, \, ], ^, _, or ?. Any other character provokes a compile- |
| time error. The sequence \c@ encodes character code 0; after \c the |
| letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [, |
| \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c? be- |
| comes either 255 (hex FF) or 95 (hex 5F). |
| |
| Thus, apart from \c?, these escapes generate the same character code |
| values as they do in an ASCII environment, though the meanings of the |
| values mostly differ. For example, \cG always generates code value 7, |
| which is BEL in ASCII but DEL in EBCDIC. |
| |
| The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, |
| but because 127 is not a control character in EBCDIC, Perl makes it |
| generate the APC character. Unfortunately, there are several variants |
| of EBCDIC. In most of them the APC character has the value 255 (hex |
| FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If |
| certain other characters have POSIX-BC values, PCRE2 makes \c? generate |
| 95; otherwise it generates 255. |
| |
| After \0 up to two further octal digits are read. If there are fewer |
| than two digits, just those that are present are used. Thus the se- |
| quence \0\x\015 specifies two binary zeros followed by a CR character |
| (code value 13). Make sure you supply two digits after the initial zero |
| if the pattern character that follows is itself an octal digit. |
| |
| The escape \o must be followed by a sequence of octal digits, enclosed |
| in braces. An error occurs if this is not the case. This escape is a |
| recent addition to Perl; it provides a way of specifying character code |
| points as octal numbers greater than 0777, and it also allows octal |
| numbers and backreferences to be unambiguously specified. |
| |
| For greater clarity and unambiguity, it is best to avoid following \ by |
| a digit greater than zero. Instead, use \o{} or \x{} to specify numeri- |
| cal character code points, and \g{} to specify backreferences. The fol- |
| lowing paragraphs describe the old, ambiguous syntax. |
| |
| The handling of a backslash followed by a digit other than 0 is compli- |
| cated, and Perl has changed over time, causing PCRE2 also to change. |
| |
| Outside a character class, PCRE2 reads the digit and any following dig- |
| its as a decimal number. If the number is less than 10, begins with the |
| digit 8 or 9, or if there are at least that many previous capture |
| groups in the expression, the entire sequence is taken as a backrefer- |
| ence. A description of how this works is given later, following the |
| discussion of parenthesized groups. Otherwise, up to three octal dig- |
| its are read to form a character code. |
| |
| Inside a character class, PCRE2 handles \8 and \9 as the literal char- |
| acters "8" and "9", and otherwise reads up to three octal digits fol- |
| lowing the backslash, using them to generate a data character. Any sub- |
| sequent digits stand for themselves. For example, outside a character |
| class: |
| |
| \040 is another way of writing an ASCII space |
| \40 is the same, provided there are fewer than 40 |
| previous capture groups |
| \7 is always a backreference |
| \11 might be a backreference, or another way of |
| writing a tab |
| \011 is always a tab |
| \0113 is a tab followed by the character "3" |
| \113 might be a backreference, otherwise the |
| character with octal code 113 |
| \377 might be a backreference, otherwise |
| the value 255 (decimal) |
| \81 is always a backreference |
| |
| Note that octal values of 100 or greater that are specified using this |
| syntax must not be introduced by a leading zero, because no more than |
| three octal digits are ever read. |
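| |
| The rule for a backslash followed by a non-zero digit outside a char- |
| acter class can be summarized by the following C sketch. It illus- |
| trates the description above and is not PCRE2's own parsing code; the |
| digit_escape type and the classify_digit_escape() function are hypo- |
| thetical names. The digits argument must point at the first digit af- |
| ter the backslash, which must be in the range 1 to 9. |
| |

```c
#include <stddef.h>

/* Hypothetical result type; not part of PCRE2. */
typedef struct {
    int      is_backref;  /* non-zero for a backreference */
    unsigned value;       /* group number, or character code if octal */
    size_t   consumed;    /* number of digits used by the escape */
} digit_escape;

/* "groups_so_far" is the number of capture groups that precede the
   escape in the pattern. */
digit_escape classify_digit_escape(const char *digits, unsigned groups_so_far)
{
    digit_escape r = { 0, 0, 0 };
    unsigned long n = 0;
    size_t len = 0;

    /* Read the digit and any following digits as a decimal number. */
    while (digits[len] >= '0' && digits[len] <= '9') {
        if (n < 100000000UL)   /* stop growing on absurdly long runs */
            n = n * 10 + (unsigned long)(digits[len] - '0');
        len++;
    }

    if (n < 10 || digits[0] >= '8' || n <= groups_so_far) {
        r.is_backref = 1;
        r.value = (unsigned)n;
        r.consumed = len;
        return r;
    }

    /* Otherwise up to three octal digits form a character code; any
       later digits stand for themselves. */
    while (r.consumed < 3 &&
           digits[r.consumed] >= '0' && digits[r.consumed] <= '7') {
        r.value = r.value * 8 + (unsigned)(digits[r.consumed] - '0');
        r.consumed++;
    }
    return r;
}
```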
| |
| Constraints on character values |
| |
| Characters that are specified using octal or hexadecimal numbers are |
| limited to certain values, as follows: |
| |
| 8-bit non-UTF mode no greater than 0xff |
| 16-bit non-UTF mode no greater than 0xffff |
| 32-bit non-UTF mode no greater than 0xffffffff |
| All UTF modes no greater than 0x10ffff and a valid code point |
| |
| Invalid Unicode code points are all those in the range 0xd800 to 0xdfff |
| (the so-called "surrogate" code points). The check for these can be |
| disabled by the caller of pcre2_compile() by setting the option |
| PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in |
| UTF-8 and UTF-32 modes, because these values are not representable in |
| UTF-16. |
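| |
| As a rough sketch of these limits (not PCRE2's implementation), the |
| following Python function checks a value against the table above. The |
| function name and its parameters are hypothetical; allow_surrogates |
| stands in for PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. |
| |
```python
def escape_value_allowed(value, code_unit_bits, utf, allow_surrogates=False):
    """Return True if `value` is acceptable for an octal or hex escape.

    `code_unit_bits` is 8, 16, or 32; `utf` says whether a UTF mode is set.
    """
    if not utf:
        # Non-UTF modes: the value only has to fit in one code unit.
        return 0 <= value <= (1 << code_unit_bits) - 1
    if value < 0 or value > 0x10FFFF:
        return False
    if 0xD800 <= value <= 0xDFFF:
        # Surrogates cannot be represented in UTF-16, so the override
        # applies only to the 8-bit and 32-bit libraries.
        return allow_surrogates and code_unit_bits != 16
    return True
```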
| |
| Escape sequences in character classes |
| |
| All the sequences that define a single character value can be used both |
| inside and outside character classes. In addition, inside a character |
| class, \b is interpreted as the backspace character (hex 08). |
| |
| When not followed by an opening brace, \N is not allowed in a character |
| class. \B, \R, and \X are not special inside a character class. Like |
| other unrecognized alphabetic escape sequences, they cause an error. |
| Outside a character class, these sequences have different meanings. |
| |
| Unsupported escape sequences |
| |
| In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its |
| string handler and used to modify the case of following characters. By |
| default, PCRE2 does not support these escape sequences in patterns. |
| However, if either of the PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX op- |
| tions is set, \U matches a "U" character, and \u can be used to define |
| a character by code point, as described above. |
| |
| Absolute and relative backreferences |
| |
| The sequence \g followed by a signed or unsigned number, optionally en- |
| closed in braces, is an absolute or relative backreference. A named |
| backreference can be coded as \g{name}. Backreferences are discussed |
| later, following the discussion of parenthesized groups. |
| |
| Absolute and relative subroutine calls |
| |
| For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
| name or a number enclosed either in angle brackets or single quotes, is |
| an alternative syntax for referencing a capture group as a subroutine. |
| Details are discussed later. Note that \g{...} (Perl syntax) and |
| \g<...> (Oniguruma syntax) are not synonymous. The former is a backref- |
| erence; the latter is a subroutine call. |
| |
| Generic character types |
| |
| Another use of backslash is for specifying generic character types: |
| |
| \d any decimal digit |
| \D any character that is not a decimal digit |
| \h any horizontal white space character |
| \H any character that is not a horizontal white space character |
| \N any character that is not a newline |
| \s any white space character |
| \S any character that is not a white space character |
| \v any vertical white space character |
| \V any character that is not a vertical white space character |
| \w any "word" character |
| \W any "non-word" character |
| |
| The \N escape sequence has the same meaning as the "." metacharacter |
| when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change |
| the meaning of \N. Note that when \N is followed by an opening brace it |
| has a different meaning. See the section entitled "Non-printing charac- |
| ters" above for details. Perl also uses \N{name} to specify characters |
| by Unicode name; PCRE2 does not support this. |
| |
| Each pair of lower and upper case escape sequences partitions the com- |
| plete set of characters into two disjoint sets. Any given character |
| matches one, and only one, of each pair. The sequences can appear both |
| inside and outside character classes. They each match one character of |
| the appropriate type. If the current matching point is at the end of |
| the subject string, all of them fail, because there is no character to |
| match. |
| |
| The default \s characters are HT (9), LF (10), VT (11), FF (12), CR |
| (13), and space (32), which are defined as white space in the "C" lo- |
| cale. This list may vary if locale-specific matching is taking place. |
| For example, in some locales the "non-breaking space" character (\xA0) |
| is recognized as white space, and in others the VT character is not. |
| |
| A "word" character is an underscore or any character that is a letter |
| or digit. By default, the definition of letters and digits is con- |
| trolled by PCRE2's low-valued character tables, and may vary if locale- |
| specific matching is taking place (see "Locale support" in the pcre2api |
| page). For example, in a French locale such as "fr_FR" in Unix-like |
| systems, or "french" in Windows, some character codes greater than 127 |
| are used for accented letters, and these are then matched by \w. The |
| use of locales with Unicode is discouraged. |
| |
| By default, characters whose code points are greater than 127 never |
| match \d, \s, or \w, and always match \D, \S, and \W, although this may |
| be different for characters in the range 128-255 when locale-specific |
| matching is happening. These escape sequences retain their original |
| meanings from before Unicode support was available, mainly for effi- |
| ciency reasons. If the PCRE2_UCP option is set, the behaviour is |
| changed so that Unicode properties are used to determine character |
| types, as follows: |
| |
| \d any character that matches \p{Nd} (decimal digit) |
| \s any character that matches \p{Z} or \h or \v |
| \w any character that matches \p{L} or \p{N}, plus underscore |
| |
| The upper case escapes match the inverse sets of characters. Note that |
| \d matches only decimal digits, whereas \w matches any Unicode digit, |
| as well as any Unicode letter, and underscore. Note also that PCRE2_UCP |
| affects \b and \B, because they are defined in terms of \w and \W. |
| Matching these sequences is noticeably slower when PCRE2_UCP is set. |
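| |
| The difference between the default and PCRE2_UCP meanings of \d and \w |
| can be sketched in Python using the standard unicodedata module. This |
| is an illustration under assumptions: the function names and the ucp |
| parameter are hypothetical, locale-specific tables are ignored, and |
| Python's Unicode data may be a different version from PCRE2's. |
| |
```python
import unicodedata


def matches_d(ch, ucp=False):
    """Model \\d: ASCII digits by default, \\p{Nd} when ucp is True."""
    if ucp:
        return unicodedata.category(ch) == "Nd"
    return "0" <= ch <= "9"


def matches_w(ch, ucp=False):
    """Model \\w: ASCII letters, digits, and underscore by default;
    \\p{L} or \\p{N} plus underscore when ucp is True."""
    if ch == "_":
        return True
    if ucp:
        return unicodedata.category(ch)[0] in "LN"
    return ch.isascii() and ch.isalnum()
```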
| |
| The sequences \h, \H, \v, and \V, in contrast to the other sequences, |
| which match only ASCII characters by default, always match a specific |
| list of code points, whether or not PCRE2_UCP is set. The horizontal |
| space characters are: |
| |
| U+0009 Horizontal tab (HT) |
| U+0020 Space |
| U+00A0 Non-break space |
| U+1680 Ogham space mark |
| U+180E Mongolian vowel separator |
| U+2000 En quad |
| U+2001 Em quad |
| U+2002 En space |
| U+2003 Em space |
| U+2004 Three-per-em space |
| U+2005 Four-per-em space |
| U+2006 Six-per-em space |
| U+2007 Figure space |
| U+2008 Punctuation space |
| U+2009 Thin space |
| U+200A Hair space |
| U+202F Narrow no-break space |
| U+205F Medium mathematical space |
| U+3000 Ideographic space |
| |
| The vertical space characters are: |
| |
| U+000A Linefeed (LF) |
| U+000B Vertical tab (VT) |
| U+000C Form feed (FF) |
| U+000D Carriage return (CR) |
| U+0085 Next line (NEL) |
| U+2028 Line separator |
| U+2029 Paragraph separator |
| |
| In 8-bit, non-UTF-8 mode, only the characters with code points less |
| than 256 are relevant. |
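| |
| The two fixed lists above can be written down as Python sets. This is |
| only a transcription of the tables for experimentation; the names |
| HORIZONTAL_SPACE, VERTICAL_SPACE, matches_h, and matches_v are |
| hypothetical. |
| |
```python
# Code points matched by \h, copied from the list above.
HORIZONTAL_SPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))  # U+2000 to U+200A inclusive
    + [0x202F, 0x205F, 0x3000]
)

# Code points matched by \v, copied from the list above.
VERTICAL_SPACE = frozenset(
    [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]
)


def matches_h(ch):
    """Model \\h for a single character."""
    return ord(ch) in HORIZONTAL_SPACE


def matches_v(ch):
    """Model \\v for a single character."""
    return ord(ch) in VERTICAL_SPACE
```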
| |
| Newline sequences |
| |
| Outside a character class, by default, the escape sequence \R matches |
| any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent |
| to the following: |
| |
| (?>\r\n|\n|\x0b|\f|\r|\x85) |
| |
| This is an example of an "atomic group", details of which are given be- |
| low. This particular group matches either the two-character sequence |
| CR followed by LF, or one of the single characters LF (linefeed, |
| U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car- |
| riage return, U+000D), or NEL (next line, U+0085). Because this is an |
| atomic group, the two-character sequence is treated as a single unit |
| that cannot be split. |
| |
| In other modes, two additional characters whose code points are greater |
| than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
| rator, U+2029). Unicode support is not needed for these characters to |
| be recognized. |
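| |
| A minimal Python sketch of the default behaviour of \R follows. It is |
| a model, not PCRE2 code: the function name and the eight_bit_non_utf |
| parameter are hypothetical. Trying CRLF first and never falling back |
| to a lone CR mirrors the effect of the atomic group. |
| |
```python
def newline_sequence_length(subject, pos, eight_bit_non_utf=False):
    """Return the length of the \\R match at `pos`, or 0 if there is none."""
    if subject.startswith("\r\n", pos):
        # CRLF is taken as one unit and is never split.
        return 2
    if pos >= len(subject):
        return 0
    singles = "\n\x0b\f\r\x85"
    if not eight_bit_non_utf:
        # LS and PS are added in every mode except 8-bit non-UTF.
        singles += "\u2028\u2029"
    return 1 if subject[pos] in singles else 0
```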
| |
| It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
| the complete set of Unicode line endings) by setting the option |
| PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for "back- |
| slash R".) This can be made the default when PCRE2 is built; if this is |
| the case, the other behaviour can be requested via the PCRE2_BSR_UNI- |
| CODE option. It is also possible to specify these settings by starting |
| a pattern string with one of the following sequences: |
| |
| (*BSR_ANYCRLF) CR, LF, or CRLF only |
| (*BSR_UNICODE) any Unicode newline sequence |
| |
| These override the default and the options given to the compiling func- |
| tion. Note that these special settings, which are not Perl-compatible, |
| are recognized only at the very start of a pattern, and that they must |
| be in upper case. If more than one of them is present, the last one is |
| used. They can be combined with a change of newline convention; for ex- |
| ample, a pattern can start with: |
| |
| (*ANY)(*BSR_ANYCRLF) |
| |
| They can also be combined with the (*UTF) or (*UCP) special sequences. |
| Inside a character class, \R is treated as an unrecognized escape se- |
| quence, and causes an error. |
| |
| Unicode character properties |
| |
| When PCRE2 is built with Unicode support (the default), three addi- |
| tional escape sequences that match characters with specific properties |
| are available. They can be used in any mode, though in 8-bit and 16-bit |
| non-UTF modes these sequences are of course limited to testing charac- |
| ters whose code points are less than U+0100 and U+10000, respectively. |
| In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode |
| limit) may be encountered. These are all treated as being in the Un- |
| known script and with an unassigned type. |
| |
| Matching characters by Unicode property is not fast, because PCRE2 has |
| to do a multistage table lookup in order to find a character's prop- |
| erty. That is why the traditional escape sequences such as \d and \w do |
| not use Unicode properties in PCRE2 by default, though you can make |
| them do so by setting the PCRE2_UCP option or by starting the pattern |
| with (*UCP). |
| |
| The extra escape sequences that provide property support are: |
| |
| \p{xx} a character with the xx property |
| \P{xx} a character without the xx property |
| \X a Unicode extended grapheme cluster |
| |
| The property names represented by xx above are not case-sensitive, and |
| in accordance with Unicode's "loose matching" rules, spaces, hyphens, |
| and underscores are ignored. There is support for Unicode script names, |
| Unicode general category properties, "Any", which matches any character |
| (including newline), Bidi_Class, a number of binary (yes/no) proper- |
| ties, and some special PCRE2 properties (described below). Certain |
| other Perl properties such as "InMusicalSymbols" are not supported by |
| PCRE2. Note that \P{Any} does not match any characters, so always |
| causes a match failure. |
| |
| Script properties for \p and \P |
| |
| There are three different syntax forms for matching a script. Each Uni- |
| code character has a basic script and, optionally, a list of other |
| scripts ("Script Extensions") with which it is commonly used. Using the |
| Adlam script as an example, \p{sc:Adlam} matches characters whose basic |
| script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters |
| that have Adlam in their extensions list. The full names "script" and |
| "script extensions" for the property types are recognized, and an equals |
| sign is an alternative to the colon. If a script name is given without |
| a property type, for example, \p{Adlam}, it is treated as \p{scx:Ad- |
| lam}. Perl changed to this interpretation at release 5.26 and PCRE2 |
| changed at release 10.40. |
| |
| Unassigned characters (and in non-UTF 32-bit mode, characters with code |
| points greater than 0x10FFFF) are assigned the "Unknown" script. Others |
| that are not part of an identified script are lumped together as "Com- |
| mon". The current list of recognized script names and their 4-character |
| abbreviations can be obtained by running this command: |
| |
| pcre2test -LS |
| |
| |
| The general category property for \p and \P |
| |
| Each character has exactly one Unicode general category property, spec- |
| ified by a two-letter abbreviation. For compatibility with Perl, nega- |
| tion can be specified by including a circumflex between the opening |
| brace and the property name. For example, \p{^Lu} is the same as |
| \P{Lu}. |
| |
| If only one letter is specified with \p or \P, it includes all the gen- |
| eral category properties that start with that letter. In this case, in |
| the absence of negation, the curly brackets in the escape sequence are |
| optional; these two examples have the same effect: |
| |
| \p{L} |
| \pL |
| |
| The following general category property codes are supported: |
| |
| C Other |
| Cc Control |
| Cf Format |
| Cn Unassigned |
| Co Private use |
| Cs Surrogate |
| |
| L Letter |
| Ll Lower case letter |
| Lm Modifier letter |
| Lo Other letter |
| Lt Title case letter |
| Lu Upper case letter |
| |
| M Mark |
| Mc Spacing mark |
| Me Enclosing mark |
| Mn Non-spacing mark |
| |
| N Number |
| Nd Decimal number |
| Nl Letter number |
| No Other number |
| |
| P Punctuation |
| Pc Connector punctuation |
| Pd Dash punctuation |
| Pe Close punctuation |
| Pf Final punctuation |
| Pi Initial punctuation |
| Po Other punctuation |
| Ps Open punctuation |
| |
| S Symbol |
| Sc Currency symbol |
| Sk Modifier symbol |
| Sm Mathematical symbol |
| So Other symbol |
| |
| Z Separator |
| Zl Line separator |
| Zp Paragraph separator |
| Zs Space separator |
| |
| The special property LC, which has the synonym L&, is also supported: |
| it matches a character that has the Lu, Ll, or Lt property, in other |
| words, a letter that is not classified as a modifier or "other". |
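| |
| The way one-letter names, two-letter names, negation, and LC relate |
| to each other can be sketched with Python's unicodedata module. The |
| function name is hypothetical, the comparison is case-sensitive for |
| simplicity, and Python's Unicode data may differ in version from the |
| tables built into PCRE2. |
| |
```python
import unicodedata


def has_general_category(ch, prop):
    """Model \\p{prop} for general category names such as Lu, L, ^Lu, LC."""
    negated = prop.startswith("^")
    if negated:
        prop = prop[1:]
    category = unicodedata.category(ch)  # always a two-letter code
    if prop in ("LC", "L&"):
        result = category in ("Lu", "Ll", "Lt")
    else:
        # A one-letter name covers every category that starts with it;
        # a two-letter name has to match exactly.
        result = category.startswith(prop)
    return result != negated
```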
| |
| The Cs (Surrogate) property applies only to characters whose code |
| points are in the range U+D800 to U+DFFF. These characters are no dif- |
| ferent to any other character when PCRE2 is not in UTF mode (using the |
| 16-bit or 32-bit library). However, they are not valid in Unicode |
| strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid- |
| ity checking has been turned off (see the discussion of |
| PCRE2_NO_UTF_CHECK in the pcre2api page). |
| |
| The long synonyms for property names that Perl supports (such as |
| \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix |
| any of these properties with "Is". |
| |
| No character that is in the Unicode table has the Cn (unassigned) prop- |
| erty. Instead, this property is assumed for any code point that is not |
| in the Unicode table. |
| |
| Specifying caseless matching does not affect these escape sequences. |
| For example, \p{Lu} always matches only upper case letters. This is |
| different from the behaviour of current versions of Perl. |
| |
| Binary (yes/no) properties for \p and \P |
| |
| Unicode defines a number of binary properties, that is, properties |
| whose only values are true or false. You can obtain a list of those |
| that are recognized by \p and \P, along with their abbreviations, by |
| running this command: |
| |
| pcre2test -LP |
| |
| |
| The Bidi_Class property for \p and \P |
| |
| \p{Bidi_Class:<class>} matches a character with the given class |
| \p{BC:<class>} matches a character with the given class |
| |
| The recognized classes are: |
| |
| AL Arabic letter |
| AN Arabic number |
| B paragraph separator |
| BN boundary neutral |
| CS common separator |
| EN European number |
| ES European separator |
| ET European terminator |
| FSI first strong isolate |
| L left-to-right |
| LRE left-to-right embedding |
| LRI left-to-right isolate |
| LRO left-to-right override |
| NSM non-spacing mark |
| ON other neutral |
| PDF pop directional format |
| PDI pop directional isolate |
| R right-to-left |
| RLE right-to-left embedding |
| RLI right-to-left isolate |
| RLO right-to-left override |
| S segment separator |
| WS white space |
| |
| An equals sign may be used instead of a colon. The class names are |
| case-insensitive; only the short names listed above are recognized. |
| |
| Extended grapheme clusters |
| |
| The \X escape matches any number of Unicode characters that form an |
| "extended grapheme cluster", and treats the sequence as an atomic group |
| (see below). Unicode supports various kinds of composite character by |
| giving each character a grapheme breaking property, and having rules |
| that use these properties to define the boundaries of extended grapheme |
| clusters. The rules are defined in Unicode Standard Annex 29, "Unicode |
| Text Segmentation". Unicode 11.0.0 abandoned the use of some previous |
| properties that had been used for emojis. Instead it introduced vari- |
| ous emoji-specific properties. PCRE2 uses only the Extended Picto- |
| graphic property. |
| |
| \X always matches at least one character. Then it decides whether to |
| add additional characters according to the following rules for ending a |
| cluster: |
| |
| 1. End at the end of the subject string. |
| |
| 2. Do not end between CR and LF; otherwise end after any control char- |
| acter. |
| |
| 3. Do not break Hangul (a Korean script) syllable sequences. Hangul |
| characters are of five types: L, V, T, LV, and LVT. An L character may |
| be followed by an L, V, LV, or LVT character; an LV or V character may |
| be followed by a V or T character; an LVT or T character may be fol- |
| lowed only by a T character. |
| |
| 4. Do not end before extending characters or spacing marks or the |
| "zero-width joiner" character. Characters with the "mark" property al- |
| ways have the "extend" grapheme breaking property. |
| |
| 5. Do not end after prepend characters. |
| |
| 6. Do not break within emoji modifier sequences or emoji zwj sequences. |
| That is, do not break between characters with the Extended_Pictographic |
| property. Extend and ZWJ characters are allowed between the charac- |
| ters. |
| |
| 7. Do not break within emoji flag sequences. That is, do not break be- |
| tween regional indicator (RI) characters if there are an odd number of |
| RI characters before the break point. |
| |
| 8. Otherwise, end the cluster. |
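| |
| Rules 3 and 7 lend themselves to a small Python sketch. It covers |
| only those two rules and takes the Hangul syllable types as given |
| rather than deriving them from code points; all names are |
| hypothetical and this is not how PCRE2 itself is coded. |
| |
```python
# Rule 3: which Hangul syllable types may follow each type without a break.
HANGUL_MAY_FOLLOW = {
    "L": {"L", "V", "LV", "LVT"},
    "LV": {"V", "T"},
    "V": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}


def hangul_continues(prev_type, next_type):
    """Return True if no break is allowed between the two Hangul types."""
    return next_type in HANGUL_MAY_FOLLOW.get(prev_type, set())


def regional_indicator_clusters(count):
    """Rule 7: split a run of `count` RI characters into cluster sizes."""
    clusters = []
    for ri_before in range(count):
        if ri_before % 2 == 1:
            clusters[-1] += 1   # odd number of RIs before: do not break
        else:
            clusters.append(1)  # even number before: a break is allowed
    return clusters
```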
| |
| PCRE2's additional properties |
| |
| As well as the standard Unicode properties described above, PCRE2 sup- |
| ports four more that make it possible to convert traditional escape se- |
| quences such as \w and \s to use Unicode properties. PCRE2 uses these |
| non-standard, non-Perl properties internally when PCRE2_UCP is set. |
| However, they may also be used explicitly. These properties are: |
| |
| Xan Any alphanumeric character |
| Xps Any POSIX space character |
| Xsp Any Perl space character |
| Xwd Any Perl "word" character |
| |
| Xan matches characters that have either the L (letter) or the N (num- |
| ber) property. Xps matches the characters tab, linefeed, vertical tab, |
| form feed, or carriage return, and any other character that has the Z |
| (separator) property. Xsp is the same as Xps; in PCRE1 it used to ex- |
| clude vertical tab, for Perl compatibility, but Perl changed. Xwd |
| matches the same characters as Xan, plus underscore. |
| |
| There is another non-standard property, Xuc, which matches any charac- |
| ter that can be represented by a Universal Character Name in C++ and |
| other programming languages. These are the characters $, @, ` (grave |
| accent), and all characters with Unicode code points greater than or |
| equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that |
| most base (ASCII) characters are excluded. (Universal Character Names |
| are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit. |
| Note that the Xuc property does not match these sequences but the char- |
| acters that they represent.) |
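| |
| The definitions of these properties translate almost directly into |
| Python. The sketch below is illustrative only: the function names are |
| hypothetical, and Python's unicodedata tables may not be the same |
| Unicode version as PCRE2's. |
| |
```python
import unicodedata


def is_xan(ch):
    """Xan: a letter (L) or a number (N)."""
    return unicodedata.category(ch)[0] in "LN"


def is_xps(ch):
    """Xps: tab, LF, VT, FF, CR, or any separator (Z)."""
    return ch in "\t\n\x0b\f\r" or unicodedata.category(ch)[0] == "Z"


# Xsp is now the same as Xps.
is_xsp = is_xps


def is_xwd(ch):
    """Xwd: the same as Xan, plus underscore."""
    return ch == "_" or is_xan(ch)


def is_xuc(ch):
    """Xuc: $, @, grave accent, or U+00A0 and above except surrogates."""
    if ch in "$@`":
        return True
    code_point = ord(ch)
    return code_point >= 0xA0 and not 0xD800 <= code_point <= 0xDFFF
```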
| |
| Resetting the match start |
| |
| In normal use, the escape sequence \K causes any previously matched |
| characters not to be included in the final matched sequence that is re- |
| turned. For example, the pattern: |
| |
| foo\Kbar |
| |
| matches "foobar", but reports that it has matched "bar". \K does not |
| interact with anchoring in any way. The pattern: |
| |
| ^foo\Kbar |
| |
| matches only when the subject begins with "foobar" (in single line |
| mode), though it again reports the matched string as "bar". This fea- |
| ture is similar to a lookbehind assertion (described below). However, |
| in this case, the part of the subject before the real match does not |
| have to be of fixed length, as lookbehind assertions do. The use of \K |
| does not interfere with the setting of captured substrings. For exam- |
| ple, when the pattern |
| |
| (foo)\Kbar |
| |
| matches "foobar", the first substring is still set to "foo". |
| |
| From version 5.32.0 Perl forbids the use of \K in lookaround asser- |
| tions. From release 10.38 PCRE2 also forbids this by default. However, |
| the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option can be used when calling |
| pcre2_compile() to re-enable the previous behaviour. When this option |
| is set, \K is acted upon when it occurs inside positive assertions, but |
| is ignored in negative assertions. Note that when a pattern such as |
| (?=ab\K) matches, the reported start of the match can be greater than |
| the end of the match. Using \K in a lookbehind assertion at the start |
| of a pattern can also lead to odd effects. For example, consider this |
| pattern: |
| |
| (?<=\Kfoo)bar |
| |
| If the subject is "foobar", a call to pcre2_match() with a starting |
| offset of 3 succeeds and reports the matching string as "foobar", that |
| is, the start of the reported match is earlier than where the match |
| started. |
| |
| Simple assertions |
| |
| The final use of backslash is for certain simple assertions. An asser- |
| tion specifies a condition that has to be met at a particular point in |
| a match, without consuming any characters from the subject string. The |
| use of groups for more complicated assertions is described below. The |
| backslashed assertions are: |
| |
| \b matches at a word boundary |
| \B matches when not at a word boundary |
| \A matches at the start of the subject |
| \Z matches at the end of the subject |
| also matches before a newline at the end of the subject |
| \z matches only at the end of the subject |
| \G matches at the first matching position in the subject |
| |
| Inside a character class, \b has a different meaning; it matches the |
| backspace character. If any other of these assertions appears in a |
| character class, an "invalid escape sequence" error is generated. |
| |
| A word boundary is a position in the subject string where the current |
| character and the previous character do not both match \w or \W (i.e. |
| one matches \w and the other matches \W), or the start or end of the |
| string if the first or last character matches \w, respectively. When |
| PCRE2 is built with Unicode support, the meanings of \w and \W can be |
| changed by setting the PCRE2_UCP option. When this is done, it also af- |
| fects \b and \B. Neither PCRE2 nor Perl has a separate "start of word" |
| or "end of word" metasequence. However, whatever follows \b normally |
| determines which it is. For example, the fragment \ba matches "a" at |
| the start of a word. |
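| |
| The definition of a word boundary can be expressed in a few lines of |
| Python. This sketch assumes the default meaning of \w (ASCII letters, |
| digits, and underscore, with no locale tables and no PCRE2_UCP); the |
| function names are hypothetical. |
| |
```python
def is_word_char(ch):
    """Model the default \\w: ASCII letters, digits, and underscore."""
    return ch == "_" or (ch.isascii() and ch.isalnum())


def at_word_boundary(subject, pos):
    """Return True if \\b would match at offset `pos` in `subject`."""
    # Outside the string there is no character, which counts as non-word.
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    # \b matches when exactly one side is a word character.
    return before != after
```
| |
| \B is simply the negation of at_word_boundary() at the same offset. |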
| |
| The \A, \Z, and \z assertions differ from the traditional circumflex |
| and dollar (described in the next section) in that they only ever match |
| at the very start and end of the subject string, whatever options are |
| set. Thus, they are independent of multiline mode. These three asser- |
| tions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL options, |
| which affect only the behaviour of the circumflex and dollar metachar- |
| acters. However, if the startoffset argument of pcre2_match() is non- |
| zero, indicating that matching is to start at a point other than the |
| beginning of the subject, \A can never match. The difference between |
| \Z and \z is that \Z matches before a newline at the end of the string |
| as well as at the very end, whereas \z matches only at the end. |
| |
| The \G assertion is true only when the current matching position is at |
| the start point of the matching process, as specified by the startoff- |
| set argument of pcre2_match(). It differs from \A when the value of |
| startoffset is non-zero. By calling pcre2_match() multiple times with |
| appropriate arguments, you can mimic Perl's /g option, and it is in |
| this kind of implementation where \G can be useful. |
| |
| Note, however, that PCRE2's implementation of \G, being true at the |
| starting character of the matching process, is subtly different from |
| Perl's, which defines it as true at the end of the previous match. In |
| Perl, these can be different when the previously matched string was |
| empty. Because PCRE2 does just one match at a time, it cannot reproduce |
| this behaviour. |
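| |
| The idea of repeated matching from a moving start point can be shown |
| by analogy with Python's re module, which is not PCRE2. In Python, |
| Pattern.match(string, pos) succeeds only if the match begins exactly |
| at pos, which resembles a PCRE2 pattern whose alternatives all begin |
| with \G when startoffset is advanced between calls. The function name |
| is hypothetical. |
| |
```python
import re


def match_all_anchored(pattern, subject):
    """Collect consecutive matches, each starting where the last ended."""
    compiled = re.compile(pattern)
    results = []
    offset = 0
    while offset <= len(subject):
        m = compiled.match(subject, offset)
        if m is None:
            # No match at the current start point: stop, do not skip ahead.
            break
        results.append(m.group())
        # Step past an empty match so that the loop always advances.
        offset = m.end() if m.end() > offset else offset + 1
    return results
```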
| |
| If all the alternatives of a pattern begin with \G, the expression is |
| anchored to the starting match position, and the "anchored" flag is set |
| in the compiled regular expression. |
| |
| |
| CIRCUMFLEX AND DOLLAR |
| |
| The circumflex and dollar metacharacters are zero-width assertions. |
| That is, they test for a particular condition being true without con- |
| suming any characters from the subject string. These two metacharacters |
| are concerned with matching the starts and ends of lines. If the new- |
| line convention is set so that only the two-character sequence CRLF is |
| recognized as a newline, isolated CR and LF characters are treated as |
| ordinary data characters, and are not recognized as newlines. |
| |
| Outside a character class, in the default matching mode, the circumflex |
| character is an assertion that is true only if the current matching |
| point is at the start of the subject string. If the startoffset argu- |
| ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum- |
| flex can never match if the PCRE2_MULTILINE option is unset. Inside a |
| character class, circumflex has an entirely different meaning (see be- |
| low). |
| |
| Circumflex need not be the first character of the pattern if a number |
| of alternatives are involved, but it should be the first thing in each |
| alternative in which it appears if the pattern is ever to match that |
| branch. If all possible alternatives start with a circumflex, that is, |
| if the pattern is constrained to match only at the start of the sub- |
| ject, it is said to be an "anchored" pattern. (There are also other |
| constructs that can cause a pattern to be anchored.) |
| |
| The dollar character is an assertion that is true only if the current |
| matching point is at the end of the subject string, or immediately be- |
| fore a newline at the end of the string (by default), unless PCRE2_NO- |
| TEOL is set. Note, however, that it does not actually match the new- |
| line. Dollar need not be the last character of the pattern if a number |
| of alternatives are involved, but it should be the last item in any |
| branch in which it appears. Dollar has no special meaning in a charac- |
| ter class. |
| |
| The meaning of dollar can be changed so that it matches only at the |
| very end of the string, by setting the PCRE2_DOLLAR_ENDONLY option at |
| compile time. This does not affect the \Z assertion. |
| |
| The meanings of the circumflex and dollar metacharacters are changed if |
| the PCRE2_MULTILINE option is set. When this is the case, a dollar |
| character matches before any newlines in the string, as well as at the |
| very end, and a circumflex matches immediately after internal newlines |
| as well as at the start of the subject string. It does not match after |
| a newline that ends the string, for compatibility with Perl. However, |
| this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option. |
| |
| For example, the pattern /^abc$/ matches the subject string "def\nabc" |
| (where \n represents a newline) in multiline mode, but not otherwise. |
| Consequently, patterns that are anchored in single line mode because |
| all branches start with ^ are not anchored in multiline mode, and a |
| match for circumflex is possible when the startoffset argument of |
| pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored |
| if PCRE2_MULTILINE is set. |
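| |
| The following Python sketch models where circumflex and dollar can |
| match. It is a simplification under stated assumptions: LF is the |
| only newline, startoffset is zero, and PCRE2_NOTBOL and PCRE2_NOTEOL |
| are not set. The function and parameter names are hypothetical; the |
| boolean parameters stand in for the PCRE2 options of similar name. |
| |
```python
def circumflex_matches(subject, pos, multiline=False, alt_circumflex=False):
    """Return True if ^ would match at offset `pos`."""
    if pos == 0:
        return True
    if not multiline or subject[pos - 1] != "\n":
        return False
    # After a newline that ends the string, ^ matches only with
    # the equivalent of PCRE2_ALT_CIRCUMFLEX.
    return pos < len(subject) or alt_circumflex


def dollar_matches(subject, pos, multiline=False, dollar_endonly=False):
    """Return True if $ would match at offset `pos`."""
    if pos == len(subject):
        return True
    if subject[pos] != "\n":
        return False
    if multiline:
        # Before any newline; dollar_endonly is ignored in this mode.
        return True
    # Single line mode: only before a newline that ends the string.
    return pos == len(subject) - 1 and not dollar_endonly
```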
| |
| When the newline convention (see "Newline conventions" above) |
| recognizes the two-character sequence CRLF as a newline, this is preferred, |
| even if the single characters CR and LF are also recognized as new- |
| lines. For example, if the newline convention is "any", a multiline |
| mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather |
| than after CR, even though CR on its own is a valid newline. (It also |
| matches at the very start of the string, of course.) |
| |
| Note that the sequences \A, \Z, and \z can be used to match the start |
| and end of the subject in both modes, and if all branches of a pattern |
| start with \A it is always anchored, whether or not PCRE2_MULTILINE is |
| set. |
| |
| |
| FULL STOP (PERIOD, DOT) AND \N |
| |
| Outside a character class, a dot in the pattern matches any one charac- |
| ter in the subject string except (by default) a character that signi- |
| fies the end of a line. One or more characters may be specified as line |
| terminators (see "Newline conventions" above). |
| |
| Dot never matches a single line-ending character. When the two-charac- |
| ter sequence CRLF is the only line ending, dot does not match CR if it |
| is immediately followed by LF, but otherwise it matches all characters |
| (including isolated CRs and LFs). When ANYCRLF is selected for line |
| endings, no occurrences of CR or LF match dot. When all Unicode line |
| endings are being recognized, dot does not match CR or LF or any of the |
| other line ending characters. |
| |
| The behaviour of dot with regard to newlines can be changed. If the |
| PCRE2_DOTALL option is set, a dot matches any one character, without |
| exception. If the two-character sequence CRLF is present in the sub- |
| ject string, it takes two dots to match it. |
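| |
| How dot depends on the newline convention can be sketched in Python. |
| This is a model of the description above, not PCRE2 code. The |
| function name, the convention strings, and the dotall parameter are |
| hypothetical, and the set used for "ANY" is assumed to be the same |
| single characters that \R recognizes. |
| |
```python
# Assumed set of single-character Unicode line endings for "ANY".
UNICODE_LINE_ENDINGS = "\n\x0b\f\r\x85\u2028\u2029"


def dot_matches(subject, pos, newline="LF", dotall=False):
    """Return True if a dot would match the character at `pos`."""
    if pos >= len(subject):
        return False
    if dotall:
        return True  # any one character, without exception
    ch = subject[pos]
    if newline == "LF":
        return ch != "\n"
    if newline == "CR":
        return ch != "\r"
    if newline == "CRLF":
        # Only a CR that is immediately followed by LF is refused.
        return not subject.startswith("\r\n", pos)
    if newline == "ANYCRLF":
        return ch not in "\r\n"
    if newline == "ANY":
        return ch not in UNICODE_LINE_ENDINGS
    raise ValueError("unknown newline convention: " + newline)
```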
| |
| The handling of dot is entirely independent of the handling of circum- |
| flex and dollar, the only relationship being that they both involve |
| newlines. Dot has no special meaning in a character class. |
| |
| The escape sequence \N when not followed by an opening brace behaves |
| like a dot, except that it is not affected by the PCRE2_DOTALL option. |
| In other words, it matches any character except one that signifies the |
| end of a line. |
| |
| When \N is followed by an opening brace it has a different meaning. See |
| the section entitled "Non-printing characters" above for details. Perl |
| also uses \N{name} to specify characters by Unicode name; PCRE2 does |
| not support this. |
| |
| |
| MATCHING A SINGLE CODE UNIT |
| |
| Outside a character class, the escape sequence \C matches any one code |
| unit, whether or not a UTF mode is set. In the 8-bit library, one code |
| unit is one byte; in the 16-bit library it is a 16-bit unit; in the |
| 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches |
| line-ending characters. The feature is provided in Perl in order to |
| match individual bytes in UTF-8 mode, but it is unclear how it can use- |
| fully be used. |
| |
| Because \C breaks up characters into individual code units, matching |
| one unit with \C in UTF-8 or UTF-16 mode means that the rest of the |
| string may start with a malformed UTF character. This has undefined re- |
| sults, because PCRE2 assumes that it is matching character by character |
| in a valid UTF string (by default it checks the subject string's valid- |
| ity at the start of processing unless the PCRE2_NO_UTF_CHECK or |
| PCRE2_MATCH_INVALID_UTF option is used). |
| |
| An application can lock out the use of \C by setting the |
| PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also |
| possible to build PCRE2 with the use of \C permanently disabled. |
| |
| PCRE2 does not allow \C to appear in lookbehind assertions (described |
| below) in UTF-8 or UTF-16 modes, because this would make it impossible |
| to calculate the length of the lookbehind. Neither the alternative |
| matching function pcre2_dfa_match() nor the JIT optimizer supports \C in |
| these UTF modes. The former gives a match-time error; the latter fails |
| to optimize and so the match is always run using the interpreter. |
| |
| In the 32-bit library, however, \C is always supported (when not ex- |
| plicitly locked out) because it always matches a single code unit, |
| whether or not UTF-32 is specified. |
| |
| In general, the \C escape sequence is best avoided. However, one way of |
| using it that avoids the problem of malformed UTF-8 or UTF-16 charac- |
| ters is to use a lookahead to check the length of the next character, |
| as in this pattern, which could be used with a UTF-8 string (ignore |
| white space and line breaks): |
| |
| (?| (?=[\x00-\x7f])(\C) | |
| (?=[\x80-\x{7ff}])(\C)(\C) | |
| (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) | |
| (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C)) |
| |
| In this example, a group that starts with (?| resets the capturing |
| parentheses numbers in each alternative (see "Duplicate Group Numbers" |
| below). The assertions at the start of each branch check the next UTF-8 |
| character for values whose encoding uses 1, 2, 3, or 4 bytes, respec- |
| tively. The character's individual bytes are then captured by the ap- |
| propriate number of \C groups. |
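| |
| The byte counts selected by the four lookaheads can be checked outside |
| PCRE2. What follows is a minimal Python sketch, not part of the PCRE2 |
| API; the names RANGES, utf8_length, BOUNDARIES, and CHECKS are |
| illustrative assumptions. It stops at U+10FFFF, the highest Unicode |
| code point, rather than at the \x{1fffff} limit used in the pattern. |
| |
```python
# Code point ranges and the number of UTF-8 bytes each one needs,
# mirroring the four lookahead branches of the pattern above.
RANGES = [
    (0x00, 0x7F, 1),
    (0x80, 0x7FF, 2),
    (0x800, 0xFFFF, 3),
    (0x10000, 0x10FFFF, 4),
]

def utf8_length(cp):
    """Return the UTF-8 encoded length, in bytes, of code point cp."""
    for low, high, nbytes in RANGES:
        if low <= cp <= high:
            return nbytes
    raise ValueError("not a Unicode code point: %r" % (cp,))

# Compare against Python's own UTF-8 encoder at each range boundary.
BOUNDARIES = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
CHECKS = [(cp, utf8_length(cp), len(chr(cp).encode("utf-8")))
          for cp in BOUNDARIES]
```
| |
| Each entry in CHECKS holds a code point, the length predicted from the |
| ranges, and the length produced by the encoder; the last two agree. |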
| |
| |
| SQUARE BRACKETS AND CHARACTER CLASSES |
| |
| An opening square bracket introduces a character class, terminated by a |
| closing square bracket. A closing square bracket on its own is not spe- |
| cial by default. If a closing square bracket is required as a member |
| of the class, it should be the first data character in the class (after |
| an initial circumflex, if present) or escaped with a backslash. This |
| means that, by default, an empty class cannot be defined. However, if |
| the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at |
| the start does end the (empty) class. |
| |
| A character class matches a single character in the subject. A matched |
| character must be in the set of characters defined by the class, unless |
| the first character in the class definition is a circumflex, in which |
| case the subject character must not be in the set defined by the class. |
| If a circumflex is actually required as a member of the class, ensure |
| it is not the first character, or escape it with a backslash. |
| |
| For example, the character class [aeiou] matches any lower case vowel, |
| while [^aeiou] matches any character that is not a lower case vowel. |
| Note that a circumflex is just a convenient notation for specifying the |
| characters that are in the class by enumerating those that are not. A |
| class that starts with a circumflex is not an assertion; it still con- |
| sumes a character from the subject string, and therefore it fails if |
| the current pointer is at the end of the string. |
| |
| Characters in a class may be specified by their code points using \o, |
| \x, or \N{U+hh..} in the usual way. When caseless matching is set, any |
| letters in a class represent both their upper case and lower case ver- |
| sions, so for example, a caseless [aeiou] matches "A" as well as "a", |
| and a caseless [^aeiou] does not match "A", whereas a caseful version |
| would. Note that there are two ASCII characters, K and S, that, in ad- |
| dition to their lower case ASCII equivalents, are case-equivalent with |
| Unicode U+212A (Kelvin sign) and U+017F (long S) respectively when ei- |
| ther PCRE2_UTF or PCRE2_UCP is set. |
| |
| Characters that might indicate line breaks are never treated in any |
| special way when matching character classes, whatever line-ending se- |
| quence is in use, and whatever setting of the PCRE2_DOTALL and |
| PCRE2_MULTILINE options is used. A class such as [^a] always matches |
| one of these characters. |
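| |
| Two of the points above can be tried in a short sketch: a negated |
| class still has to consume a character, and line-ending characters are |
| ordinary class members. The sketch uses Python's re module, which is a |
| different engine from PCRE2 but is assumed here to behave the same way |
| for these simple classes; the variable names are illustrative. |
| |
```python
import re

# A negated class is not an assertion: it needs a character to consume,
# so it cannot match at the end of the subject (here, an empty string).
at_end = re.search(r"[^aeiou]", "")

# Line-ending characters are ordinary members of a negated class,
# with or without the DOTALL and MULTILINE flags.
newline_plain = re.fullmatch(r"[^a]", "\n")
newline_flags = re.fullmatch(r"[^a]", "\n", re.DOTALL | re.MULTILINE)
```
| |
| Here at_end is None, while both newline matches succeed. |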
| |
| The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s, |
| \S, \v, \V, \w, and \W may appear in a character class, and add the |
| characters that they match to the class. For example, [\dABCDEF] |
| matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option af- |
| fects the meanings of \d, \s, \w and their upper case partners, just as |
| it does when they appear outside a character class, as described in the |
| section entitled "Generic character types" above. The escape sequence |
| \b has a different meaning inside a character class; it matches the |
| backspace character. The sequences \B, \R, and \X are not special in- |
| side a character class. Like any other unrecognized escape sequences, |
| they cause an error. The same is true for \N when not followed by an |
| opening brace. |
| |
| The minus (hyphen) character can be used to specify a range of charac- |
| ters in a character class. For example, [d-m] matches any letter be- |
| tween d and m, inclusive. If a minus character is required in a class, |
| it must be escaped with a backslash or appear in a position where it |
| cannot be interpreted as indicating a range, typically as the first or |
| last character in the class, or immediately after a range. For example, |
| [b-d-z] matches letters in the range b to d, a hyphen character, or z. |
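| |
| The [b-d-z] example can be tried directly. This is a sketch in |
| Python, whose re module is not PCRE2 but parses this particular class |
| the same way; the names cls and matched are illustrative. |
| |
```python
import re

# After the range b-d is complete, the next hyphen cannot start a new
# range, so it is a literal; "z" is then a literal as well.
cls = re.compile(r"[b-d-z]")

# Keep only the sample characters that the class accepts.
matched = [ch for ch in "abcde-yz" if cls.fullmatch(ch)]
```
| |
| The resulting list is ['b', 'c', 'd', '-', 'z']. |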
| |
| Perl treats a hyphen as a literal if it appears before or after a POSIX |
| class (see below) or before or after a character type escape such as |
| \d or \H. However, unless the hyphen is the last character in the |
| class, Perl outputs a warning in its warning mode, as this is most |
| likely a user error. As PCRE2 has no facility for warning, an error is |
| given in these cases. |
| |
| It is not possible to have the literal character "]" as the end charac- |
| ter of a range. A pattern such as [W-]46] is interpreted as a class of |
| two characters ("W" and "-") followed by a literal string "46]", so it |
| would match "W46]" or "-46]". However, if the "]" is escaped with a |
| backslash it is interpreted as the end of range, so [W-\]46] is inter- |
| preted as a class containing a range followed by two other characters. |
| The octal or hexadecimal representation of "]" can also be used to end |
| a range. |
| |
| Ranges normally include all code points between the start and end char- |
| acters, inclusive. They can also be used for code points specified nu- |
| merically, for example [\000-\037]. Ranges can include any characters |
| that are valid for the current mode. In any UTF mode, the so-called |
| "surrogate" characters (those whose code points lie between 0xd800 and |
| 0xdfff inclusive) may not be specified explicitly by default (the |
| PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this check). How- |
| ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates, |
| are always permitted. |
| |
| There is a special case in EBCDIC environments for ranges whose end |
| points are both specified as literal letters in the same case. For com- |
| patibility with Perl, EBCDIC code points within the range that are not |
| letters are omitted. For example, [h-k] matches only four characters, |
| even though the codes for h and k are 0x88 and 0x92, a range of 11 code |
| points. However, if the range is specified numerically, for example, |
| [\x88-\x92] or [h-\x92], all code points are included. |
| |
| If a range that includes letters is used when caseless matching is set, |
| it matches the letters in either case. For example, [W-c] is equivalent |
| to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if |
| character tables for a French locale are in use, [\xc8-\xcb] matches |
| accented E characters in both cases. |
| |
| A circumflex can conveniently be used with the upper case character |
| types to specify a more restricted set of characters than the matching |
| lower case type. For example, the class [^\W_] matches any letter or |
| digit, but not underscore, whereas [\w] includes underscore. A positive |
| character class should be read as "something OR something OR ..." and a |
| negative class as "NOT something AND NOT something AND NOT ...". |
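| |
| The "NOT something AND NOT something" reading of [^\W_] can be checked |
| with a small sketch. It is written for Python's re module rather than |
| PCRE2, on the assumption that the two agree for ASCII subjects; the |
| re.ASCII flag restricts \w to ASCII letters, digits, and underscore. |
| |
```python
import re

# [^\W_] is "NOT a non-word character AND NOT an underscore",
# which leaves letters and digits.
alnum_only = re.compile(r"[^\W_]", re.ASCII)
word_class = re.compile(r"[\w]", re.ASCII)

sample = "a7_-"
in_alnum_only = [ch for ch in sample if alnum_only.fullmatch(ch)]
in_word_class = [ch for ch in sample if word_class.fullmatch(ch)]
```
| |
| The first list is ['a', '7']; the second also contains '_'. |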
| |
| The only metacharacters that are recognized in character classes are |
| backslash, hyphen (only where it can be interpreted as specifying a |
| range), circumflex (only at the start), opening square bracket (only |
| when it can be interpreted as introducing a POSIX class name, or for a |
| special compatibility feature - see the next two sections), and the |
| terminating closing square bracket. However, escaping other non-al- |
| phanumeric characters does no harm. |
| |
| |
| POSIX CHARACTER CLASSES |
| |
| Perl supports the POSIX notation for character classes. This uses names |
| enclosed by [: and :] within the enclosing square brackets. PCRE2 also |
| supports this notation. For example, |
| |
| [01[:alpha:]%] |
| |
| matches "0", "1", any alphabetic character, or "%". The supported class |
| names are: |
| |
| alnum    letters and digits |
| alpha    letters |
| ascii    character codes 0 - 127 |
| blank    space or tab only |
| cntrl    control characters |
| digit    decimal digits (same as \d) |
| graph    printing characters, excluding space |
| lower    lower case letters |
| print    printing characters, including space |
| punct    printing characters, excluding letters and digits and space |
| space    white space (the same as \s from PCRE2 8.34) |
| upper    upper case letters |
| word     "word" characters (same as \w) |
| xdigit   hexadecimal digits |
| |
| The default "space" characters are HT (9), LF (10), VT (11), FF (12), |
| CR (13), and space (32). If locale-specific matching is taking place, |
| the list of space characters may be different; there may be fewer or |
| more of them. "Space" and \s match the same set of characters. |
| |
| The name "word" is a Perl extension, and "blank" is a GNU extension |
| from Perl 5.8. Another Perl extension is negation, which is indicated |
| by a ^ character after the colon. For example, |
| |
| [12[:^digit:]] |
| |
| matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the |
| POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but |
| these are not supported, and an error is given if they are encountered. |
| |
| By default, characters with values greater than 127 do not match any of |
| the POSIX character classes, although this may be different for charac- |
| ters in the range 128-255 when locale-specific matching is happening. |
| However, if the PCRE2_UCP option is passed to pcre2_compile(), some of |
| the classes are changed so that Unicode character properties are used. |
| This is achieved by replacing certain POSIX classes with other se- |
| quences, as follows: |
| |
| [:alnum:] becomes \p{Xan} |
| [:alpha:] becomes \p{L} |
| [:blank:] becomes \h |
| [:cntrl:] becomes \p{Cc} |
| [:digit:] becomes \p{Nd} |
| [:lower:] becomes \p{Ll} |
| [:space:] becomes \p{Xps} |
| [:upper:] becomes \p{Lu} |
| [:word:] becomes \p{Xwd} |
| |
| Negated versions, such as [:^alpha:] use \P instead of \p. Three other |
| POSIX classes are handled specially in UCP mode: |
| |
| [:graph:] This matches characters that have glyphs that mark the page |
| when printed. In Unicode property terms, it matches all char- |
| acters with the L, M, N, P, S, or Cf properties, except for: |
| |
| U+061C Arabic Letter Mark |
| U+180E Mongolian Vowel Separator |
| U+2066 - U+2069 Various "isolate"s |
| |
| |
| [:print:] This matches the same characters as [:graph:] plus space |
| characters that are not controls, that is, characters with |
| the Zs property. |
| |
| [:punct:] This matches all characters that have the Unicode P (punctua- |
| tion) property, plus those characters with code points less |
| than 256 that have the S (Symbol) property. |
| |
| The other POSIX classes are unchanged, and match only characters with |
| code points less than 256. |
| |
| |
| COMPATIBILITY FEATURE FOR WORD BOUNDARIES |
| |
| In the POSIX.2 compliant library that was included in 4.4BSD Unix, the |
| ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word" |
| and "end of word". PCRE2 treats these items as follows: |
| |
| [[:<:]] is converted to \b(?=\w) |
| [[:>:]] is converted to \b(?<=\w) |
| |
| Only these exact character sequences are recognized. A sequence such as |
| [a[:<:]b] provokes an error for an unrecognized POSIX class name. This |
| support is not compatible with Perl. It is provided to help migrations |
| from other environments, and is best not used in any new patterns. Note |
| that \b matches at the start and the end of a word (see "Simple asser- |
| tions" above), and in a Perl-style pattern the preceding or following |
| character normally shows which is wanted, without the need for the as- |
| sertions that are used above in order to give exactly the POSIX behav- |
| iour. |
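| |
| The two replacement sequences can be exercised on their own. The |
| sketch below uses Python's re module, which does not recognize |
| [[:<:]] or [[:>:]] but does support \b and both assertions; it is an |
| illustration of the converted forms, not of PCRE2's conversion itself. |
| |
```python
import re

subject = "hi there"

# \b(?=\w): a word boundary with a word character after it.
word_starts = [m.start() for m in re.finditer(r"\b(?=\w)", subject)]

# \b(?<=\w): a word boundary with a word character before it.
word_ends = [m.start() for m in re.finditer(r"\b(?<=\w)", subject)]
```
| |
| The word starts are found at offsets 0 and 3, and the word ends at |
| offsets 2 and 8. |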
| |
| |
| VERTICAL BAR |
| |
| Vertical bar characters are used to separate alternative patterns. For |
| example, the pattern |
| |
| gilbert|sullivan |
| |
| matches either "gilbert" or "sullivan". Any number of alternatives may |
| appear, and an empty alternative is permitted (matching the empty |
| string). The matching process tries each alternative in turn, from left |
| to right, and the first one that succeeds is used. If the alternatives |
| are within a group (defined below), "succeeds" means matching the rest |
| of the main pattern as well as the alternative in the group. |
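| |
| The left-to-right rule, and the wider meaning of "succeeds" inside a |
| group, can be seen in a two-line sketch. It uses Python's re module as |
| a stand-in; that module is a separate backtracking engine which is |
| assumed to order alternatives the same way. |
| |
```python
import re

# Alternatives are tried left to right; "a" succeeds first, so the
# longer alternative "ab" is never tried here.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a group, "succeeds" includes the rest of the pattern: "a"
# cannot be followed by "c" here, so the matcher falls back to "ab".
with_rest = re.match(r"(a|ab)c", "abc").group(1)
```
| |
| The first match is "a"; the captured text in the second is "ab". |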
| |
| |
| INTERNAL OPTION SETTING |
| |
| The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, |
| PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options |
| can be changed from within the pattern by a sequence of letters en- |
| closed between "(?" and ")". These options are Perl-compatible, and |
| are described in detail in the pcre2api documentation. The option let- |
| ters are: |
| |
| i for PCRE2_CASELESS |
| m for PCRE2_MULTILINE |
| n for PCRE2_NO_AUTO_CAPTURE |
| s for PCRE2_DOTALL |
| x for PCRE2_EXTENDED |
| xx for PCRE2_EXTENDED_MORE |
| |
| For example, (?im) sets caseless, multiline matching. It is also possi- |
| ble to unset these options by preceding the relevant letters with a hy- |
| phen, for example (?-im). The two "extended" options are not indepen- |
| dent; unsetting either one cancels the effects of both of them. |
| |
| A combined setting and unsetting such as (?im-sx), which sets |
| PCRE2_CASELESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and |
| PCRE2_EXTENDED, is also permitted. Only one hyphen may appear in the |
| options string. If a letter appears both before and after the hyphen, |
| the option is unset. An empty options setting "(?)" is allowed. Need- |
| less to say, it has no effect. |
| |
| If the first character following (? is a circumflex, it causes all of |
| the above options to be unset. Thus, (?^) is equivalent to (?-imnsx). |
| Letters may follow the circumflex to cause some options to be re-in- |
| stated, but a hyphen may not appear. |
| |
| The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be |
| changed in the same way as the Perl-compatible options by using the |
| characters J and U respectively. However, these are not unset by (?^). |
| |
| When one of these option changes occurs at top level (that is, not in- |
| side group parentheses), the change applies to the remainder of the |
| pattern that follows. An option change within a group (see below for a |
| description of groups) affects only that part of the group that follows |
| it, so |
| |
| (a(?i)b)c |
| |
| matches abc and aBc and no other strings (assuming PCRE2_CASELESS is |
| not used). By this means, options can be made to have different set- |
| tings in different parts of the pattern. Any changes made in one alter- |
| native do carry on into subsequent branches within the same group. For |
| example, |
| |
| (a(?i)b|c) |
| |
| matches "ab", "aB", "c", and "C", even though when matching "C" the |
| first branch is abandoned before the option setting. This is because |
| the effects of option settings happen at compile time. There would be |
| some very weird behaviour otherwise. |
| |
| As a convenient shorthand, if any option settings are required at the |
| start of a non-capturing group (see the next section), the option let- |
| ters may appear between the "?" and the ":". Thus the two patterns |
| |
| (?i:saturday|sunday) |
| (?:(?i)saturday|sunday) |
| |
| match exactly the same set of strings. |
| |
| Note: There are other PCRE2-specific options, applying to the whole |
| pattern, which can be set by the application when the compiling func- |
| tion is called. In addition, the pattern can contain special leading |
| sequences such as (*CRLF) to override what the application has set or |
| what has been defaulted. Details are given in the section entitled |
| "Newline sequences" above. There are also the (*UTF) and (*UCP) leading |
| sequences that can be used to set UTF and Unicode property modes; they |
| are equivalent to setting the PCRE2_UTF and PCRE2_UCP options, respec- |
| tively. However, the application can set the PCRE2_NEVER_UTF and |
| PCRE2_NEVER_UCP options, which lock out the use of the (*UTF) and |
| (*UCP) sequences. |
| |
| |
| GROUPS |
| |
| Groups are delimited by parentheses (round brackets), which can be |
| nested. Turning part of a pattern into a group does two things: |
| |
| 1. It localizes a set of alternatives. For example, the pattern |
| |
| cat(aract|erpillar|) |
| |
| matches "cataract", "caterpillar", or "cat". Without the parentheses, |
| it would match "cataract", "erpillar" or an empty string. |
| |
| 2. It creates a "capture group". This means that, when the whole pat- |
| tern matches, the portion of the subject string that matched the group |
| is passed back to the caller, separately from the portion that matched |
| the whole pattern. (This applies only to the traditional matching |
| function; the DFA matching function does not support capturing.) |
| |
| Opening parentheses are counted from left to right (starting from 1) to |
| obtain numbers for capture groups. For example, if the string "the red |
| king" is matched against the pattern |
| |
| the ((red|white) (king|queen)) |
| |
| the captured substrings are "red king", "red", and "king", and are num- |
| bered 1, 2, and 3, respectively. |
| |
| The fact that plain parentheses fulfil two functions is not always |
| helpful. There are often times when grouping is required without cap- |
| turing. If an opening parenthesis is followed by a question mark and a |
| colon, the group does not do any capturing, and is not counted when |
| computing the number of any subsequent capture groups. For example, if |
| the string "the white queen" is matched against the pattern |
| |
| the ((?:red|white) (king|queen)) |
| |
| the captured substrings are "white queen" and "queen", and are numbered |
| 1 and 2. The maximum number of capture groups is 65535. |
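| |
| Both numbering examples can be reproduced in a short sketch. Python's |
| re module is used here instead of the PCRE2 API, on the assumption |
| that it numbers capturing and non-capturing groups in the same way. |
| |
```python
import re

# Capturing parentheses are numbered by their opening parenthesis.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
all_groups = capturing.groups()

# (?: ... ) groups without capturing and is skipped in the numbering.
mixed = re.match(r"the ((?:red|white) (king|queen))", "the white queen")
fewer_groups = mixed.groups()
```
| |
| all_groups is ('red king', 'red', 'king') and fewer_groups is |
| ('white queen', 'queen'). |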
| |
| As a convenient shorthand, if any option settings are required at the |
| start of a non-capturing group, the option letters may appear between |
| the "?" and the ":". Thus the two patterns |
| |
| (?i:saturday|sunday) |
| (?:(?i)saturday|sunday) |
| |
| match exactly the same set of strings. Because alternative branches are |
| tried from left to right, and options are not reset until the end of |
| the group is reached, an option setting in one branch does affect sub- |
| sequent branches, so the above patterns match "SUNDAY" as well as "Sat- |
| urday". |
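| |
| The scoped form can be tried in Python's re module, which is not |
| PCRE2 but accepts the same (?i: syntax. Recent Python versions allow |
| an unscoped (?i) only at the very start of a pattern, so only the |
| first of the two patterns is shown; the "x...y" pattern is a |
| hypothetical addition to show where the option stops applying. |
| |
```python
import re

# The option covers every branch of the group.
days = re.compile(r"(?i:saturday|sunday)")
both_branches = [s for s in ("Saturday", "SUNDAY", "monday")
                 if days.fullmatch(s)]

# The option applies only inside the group: "x" and "y" stay caseful.
scoped = re.compile(r"x(?i:day)y")
scoped_hits = [s for s in ("xdayy", "xDAYy", "XdayY")
               if scoped.fullmatch(s)]
```
| |
| both_branches is ['Saturday', 'SUNDAY'] and scoped_hits is |
| ['xdayy', 'xDAYy']. |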
| |
| |
| DUPLICATE GROUP NUMBERS |
| |
| Perl 5.10 introduced a feature whereby each alternative in a group uses |
| the same numbers for its capturing parentheses. Such a group starts |
| with (?| and is itself a non-capturing group. For example, consider |
| this pattern: |
| |
| (?|(Sat)ur|(Sun))day |
| |
| Because the two alternatives are inside a (?| group, both sets of cap- |
| turing parentheses are numbered one. Thus, when the pattern matches, |
| you can look at captured substring number one, whichever alternative |
| matched. This construct is useful when you want to capture part, but |
| not all, of one of a number of alternatives. Inside a (?| group, paren- |
| theses are numbered as usual, but the number is reset at the start of |
| each branch. The numbers of any capturing parentheses that follow the |
| whole group start after the highest number used in any branch. The fol- |
| lowing example is taken from the Perl documentation. The numbers under- |
| neath show in which buffer the captured content will be stored. |
| |
| # before  ---------------branch-reset----------- after |
| / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x |
| # 1            2         2  3        2     3     4 |
| |
| A backreference to a capture group uses the most recent value that is |
| set for the group. The following pattern matches "abcabc" or "defdef": |
| |
| /(?|(abc)|(def))\1/ |
| |
| In contrast, a subroutine call to a capture group always refers to the |
| first one in the pattern with the given number. The following pattern |
| matches "abcabc" or "defabc": |
| |
| /(?|(abc)|(def))(?1)/ |
| |
| A relative reference such as (?-1) is no different: it is just a conve- |
| nient way of computing an absolute group number. |
| |
| If a condition test for a group's having matched refers to a non-unique |
| number, the test is true if any group with that number has matched. |
| |
| An alternative approach to using this "branch reset" feature is to use |
| duplicate named groups, as described in the next section. |
| |
| |
| NAMED CAPTURE GROUPS |
| |
| Identifying capture groups by number is simple, but it can be very hard |
| to keep track of the numbers in complicated patterns. Furthermore, if |
| an expression is modified, the numbers may change. To help with this |
| difficulty, PCRE2 supports the naming of capture groups. This feature |
| was not added to Perl until release 5.10. Python had the feature ear- |
| lier, and PCRE1 introduced it at release 4.0, using the Python syntax. |
| PCRE2 supports both the Perl and the Python syntax. |
| |
| In PCRE2, a capture group can be named in one of three ways: |
| (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python. |
| Names may be up to 32 code units long. When PCRE2_UTF is not set, they |
| may contain only ASCII alphanumeric characters and underscores, but |
| must start with a non-digit. When PCRE2_UTF is set, the syntax of group |
| names is extended to allow any Unicode letter or Unicode decimal digit. |
| In other words, group names must match one of these patterns: |
| |
| ^[_A-Za-z][_A-Za-z0-9]*\z when PCRE2_UTF is not set |
| ^[_\p{L}][_\p{L}\p{Nd}]*\z when PCRE2_UTF is set |
| |
| References to capture groups from other parts of the pattern, such as |
| backreferences, recursion, and conditions, can all be made by name as |
| well as by number. |
| |
| Named capture groups are allocated numbers as well as names, exactly as |
| if the names were not present. In both PCRE2 and Perl, capture groups |
| are primarily identified by numbers; any names are just aliases for |
| these numbers. The PCRE2 API provides function calls for extracting the |
| complete name-to-number translation table from a compiled pattern, as |
| well as convenience functions for extracting captured substrings by |
| name. |
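| |
| The idea that names are aliases for numbers can be illustrated with |
| the Python syntax mentioned above. This sketch uses Python's re |
| module, where the name-to-number table is exposed as the groupindex |
| attribute; it does not show the PCRE2 functions, which are described |
| in the pcre2api documentation. The pattern itself is a made-up |
| example. |
| |
```python
import re

# The Python (?P<name>...) syntax, which PCRE2 also accepts.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Names are aliases for ordinary group numbers.
name_to_number = dict(date.groupindex)

# The same captured text is reachable by name or by number.
m = date.fullmatch("2024-05")
by_name = m.group("year")
by_number = m.group(1)
```
| |
| name_to_number is {'year': 1, 'month': 2}, and by_name and by_number |
| are both "2024". |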
| |
| Warning: When more than one capture group has the same number, as de- |
| scribed in the previous section, a name given to one of them applies to |
| all of them. Perl allows identically numbered groups to have different |
| names. Consider this pattern, where there are two capture groups, both |
| numbered 1: |
| |
| (?|(?<AA>aa)|(?<BB>bb)) |
| |
| Perl allows this, with both names AA and BB as aliases of group 1. |
| Thus, after a successful match, both names yield the same value (either |
| "aa" or "bb"). |
| |
| In an attempt to reduce confusion, PCRE2 does not allow the same group |
| number to be associated with more than one name. The example above pro- |
| vokes a compile-time error. However, there is still scope for confu- |
| sion. Consider this pattern: |
| |
| (?|(?<AA>aa)|(bb)) |
| |
| Although the second group number 1 is not explicitly named, the name AA |
| is still an alias for any group 1. Whether the pattern matches "aa" or |
| "bb", a reference by name to group AA yields the matched string. |
| |
| By default, a name must be unique within a pattern, except that dupli- |
| cate names are permitted for groups with the same number, for example: |
| |
| (?|(?<AA>aa)|(?<AA>bb)) |
| |
| The duplicate name constraint can be disabled by setting the PCRE2_DUP- |
| NAMES option at compile time, or by the use of (?J) within the pattern, |
| as described in the section entitled "Internal Option Setting" above. |
| |
| Duplicate names can be useful for patterns where only one instance of |
| the named capture group can match. Suppose you want to match the name |
| of a weekday, either as a 3-letter abbreviation or as the full name, |
| and in both cases you want to extract the abbreviation. This pattern |
| (ignoring the line breaks) does the job: |
| |
| (?J) |
| (?<DN>Mon|Fri|Sun)(?:day)?| |
| (?<DN>Tue)(?:sday)?| |
| (?<DN>Wed)(?:nesday)?| |
| (?<DN>Thu)(?:rsday)?| |
| (?<DN>Sat)(?:urday)? |
| |
| There are five capture groups, but only one is ever set after a match. |
| The convenience functions for extracting the data by name return the |
| substring for the first (and in this example, the only) group of that |
| name that matched. This saves searching to find which numbered group it |
| was. (An alternative way of solving this problem is to use a "branch |
| reset" group, as described in the previous section.) |
| |
| If you make a backreference to a non-unique named group from elsewhere |
| in the pattern, the groups to which the name refers are checked in the |
| order in which they appear in the overall pattern. The first one that |
| is set is used for the reference. For example, this pattern matches |
| both "foofoo" and "barbar" but not "foobar" or "barfoo": |
| |
| (?J)(?:(?<n>foo)|(?<n>bar))\k<n> |
| |
| |
| If you make a subroutine call to a non-unique named group, the one that |
| corresponds to the first occurrence of the name is used. In the absence |
| of duplicate numbers this is the one with the lowest number. |
| |
| If you use a named reference in a condition test (see the section about |
| conditions below), either to check whether a capture group has matched, |
| or to check for recursion, all groups with the same name are tested. If |
| the condition is true for any one of them, the overall condition is |
| true. This is the same behaviour as testing by number. For further de- |
| tails of the interfaces for handling named capture groups, see the |
| pcre2api documentation. |
| |
| |
| REPETITION |
| |
| Repetition is specified by quantifiers, which can follow any of the |
| following items: |
| |
| a literal data character |
| the dot metacharacter |
| the \C escape sequence |
| the \R escape sequence |
| the \X escape sequence |
| an escape such as \d or \pL that matches a single character |
| a character class |
| a backreference |
| a parenthesized group (including lookaround assertions) |
| a subroutine call (recursive or otherwise) |
| |
| The general repetition quantifier specifies a minimum and maximum num- |
| ber of permitted matches, by giving the two numbers in curly brackets |
| (braces), separated by a comma. The numbers must be less than 65536, |
| and the first must be less than or equal to the second. For example, |
| |
| z{2,4} |
| |
| matches "zz", "zzz", or "zzzz". A closing brace on its own is not a |
| special character. If the second number is omitted, but the comma is |
| present, there is no upper limit; if the second number and the comma |
| are both omitted, the quantifier specifies an exact number of required |
| matches. Thus |
| |
| [aeiou]{3,} |
| |
| matches at least 3 successive vowels, but may match many more, whereas |
| |
| \d{8} |
| |
| matches exactly 8 digits. An opening curly bracket that appears in a |
| position where a quantifier is not allowed, or one that does not match |
| the syntax of a quantifier, is taken as a literal character. For exam- |
| ple, {,6} is not a quantifier, but a literal string of four characters. |
| |
| In UTF modes, quantifiers apply to characters rather than to individual |
| code units. Thus, for example, \x{100}{2} matches two characters, each |
| of which is represented by a two-byte sequence in a UTF-8 string. Simi- |
| larly, \X{3} matches three Unicode extended grapheme clusters, each of |
| which may be several code units long (and they may be of different |
| lengths). |
| |
| The quantifier {0} is permitted, causing the expression to behave as if |
| the previous item and the quantifier were not present. This may be use- |
| ful for capture groups that are referenced as subroutines from else- |
| where in the pattern (but see also the section entitled "Defining cap- |
| ture groups for use by reference only" below). Except for parenthesized |
| groups, items that have a {0} quantifier are omitted from the compiled |
| pattern. |
| |
| For convenience, the three most common quantifiers have single-charac- |
| ter abbreviations: |
| |
| * is equivalent to {0,} |
| + is equivalent to {1,} |
| ? is equivalent to {0,1} |
| |
| It is possible to construct infinite loops by following a group that |
| can match no characters with a quantifier that has no upper limit, for |
| example: |
| |
| (a?)* |
| |
| Earlier versions of Perl and PCRE1 used to give an error at compile |
| time for such patterns. However, because there are cases where this can |
| be useful, such patterns are now accepted, but whenever an iteration of |
| such a group matches no characters, matching moves on to the next item |
| in the pattern instead of repeatedly matching an empty string. This |
| does not prevent backtracking into any of the iterations if a subse- |
| quent item fails to match. |
| |
| By default, quantifiers are "greedy", that is, they match as much as |
| possible (up to the maximum number of permitted times), without causing |
| the rest of the pattern to fail. The classic example of where this |
| gives problems is in trying to match comments in C programs. These ap- |
| pear between /* and */ and within the comment, individual * and / char- |
| acters may appear. An attempt to match C comments by applying the pat- |
| tern |
| |
| /\*.*\*/ |
| |
| to the string |
| |
| /* first comment */ not comment /* second comment */ |
| |
| fails, because it matches the entire string owing to the greediness of |
| the .* item. However, if a quantifier is followed by a question mark, |
| it ceases to be greedy, and instead matches the minimum number of times |
| possible, so the pattern |
| |
| /\*.*?\*/ |
| |
| does the right thing with the C comments. The meaning of the various |
| quantifiers is not otherwise changed, just the preferred number of |
| matches. Do not confuse this use of question mark with its use as a |
| quantifier in its own right. Because it has two uses, it can sometimes |
| appear doubled, as in |
| |
| \d??\d |
| |
| which matches one digit by preference, but can match two if that is the |
| only way the rest of the pattern matches. |
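| |
| The greedy and lazy patterns above can be compared side by side. This |
| is a sketch using Python's re module, which is a different engine but |
| is assumed to treat the ? suffix on a quantifier in the same way; the |
| variable names are illustrative. |
| |
```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last */ in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Lazy .*? stops at the first */ that lets the pattern match.
lazy = re.search(r"/\*.*?\*/", subject).group()

# \d??\d prefers one digit, but takes two when the rest of the
# pattern (here, the end-of-string requirement) demands it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```
| |
| greedy is the whole subject, lazy is "/* first comment */", one_digit |
| is "1", and two_digits is "12". |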
| |
| If the PCRE2_UNGREEDY option is set (an option that is not available in |
| Perl), the quantifiers are not greedy by default, but individual ones |
| can be made greedy by following them with a question mark. In other |
| words, it inverts the default behaviour. |
| |
| When a parenthesized group is quantified with a minimum repeat count |
| that is greater than 1 or with a limited maximum, more memory is re- |
| quired for the compiled pattern, in proportion to the size of the mini- |
| mum or maximum. |
| |
| If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option |
| (equivalent to Perl's /s) is set, thus allowing the dot to match new- |
| lines, the pattern is implicitly anchored, because whatever follows |
| will be tried against every character position in the subject string, |
| so there is no point in retrying the overall match at any position af- |
| ter the first. PCRE2 normally treats such a pattern as though it were |
| preceded by \A. |
| |
| In cases where it is known that the subject string contains no new- |
| lines, it is worth setting PCRE2_DOTALL in order to obtain this opti- |
| mization, or alternatively, using ^ to indicate anchoring explicitly. |
| |
| However, there are some cases where the optimization cannot be used. |
| When .* is inside capturing parentheses that are the subject of a |
| backreference elsewhere in the pattern, a match at the start may fail |
| where a later one succeeds. Consider, for example: |
| |
| (.*)abc\1 |
| |
| If the subject is "xyz123abc123" the match point is the fourth charac- |
| ter. For this reason, such a pattern is not implicitly anchored. |
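The match point in this example can be checked with a short sketch. It uses Python's re module as a stand-in for PCRE2 (an assumption for the sake of a runnable example); Python performs no implicit anchoring here, so it simply reports where the first successful match begins.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

match_start = m.start()    # zero-based offset of the match
matched_text = m.group(0)
captured = m.group(1)

print(match_start, matched_text, captured)
```

Offset 3 is the fourth character: attempts at offsets 0, 1 and 2 fail because the text captured before "abc" does not recur after it. An engine that treated the leading .* as anchored at the start would miss this match.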
| |
| Another case where implicit anchoring is not applied is when the lead- |
| ing .* is inside an atomic group. Once again, a match at the start may |
| fail where a later one succeeds. Consider this pattern: |
| |
| (?>.*?a)b |
| |
| It matches "ab" in the subject "aab". The use of the backtracking |
| control verbs (*PRUNE) and (*SKIP) also disables this optimization, and |
| there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly. |
| |
| When a capture group is repeated, the value captured is the substring |
| that matched the final iteration. For example, after |
| |
| (tweedle[dume]{3}\s*)+ |
| |
| has matched "tweedledum tweedledee" the value of the captured substring |
| is "tweedledee". However, if there are nested capture groups, the cor- |
| responding captured values may have been set in previous iterations. |
| For example, after |
| |
| (a|(b))+ |
| |
| matches "aba" the value of the second captured substring is "b". |
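Both statements about repeated capture groups can be observed directly. The sketch below uses Python's re module instead of PCRE2 (an assumption so that it runs without extra libraries); Python's engine is expected to behave the same way for these two patterns.

```python
import re

# The outer group is repeated; only the final iteration is kept.
m1 = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# The nested group (b) takes part only in the second iteration, but
# its value is still set after the third iteration matches "a".
m2 = re.match(r"(a|(b))+", "aba")
outer, inner = m2.group(1), m2.group(2)

print(last_iteration)
print(outer, inner)
```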
| |
| |
| ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS |
| |
| With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") |
| repetition, failure of what follows normally causes the repeated item |
| to be re-evaluated to see if a different number of repeats allows the |
| rest of the pattern to match. Sometimes it is useful to prevent this, |
| either to change the nature of the match, or to cause it to fail earlier |
| than it otherwise might, when the author of the pattern knows there is |
| no point in carrying on. |
| |
| Consider, for example, the pattern \d+foo when applied to the subject |
| line |
| |
| 123456bar |
| |
| After matching all 6 digits and then failing to match "foo", the normal |
| action of the matcher is to try again with only 5 digits matching the |
| \d+ item, and then with 4, and so on, before ultimately failing. |
| "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides |
| the means for specifying that once a group has matched, it is not to be |
| re-evaluated in this way. |
| |
| If we use atomic grouping for the previous example, the matcher gives |
| up immediately on failing to match "foo" the first time. The notation |
| is a kind of special parenthesis, starting with (?> as in this example: |
| |
| (?>\d+)foo |
| |
| Perl 5.28 introduced an experimental alphabetic form starting with (* |
| which may be easier to remember: |
| |
| (*atomic:\d+)foo |
| |
| This kind of parenthesized group "locks up" the part of the pattern it |
| contains once it has matched, and a failure further into the pattern is |
| prevented from backtracking into it. Backtracking past it to previous |
| items, however, works as normal. |
| |
| An alternative description is that a group of this type matches exactly |
| the string of characters that an identical standalone pattern would |
| match, if anchored at the current point in the subject string. |
| |
| Atomic groups are not capture groups. Simple cases such as the above |
| example can be thought of as a maximizing repeat that must swallow ev- |
| erything it can. So, while both \d+ and \d+? are prepared to adjust |
| the number of digits they match in order to make the rest of the pat- |
| tern match, (?>\d+) can only match an entire sequence of digits. |
| |
| Atomic groups in general can of course contain arbitrarily complicated |
| expressions, and can be nested. However, when the contents of an atomic |
| group are just a single repeated item, as in the example above, a |
| simpler notation, called a "possessive quantifier", can be used. This |
| consists of an additional + character following a quantifier. Using |
| this notation, the previous example can be rewritten as |
| |
| \d++foo |
| |
| Note that a possessive quantifier can be used with an entire group, for |
| example: |
| |
| (abc|xyz){2,3}+ |
| |
| Possessive quantifiers are always greedy; the setting of the PCRE2_UN- |
| GREEDY option is ignored. They are a convenient notation for the sim- |
| pler forms of atomic group. However, there is no difference in the |
| meaning of a possessive quantifier and the equivalent atomic group, |
| though there may be a performance difference; possessive quantifiers |
| should be slightly faster. |
| |
| The possessive quantifier syntax is an extension to the Perl 5.8 syn- |
| tax. Jeffrey Friedl originated the idea (and the name) in the first |
| edition of his book. Mike McCloskey liked it, so implemented it when he |
| built Sun's Java package, and PCRE1 copied it from there. It found its |
| way into Perl at release 5.10. |
| |
| PCRE2 has an optimization that automatically "possessifies" certain |
| simple pattern constructs. For example, the sequence A+B is treated as |
| A++B because there is no point in backtracking into a sequence of A's |
| when B must follow. This feature can be disabled by the PCRE2_NO_AUTO- |
| POSSESS option, or starting the pattern with (*NO_AUTO_POSSESS). |
| |
| When a pattern contains an unlimited repeat inside a group that can it- |
| self be repeated an unlimited number of times, the use of an atomic |
| group is the only way to avoid some failing matches taking a very long |
| time indeed. The pattern |
| |
| (\D+|<\d+>)*[!?] |
| |
| matches an unlimited number of substrings that either consist of non- |
| digits, or digits enclosed in <>, followed by either ! or ?. When it |
| matches, it runs quickly. However, if it is applied to |
| |
| aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
| |
| it takes a long time before reporting failure. This is because the |
| string can be divided between the internal \D+ repeat and the external |
| * repeat in a large number of ways, and all have to be tried. (The ex- |
| ample uses [!?] rather than a single character at the end, because both |
| PCRE2 and Perl have an optimization that allows for fast failure when a |
| single character is used. They remember the last single character that |
| is required for a match, and fail early if it is not present in the |
| string.) If the pattern is changed so that it uses an atomic group, |
| like this: |
| |
| ((?>\D+)|<\d+>)*[!?] |
| |
| sequences of non-digits cannot be broken, and failure happens quickly. |
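The effect of making the inner repeat atomic can be sketched without PCRE2. The example below uses Python's re module and, as a swapped-in technique, emulates the atomic group with a lookahead followed by a backreference: the lookahead captures the longest run of non-digits and, being atomic itself, is never re-entered, and \1 then consumes exactly that run. Python 3.11 and later also accept (?>...) directly; the emulation is used here only so that the sketch runs on older versions too.

```python
import re

# Emulated form of ((?>\D+)|<\d+>)*[!?]
# (?=(\D+))\1 behaves like (?>\D+): the run of non-digits is taken
# whole and cannot be split up by later backtracking.
pattern = re.compile(r"(?:(?=(\D+))\1|<\d+>)*[!?]")

# Fifty-two "a" characters and no terminating ! or ?
subject = "a" * 52
no_match = pattern.search(subject)

# A subject that does match.
ok = pattern.search("<12>?")

print(no_match)
print(ok.group())
```

With the atomic form the failing subject is rejected after a number of steps that grows roughly with the square of its length, instead of exponentially as with the original pattern, which is best not run against a subject of this size.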
| |
| |
| BACKREFERENCES |
| |
| Outside a character class, a backslash followed by a digit greater than |
| 0 (and possibly further digits) is a backreference to a capture group |
| earlier (that is, to its left) in the pattern, provided there have been |
| that many previous capture groups. |
| |
| However, if the decimal number following the backslash is less than 8, |
| it is always taken as a backreference, and causes an error only if |
| there are not that many capture groups in the entire pattern. In other |
| words, the group that is referenced need not be to the left of the ref- |
| erence for numbers less than 8. A "forward backreference" of this type |
| can make sense when a repetition is involved and the group to the right |
| has participated in an earlier iteration. |
| |
| It is not possible to have a numerical "forward backreference" to a |
| group whose number is 8 or more using this syntax because a sequence |
| such as \50 is interpreted as a character defined in octal. See the |
| subsection entitled "Non-printing characters" above for further details |
| of the handling of digits following a backslash. Other forms of back- |
| referencing do not suffer from this restriction. In particular, there |
| is no problem when named capture groups are used (see below). |
| |
| Another way of avoiding the ambiguity inherent in the use of digits |
| following a backslash is to use the \g escape sequence. This escape |
| must be followed by a signed or unsigned number, optionally enclosed in |
| braces. These examples are all identical: |
| |
| (ring), \1 |
| (ring), \g1 |
| (ring), \g{1} |
| |
| An unsigned number specifies an absolute reference without the ambigu- |
| ity that is present in the older syntax. It is also useful when literal |
| digits follow the reference. A signed number is a relative reference. |
| Consider this example: |
| |
| (abc(def)ghi)\g{-1} |
| |
| The sequence \g{-1} is a reference to the most recently started capture |
| group before \g; that is, it is equivalent to \2 in this example. |
| Similarly, \g{-2} would be equivalent to \1. The use of relative |
| references can be helpful in long patterns, and also in patterns that |
| are created by joining together fragments that contain references |
| within themselves. |
| |
| The sequence \g{+1} is a reference to the next capture group. This kind |
| of forward reference can be useful in patterns that repeat. Perl does |
| not support the use of + in this way. |
| |
| A backreference matches whatever actually most recently matched the |
| capture group in the current subject string, rather than anything at |
| all that matches the group (see "Groups as subroutines" below for a way |
| of doing that). So the pattern |
| |
| (sens|respons)e and \1ibility |
| |
| matches "sense and sensibility" and "response and responsibility", but |
| not "sense and responsibility". If caseful matching is in force at the |
| time of the backreference, the case of letters is relevant. For exam- |
| ple, |
| |
| ((?i)rah)\s+\1 |
| |
| matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
| original capture group is matched caselessly. |
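A short sketch makes both points concrete. It uses Python's re module in place of PCRE2 (an assumption so that it is runnable as is). Python does not accept (?i) in the middle of a pattern, so the scoped form (?i:rah) is used instead; the backreference itself stays outside the caseless part, as in the example above.

```python
import re

# The backreference must repeat what group 1 actually matched.
pattern = re.compile(r"(sens|respons)e and \1ibility")
subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
results = [bool(pattern.fullmatch(s)) for s in subjects]

# Group 1 is matched caselessly, but \1 is compared casefully.
caseful_ref = re.compile(r"((?i:rah))\s+\1")
rah_results = [
    bool(caseful_ref.fullmatch(s)) for s in ("rah rah", "RAH RAH", "RAH rah")
]

print(results)
print(rah_results)
```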
| |
| There are several different ways of writing backreferences to named |
| capture groups. The .NET syntax \k{name} and the Perl syntax \k<name> |
| or \k'name' are supported, as is the Python syntax (?P=name). Perl |
| 5.10's unified backreference syntax, in which \g can be used for both |
| numeric and named references, is also supported. We could rewrite the |
| above example in any of the following ways: |
| |
| (?<p1>(?i)rah)\s+\k<p1> |
| (?'p1'(?i)rah)\s+\k{p1} |
| (?P<p1>(?i)rah)\s+(?P=p1) |
| (?<p1>(?i)rah)\s+\g{p1} |
| |
| A capture group that is referenced by name may appear in the pattern |
| before or after the reference. |
| |
| There may be more than one backreference to the same group. If a group |
| has not actually been used in a particular match, backreferences to it |
| always fail by default. For example, the pattern |
| |
| (a|(bc))\2 |
| |
| always fails if it starts to match "a" rather than "bc". However, if |
| the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref- |
| erence to an unset value matches an empty string. |
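The default behaviour for an unset group can be seen in the following sketch, which uses Python's re module rather than PCRE2 (an assumption; Python likewise makes a reference to an unset group fail, and has no counterpart to PCRE2_MATCH_UNSET_BACKREF that is shown here).

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 is set only when the "bc" alternative is taken.
set_case = pattern.search("bcbc")

# When "a" is taken, group 2 is unset and \2 cannot match.
unset_case = pattern.search("aa")

print(set_case.group())
print(unset_case)
```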
| |
| Because there may be many capture groups in a pattern, all digits fol- |
| lowing a backslash are taken as part of a potential backreference num- |
| ber. If the pattern continues with a digit character, some delimiter |
| must be used to terminate the backreference. If the PCRE2_EXTENDED or |
| PCRE2_EXTENDED_MORE option is set, this can be white space. Otherwise, |
| the \g{} syntax or an empty comment (see "Comments" below) can be used. |
| |
| Recursive backreferences |
| |
| A backreference that occurs inside the group to which it refers fails |
| when the group is first used, so, for example, (a\1) never matches. |
| However, such references can be useful inside repeated groups. For ex- |
| ample, the pattern |
| |
| (a|b\1)+ |
| |
| matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- |
| ation of the group, the backreference matches the character string cor- |
| responding to the previous iteration. In order for this to work, the |
| pattern must be such that the first iteration does not need to match |
| the backreference. This can be done using alternation, as in the exam- |
| ple above, or by a quantifier with a minimum of zero. |
| |
| In versions of PCRE2 earlier than 10.25, backreferences of this type |
| caused the group that they reference to be treated as an atomic group. |
| This restriction no longer applies, and backtracking into such groups |
| can occur as normal. |
| |
| |
| ASSERTIONS |
| |
| An assertion is a test on the characters following or preceding the |
| current matching point that does not consume any characters. The simple |
| assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described |
| above. |
| |
| More complicated assertions are coded as parenthesized groups. There |
| are two kinds: those that look ahead of the current position in the |
| subject string, and those that look behind it, and in each case an as- |
| sertion may be positive (must match for the assertion to be true) or |
| negative (must not match for the assertion to be true). An assertion |
| group is matched in the normal way, and if it is true, matching contin- |
| ues after it, but with the matching position in the subject string re- |
| set to what it was before the assertion was processed. |
| |
| The Perl-compatible lookaround assertions are atomic. If an assertion |
| is true, but there is a subsequent matching failure, there is no back- |
| tracking into the assertion. However, there are some cases where non- |
| atomic assertions can be useful. PCRE2 has some support for these, de- |
| scribed in the section entitled "Non-atomic assertions" below, but they |
| are not Perl-compatible. |
| |
| A lookaround assertion may appear as the condition in a conditional |
| group (see below). In this case, the result of matching the assertion |
| determines which branch of the condition is followed. |
| |
| Assertion groups are not capture groups. If an assertion contains cap- |
| ture groups within it, these are counted for the purposes of numbering |
| the capture groups in the whole pattern. Within each branch of an as- |
| sertion, locally captured substrings may be referenced in the usual |
| way. For example, a sequence such as (.)\g{-1} can be used to check |
| that two adjacent characters are the same. |
| |
| When a branch within an assertion fails to match, any substrings that |
| were captured are discarded (as happens with any pattern branch that |
| fails to match). A negative assertion is true only when all its |
| branches fail to match; this means that no captured substrings are ever |
| retained after a successful negative assertion. When an assertion con- |
| tains a matching branch, what happens depends on the type of assertion. |
| |
| For a positive assertion, internally captured substrings in the suc- |
| cessful branch are retained, and matching continues with the next pat- |
| tern item after the assertion. For a negative assertion, a matching |
| branch means that the assertion is not true. If such an assertion is |
| being used as a condition in a conditional group (see below), captured |
| substrings are retained, because matching continues with the "no" |
| branch of the condition. For other failing negative assertions, control |
| passes to the previous backtracking point, thus discarding any captured |
| strings within the assertion. |
| |
| Most assertion groups may be repeated; though it makes no sense to as- |
| sert the same thing several times, the side effect of capturing in pos- |
| itive assertions may occasionally be useful. However, an assertion that |
| forms the condition for a conditional group may not be quantified. |
| PCRE2 used to restrict the repetition of assertions, but from release |
| 10.35 the only restriction is that an unlimited maximum repetition is |
| changed to be one more than the minimum. For example, {3,} is treated |
| as {3,4}. |
| |
| Alphabetic assertion names |
| |
| Traditionally, symbolic sequences such as (?= and (?<= have been used |
| to specify lookaround assertions. Perl 5.28 introduced some experimen- |
| tal alphabetic alternatives which might be easier to remember. They all |
| start with (* instead of (? and must be written using lower case let- |
| ters. PCRE2 supports the following synonyms: |
| |
| (*positive_lookahead: or (*pla: is the same as (?= |
| (*negative_lookahead: or (*nla: is the same as (?! |
| (*positive_lookbehind: or (*plb: is the same as (?<= |
| (*negative_lookbehind: or (*nlb: is the same as (?<! |
| |
| For example, (*pla:foo) is the same assertion as (?=foo). In the fol- |
| lowing sections, the various assertions are described using the origi- |
| nal symbolic forms. |
| |
| Lookahead assertions |
| |
| Lookahead assertions start with (?= for positive assertions and (?! for |
| negative assertions. For example, |
| |
| \w+(?=;) |
| |
| matches a word followed by a semicolon, but does not include the semi- |
| colon in the match, and |
| |
| foo(?!bar) |
| |
| matches any occurrence of "foo" that is not followed by "bar". Note |
| that the apparently similar pattern |
| |
| (?!foo)bar |
| |
| does not find an occurrence of "bar" that is preceded by something |
| other than "foo"; it finds any occurrence of "bar" whatsoever, because |
| the assertion (?!foo) is always true when the next three characters are |
| "bar". A lookbehind assertion is needed to achieve the other effect. |
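The three lookahead patterns above can be exercised with a few lines. This sketch uses Python's re module as a stand-in for PCRE2 (an assumption for runnability); lookahead syntax and semantics are the same for these cases.

```python
import re

# The semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "total = count;").group()

# Only the "foo" that is not followed by "bar" is found.
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested at the point where "bar" begins, where it is
# always true, so "bar" is found even though "foo" precedes it.
bar_start = re.search(r"(?!foo)bar", "foobar").start()

print(word, foo_starts, bar_start)
```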
| |
| If you want to force a matching failure at some point in a pattern, the |
| most convenient way to do it is with (?!) because an empty string al- |
| ways matches, so an assertion that requires there not to be an empty |
| string must always fail. The backtracking control verb (*FAIL) or (*F) |
| is a synonym for (?!). |
| |
| Lookbehind assertions |
| |
| Lookbehind assertions start with (?<= for positive assertions and (?<! |
| for negative assertions. For example, |
| |
| (?<!foo)bar |
| |
| does find an occurrence of "bar" that is not preceded by "foo". The |
| contents of a lookbehind assertion are restricted such that all the |
| strings it matches must have a fixed length. However, if there are sev- |
| eral top-level alternatives, they do not all have to have the same |
| fixed length. Thus |
| |
| (?<=bullock|donkey) |
| |
| is permitted, but |
| |
| (?<!dogs?|cats?) |
| |
| causes an error at compile time. Branches that match different length |
| strings are permitted only at the top level of a lookbehind assertion. |
| This is an extension compared with Perl, which requires all branches to |
| match the same length of string. An assertion such as |
| |
| (?<=ab(c|de)) |
| |
| is not permitted, because its single top-level branch can match two |
| different lengths, but it is acceptable to PCRE2 if rewritten to use |
| two top-level branches: |
| |
| (?<=abc|abde) |
| |
| In some cases, the escape sequence \K (see above) can be used instead |
| of a lookbehind assertion to get round the fixed-length restriction. |
| |
| The implementation of lookbehind assertions is, for each alternative, |
| to temporarily move the current position back by the fixed length and |
| then try to match. If there are insufficient characters before the cur- |
| rent position, the assertion fails. |
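The following sketch shows a negative lookbehind and the effect of having too few characters before the current position. It uses Python's re module instead of PCRE2 (an assumption for runnability). Python, like Perl, requires every alternative in a lookbehind to have the same length, so the PCRE2 extension that allows (?<=bullock|donkey) is not demonstrated here.

```python
import re

pattern = re.compile(r"(?<!foo)bar")

# "bar" preceded by "foo" is rejected.
after_foo = pattern.search("foobar")

# "bar" preceded by something else is found.
after_baz = pattern.search("bazbar").start()

# At the very start there are fewer than three characters to step
# back over, so the inner match fails and the negative assertion
# is therefore true.
at_start = pattern.search("bar").start()

# A positive lookbehind fails when there are not enough characters:
# only two characters precede "d" here.
too_short = re.search(r"(?<=abc)d", "bcd")

print(after_foo, after_baz, at_start, too_short)
```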
| |
| In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which |
| matches a single code unit even in a UTF mode) to appear in lookbehind |
| assertions, because it makes it impossible to calculate the length of |
| the lookbehind. The \X and \R escapes, which can match different num- |
| bers of code units, are never permitted in lookbehinds. |
| |
| "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in |
| lookbehinds, as long as the called capture group matches a fixed-length |
| string. However, recursion, that is, a "subroutine" call into a group |
| that is already active, is not supported. |
| |
| Perl does not support backreferences in lookbehinds. PCRE2 does support |
| them, but only if certain conditions are met. The PCRE2_MATCH_UN- |
| SET_BACKREF option must not be set, there must be no use of (?| in the |
| pattern (it creates duplicate group numbers), and if the backreference |
| is by name, the name must be unique. Of course, the referenced group |
| must itself match a fixed length substring. The following pattern |
| matches words containing at least two characters that begin and end |
| with the same character: |
| |
| \b(\w)\w++(?<=\1) |
| |
| Possessive quantifiers can be used in conjunction with lookbehind as- |
| sertions to specify efficient matching of fixed-length strings at the |
| end of subject strings. Consider a simple pattern such as |
| |
| abcd$ |
| |
| when applied to a long string that does not match. Because matching |
| proceeds from left to right, PCRE2 will look for each "a" in the sub- |
| ject and then see if what follows matches the rest of the pattern. If |
| the pattern is specified as |
| |
| ^.*abcd$ |
| |
| the initial .* matches the entire string at first, but when this fails |
| (because there is no following "a"), it backtracks to match all but the |
| last character, then all but the last two characters, and so on. Once |
| again the search for "a" covers the entire string, from right to left, |
| so we are no better off. However, if the pattern is written as |
| |
| ^.*+(?<=abcd) |
| |
| there can be no backtracking for the .*+ item because of the possessive |
| quantifier; it can match only the entire string. The subsequent lookbe- |
| hind assertion does a single test on the last four characters. If it |
| fails, the match fails immediately. For long strings, this approach |
| makes a significant difference to the processing time. |
| |
| Using multiple assertions |
| |
| Several assertions (of any sort) may occur in succession. For example, |
| |
| (?<=\d{3})(?<!999)foo |
| |
| matches "foo" preceded by three digits that are not "999". Notice that |
| each of the assertions is applied independently at the same point in |
| the subject string. First there is a check that the previous three |
| characters are all digits, and then there is a check that the same |
| three characters are not "999". This pattern does not match "foo" |
| preceded by six characters, the first three of which are digits and the |
| last three of which are not "999". For example, it doesn't match |
| "123abcfoo". A pattern to do that is |
| |
| (?<=\d{3}...)(?<!999)foo |
| |
| This time the first assertion looks at the preceding six characters, |
| checking that the first three are digits, and then the second assertion |
| checks that the preceding three characters are not "999". |
| |
| Assertions can be nested in any combination. For example, |
| |
| (?<=(?<!foo)bar)baz |
| |
| matches an occurrence of "baz" that is preceded by "bar" which in turn |
| is not preceded by "foo", while |
| |
| (?<=\d{3}(?!999)...)foo |
| |
| is another pattern that matches "foo" preceded by three digits and any |
| three characters that are not "999". |
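The difference between the two successive-assertion patterns is easy to check. The sketch below uses Python's re module in place of PCRE2 (an assumption so that it runs with a standard library); both lookbehinds are fixed-length, so Python accepts them unchanged.

```python
import re

# Both assertions look at the same three preceding characters.
three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second three.
six = re.compile(r"(?<=\d{3}...)(?<!999)foo")

three_results = [bool(three.search(s)) for s in ("123foo", "999foo", "123abcfoo")]
six_results = [bool(six.search(s)) for s in ("123abcfoo", "123999foo")]

print(three_results)
print(six_results)
```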
| |
| |
| NON-ATOMIC ASSERTIONS |
| |
| The traditional Perl-compatible lookaround assertions are atomic. That |
| is, if an assertion is true, but there is a subsequent matching fail- |
| ure, there is no backtracking into the assertion. However, there are |
| some cases where non-atomic positive assertions can be useful. PCRE2 |
| provides these using the following syntax: |
| |
| (*non_atomic_positive_lookahead: or (*napla: or (?* |
| (*non_atomic_positive_lookbehind: or (*naplb: or (?<* |
| |
| Consider the problem of finding the right-most word in a string that |
| also appears earlier in the string, that is, it must appear at least |
| twice in total. This pattern returns the required result as captured |
| substring 1: |
| |
| ^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2} |
| |
| For a subject such as "word1 word2 word3 word2 word3 word4" the result |
| is "word3". How does it work? At the start, ^(?x) anchors the pattern |
| and sets the "x" option, which causes white space (introduced for read- |
| ability) to be ignored. Inside the assertion, the greedy .* at first |
| consumes the entire string, but then has to backtrack until the rest of |
| the assertion can match a word, which is captured by group 1. In other |
| words, when the assertion first succeeds, it captures the right-most |
| word in the string. |
| |
| The current matching point is then reset to the start of the subject, |
| and the rest of the pattern match checks for two occurrences of the |
| captured word, using an ungreedy .*? to scan from the left. If this |
| succeeds, we are done, but if the last word in the string does not |
| occur twice, this part of the pattern fails. If a traditional atomic |
| lookahead (?= or (*pla: had been used, the assertion could not be |
| re-entered, and the whole match would fail. The pattern would succeed |
| only if the very last word in the subject was found twice. |
| |
| Using a non-atomic lookahead, however, means that when the last word |
| does not occur twice in the string, the lookahead can backtrack and |
| find the second-last word, and so on, until either the match succeeds, |
| or all words have been tested. |
| |
| Two conditions must be met for a non-atomic assertion to be useful: the |
| contents of one or more capturing groups must change after a backtrack |
| into the assertion, and there must be a backreference to a changed |
| group later in the pattern. If this is not the case, the rest of the |
| pattern match fails exactly as before because nothing has changed, so |
| using a non-atomic assertion just wastes resources. |
| |
| There is one exception to backtracking into a non-atomic assertion. If |
| an (*ACCEPT) control verb is triggered, the assertion succeeds atomi- |
| cally. That is, a subsequent match failure cannot backtrack into the |
| assertion. |
| |
| Non-atomic assertions are not supported by the alternative matching |
| function pcre2_dfa_match(). They are supported by JIT, but only if they |
| do not contain any control verbs such as (*ACCEPT). (This may change in |
| the future.) Note that assertions that appear as conditions for |
| conditional groups (see below) must be atomic. |
| |
| |
| SCRIPT RUNS |
| |
| In concept, a script run is a sequence of characters that are all from |
| the same Unicode script such as Latin or Greek. However, because some |
| scripts are commonly used together, and because some diacritical and |
| other marks are used with multiple scripts, it is not that simple. |
| There is a full description of the rules that PCRE2 uses in the section |
| entitled "Script Runs" in the pcre2unicode documentation. |
| |
| If part of a pattern is enclosed between (*script_run: or (*sr: and a |
| closing parenthesis, it fails if the sequence of characters that it |
| matches is not a script run. After a failure, normal backtracking |
| occurs. Script runs can be used to detect spoofing attacks using |
| characters that look the same, but are from different scripts. The |
| string "paypal.com" is an infamous example, where the letters could be |
| a mixture of Latin and Cyrillic. This pattern ensures that the matched |
| characters in a sequence of non-spaces that follow white space are a |
| script run: |
| |
| \s+(*sr:\S+) |
| |
| To be sure that they are all from the Latin script (for example), a |
| lookahead can be used: |
| |
| \s+(?=\p{Latin})(*sr:\S+) |
| |
| This works as long as the first character is expected to be a character |
| in that script, and not (for example) punctuation, which is allowed |
| with any script. If this is not the case, a more creative lookahead is |
| needed. For example, if digits, underscore, and dots are permitted at |
| the start: |
| |
| \s+(?=[0-9_.]*\p{Latin})(*sr:\S+) |
| |
| |
| In many cases, backtracking into a script run pattern fragment is not |
| desirable. The script run can employ an atomic group to prevent this. |
| Because this is a common requirement, a shorthand notation is provided |
| by (*atomic_script_run: or (*asr: |
| |
| (*asr:...) is the same as (*sr:(?>...)) |
| |
| Note that the atomic group is inside the script run. Putting it outside |
| would not prevent backtracking into the script run pattern. |
| |
| Support for script runs is not available if PCRE2 is compiled without |
| Unicode support. A compile-time error is given if any of the above con- |
| structs is encountered. Script runs are not supported by the alternate |
| matching function, pcre2_dfa_match() because they use the same mecha- |
| nism as capturing parentheses. |
| |
| Warning: The (*ACCEPT) control verb (see below) should not be used |
| within a script run group, because it causes an immediate exit from the |
| group, bypassing the script run checking. |
| |
| |
| CONDITIONAL GROUPS |
| |
| It is possible to cause the matching process to obey a pattern fragment |
| conditionally or to choose between two alternative fragments, depending |
| on the result of an assertion, or whether a specific capture group has |
| already been matched. The two possible forms of conditional group are: |
| |
| (?(condition)yes-pattern) |
| (?(condition)yes-pattern|no-pattern) |
| |
| If the condition is satisfied, the yes-pattern is used; otherwise the |
| no-pattern (if present) is used. An absent no-pattern is equivalent to |
| an empty string (it always matches). If there are more than two alter- |
| natives in the group, a compile-time error occurs. Each of the two al- |
| ternatives may itself contain nested groups of any form, including con- |
| ditional groups; the restriction to two alternatives applies only at |
| the level of the condition itself. This pattern fragment is an example |
| where the alternatives are complex: |
| |
| (?(1) (A|B|C) | (D | (?(2)E|F) | E) ) |
| |
| |
| There are five kinds of condition: references to capture groups, refer- |
| ences to recursion, two pseudo-conditions called DEFINE and VERSION, |
| and assertions. |
| |
| Checking for a used capture group by number |
| |
| If the text between the parentheses consists of a sequence of digits, |
| the condition is true if a capture group of that number has previously |
| matched. If there is more than one capture group with the same number |
| (see the earlier section about duplicate group numbers), the condition |
| is true if any of them have matched. An alternative notation is to pre- |
| cede the digits with a plus or minus sign. In this case, the group num- |
| ber is relative rather than absolute. The most recently opened capture |
| group can be referenced by (?(-1), the next most recent by (?(-2), and |
| so on. Inside loops it can also make sense to refer to subsequent |
| groups. The next capture group can be referenced as (?(+1), and so on. |
| (The value zero in any of these forms is not used; it provokes a com- |
| pile-time error.) |
| |
| Consider the following pattern, which contains non-significant white |
| space to make it more readable (assume the PCRE2_EXTENDED option) and |
| to divide it into three parts for ease of discussion: |
| |
| ( \( )? [^()]+ (?(1) \) ) |
| |
| The first part matches an optional opening parenthesis, and if that |
| character is present, sets it as the first captured substring. The sec- |
| ond part matches one or more characters that are not parentheses. The |
| third part is a conditional group that tests whether or not the first |
| capture group matched. If it did, that is, if the subject started with an |
| opening parenthesis, the condition is true, and so the yes-pattern is |
| executed and a closing parenthesis is required. Otherwise, since no- |
| pattern is not present, the conditional group matches nothing. In other |
| words, this pattern matches a sequence of non-parentheses, optionally |
| enclosed in parentheses. |
| |
| If you were embedding this pattern in a larger one, you could use a |
| relative reference: |
| |
| ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... |
| |
| This makes the fragment independent of the parentheses in the larger |
| pattern. |
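| |
| As an illustration only, the absolute-reference form can be tried in |
| Python, whose standard re module is not PCRE2 but accepts the same |
| (?(1)...) syntax; re.VERBOSE stands in for PCRE2_EXTENDED. The relative |
| form (?(-1)...) is not accepted by that module, so it is not shown. The |
| name BRACKETED is invented for this sketch. |
| |
```python
import re

# Optional "(" captured as group 1; ")" is required only if group 1 matched.
BRACKETED = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

for subject in ["abc", "(abc)", "(abc", "abc)"]:
    matched = BRACKETED.fullmatch(subject) is not None
    print(f"{subject!r}: {'match' if matched else 'no match'}")
```
| |
| The whole subject is required to match here (fullmatch), which is why |
| the two unbalanced subjects are rejected rather than partly matched. |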
| |
| Checking for a used capture group by name |
| |
| Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
| used capture group by name. For compatibility with earlier versions of |
| PCRE1, which had this facility before Perl, the syntax (?(name)...) is |
| also recognized. Note, however, that undelimited names consisting of |
| the letter R followed by digits are ambiguous (see the following sec- |
| tion). Rewriting the above example to use a named group gives this: |
| |
| (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) |
| |
| If the name used in a condition of this kind is a duplicate, the test |
| is applied to all groups of the same name, and is true if any one of |
| them has matched. |
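| |
| A named variant can be sketched the same way in Python's re module, |
| again only as an illustration. That module spells the group as |
| (?P<OPEN>...) and the test as (?(OPEN)...), which are the Python-style |
| and undelimited forms respectively. The name NAMED is an assumption of |
| this sketch. |
| |
```python
import re

NAMED = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)

m = NAMED.fullmatch("(abc)")
print(m.group("OPEN"))   # the captured opening parenthesis

m = NAMED.fullmatch("abc")
print(m.group("OPEN"))   # None: the group took no part in the match
```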
| |
| Checking for pattern recursion |
| |
| "Recursion" in this sense refers to any subroutine-like call from one |
| part of the pattern to another, whether or not it is actually recur- |
| sive. See the sections entitled "Recursive patterns" and "Groups as |
| subroutines" below for details of recursion and subroutine calls. |
| |
| If a condition is the string (R), and there is no capture group with |
| the name R, the condition is true if matching is currently in a recur- |
| sion or subroutine call to the whole pattern or any capture group. If |
| digits follow the letter R, and there is no group with that name, the |
| condition is true if the most recent call is into a group with the |
| given number, which must exist somewhere in the overall pattern. This |
| is a contrived example that is equivalent to a+b: |
| |
| ((?(R1)a+|(?1)b)) |
| |
| However, in both cases, if there is a capture group with a matching |
| name, the condition tests for its being set, as described in the sec- |
| tion above, instead of testing for recursion. For example, creating a |
| group with the name R1 by adding (?<R1>) to the above pattern com- |
| pletely changes its meaning. |
| |
| If a name preceded by ampersand follows the letter R, for example: |
| |
| (?(R&name)...) |
| |
| the condition is true if the most recent recursion is into a group of |
| that name (which must exist within the pattern). |
| |
| This condition does not check the entire recursion stack. It tests only |
| the current level. If the name used in a condition of this kind is a |
| duplicate, the test is applied to all groups of the same name, and is |
| true if any one of them is the most recent recursion. |
| |
| At "top level", all these recursion test conditions are false. |
| |
| Defining capture groups for use by reference only |
| |
| If the condition is the string (DEFINE), the condition is always false, |
| even if there is a group with the name DEFINE. In this case, there may |
| be only one alternative in the rest of the conditional group. It is al- |
| ways skipped if control reaches this point in the pattern; the idea of |
| DEFINE is that it can be used to define subroutines that can be refer- |
| enced from elsewhere. (The use of subroutines is described below.) For |
| example, a pattern to match an IPv4 address such as "192.168.23.245" |
| could be written like this (ignore white space and line breaks): |
| |
| (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
| \b (?&byte) (\.(?&byte)){3} \b |
| |
| The first part of the pattern is a DEFINE group inside which another |
| group named "byte" is defined. This matches an individual component of |
| an IPv4 address (a number less than 256). When matching takes place, |
| this part of the pattern is skipped because DEFINE acts like a false |
| condition. The rest of the pattern uses references to the named group |
| to match the four dot-separated components of an IPv4 address, insist- |
| ing on a word boundary at each end. |
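| |
| Python's re module has neither DEFINE nor subroutine calls, so the |
| following is only an approximation of the idea: the "byte" fragment is |
| written once and reused by building the pattern text, where PCRE2 would |
| call the named group. BYTE and IPV4 are names invented for this sketch. |
| |
```python
import re

# The "byte" definition, written once (same alternatives as above).
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Textual reuse stands in for the four (?&byte) calls.
IPV4 = re.compile(r"\b" + BYTE + r"(?:\." + BYTE + r"){3}\b")

print(IPV4.fullmatch("192.168.23.245") is not None)
print(IPV4.fullmatch("192.168.23.256") is not None)
```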
| |
| Checking the PCRE2 version |
| |
| Programs that link with a PCRE2 library can check the version by call- |
| ing pcre2_config() with appropriate arguments. Users of applications |
| that do not have access to the underlying code cannot do this. A spe- |
| cial "condition" called VERSION exists to allow such users to discover |
| which version of PCRE2 they are dealing with by using this condition to |
| match a string such as "yesno". VERSION must be followed either by "=" |
| or ">=" and a version number. For example: |
| |
| (?(VERSION>=10.4)yes|no) |
| |
| This pattern matches "yes" if the PCRE2 version is greater than or equal |
| to 10.4, or "no" otherwise. The fractional part of the version number |
| may not contain more than two digits. |
| |
| Assertion conditions |
| |
| If the condition is not in any of the above formats, it must be a |
| parenthesized assertion. This may be a positive or negative lookahead |
| or lookbehind assertion. However, it must be a traditional atomic as- |
| sertion, not one of the PCRE2-specific non-atomic assertions. |
| |
| Consider this pattern, again containing non-significant white space, |
| and with the two alternatives on the second line: |
| |
| (?(?=[^a-z]*[a-z]) |
| \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
| |
| The condition is a positive lookahead assertion that matches an op- |
| tional sequence of non-letters followed by a letter. In other words, it |
| tests for the presence of at least one letter in the subject. If a let- |
| ter is found, the subject is matched against the first alternative; |
| otherwise it is matched against the second. This pattern matches |
| strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are |
| letters and dd are digits. |
| |
| When an assertion that is a condition contains capture groups, any cap- |
| turing that occurs in a matching branch is retained afterwards, for |
| both positive and negative assertions, because matching always contin- |
| ues after the assertion, whether it succeeds or fails. (Compare non- |
| conditional assertions, for which captures are retained only for posi- |
| tive assertions that succeed.) |
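| |
| Python's re module does not accept an assertion as a condition, but the |
| choice made by the date-like pattern above can be imitated by pairing |
| the lookahead with its negation: (?(?=X)A|B) selects the same branch as |
| (?:(?=X)A|(?!X)B). This is a sketch of the selection logic only; it |
| says nothing about capture retention. DATE_LIKE is an invented name. |
| |
```python
import re

DATE_LIKE = re.compile(
    r"""(?: (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead
          | (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}      # no letter lies ahead
        )""",
    re.VERBOSE,
)

for subject in ["12-abc-34", "12-34-56", "12-ab-34"]:
    print(subject, DATE_LIKE.fullmatch(subject) is not None)
```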
| |
| |
| COMMENTS |
| |
| There are two ways of including comments in patterns that are processed |
| by PCRE2. In both cases, the start of the comment must not be in a |
| character class, nor in the middle of any other sequence of related |
| characters such as (?: or a group name or number. The characters that |
| make up a comment play no part in the pattern matching. |
| |
| The sequence (?# marks the start of a comment that continues up to the |
| next closing parenthesis. Nested parentheses are not permitted. If the |
| PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # |
| character also introduces a comment, which in this case continues to |
| immediately after the next newline character or character sequence in |
| the pattern. Which characters are interpreted as newlines is controlled |
| by an option passed to the compiling function or by a special sequence |
| at the start of the pattern, as described in the section entitled "New- |
| line conventions" above. Note that the end of this type of comment is a |
| literal newline sequence in the pattern; escape sequences that happen |
| to represent a newline do not count. For example, consider this pattern |
| when PCRE2_EXTENDED is set, and the default newline convention (a sin- |
| gle linefeed character) is in force: |
| |
| abc #comment \n still comment |
| |
| On encountering the # character, pcre2_compile() skips along, looking |
| for a newline in the pattern. The sequence \n is still literal at this |
| stage, so it does not terminate the comment. Only an actual character |
| with the code value 0x0a (the default newline) does so. |
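| |
| The distinction between a literal newline and the escape sequence \n |
| can be observed in Python's re.VERBOSE mode, which handles # comments |
| in the same way in this respect. This is an illustration using a |
| different engine, not PCRE2 itself; ESCAPED and LITERAL are invented |
| names. |
| |
```python
import re

# Raw string: backslash followed by "n" is two pattern characters, so the
# comment runs to the end of the pattern and only "abc" remains.
ESCAPED = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real linefeed (0x0a) inside the pattern, so
# the comment ends there and "def" is part of the pattern again.
LITERAL = re.compile("abc #comment \n def", re.VERBOSE)

print(ESCAPED.fullmatch("abc") is not None)
print(LITERAL.fullmatch("abcdef") is not None)
```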
| |
| |
| RECURSIVE PATTERNS |
| |
| Consider the problem of matching a string in parentheses, allowing for |
| unlimited nested parentheses. Without the use of recursion, the best |
| that can be done is to use a pattern that matches up to some fixed |
| depth of nesting. It is not possible to handle an arbitrary nesting |
| depth. |
| |
| For some time, Perl has provided a facility that allows regular expres- |
| sions to recurse (amongst other things). It does this by interpolating |
| Perl code in the expression at run time, and the code can refer to the |
| expression itself. A Perl pattern using code interpolation to solve the |
| parentheses problem can be created like this: |
| |
| $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; |
| |
| The (?p{...}) item interpolates Perl code at run time, and in this case |
| refers recursively to the pattern in which it appears. |
| |
| Obviously, PCRE2 cannot support the interpolation of Perl code. In- |
| stead, it supports special syntax for recursion of the entire pattern, |
| and also for individual capture group recursion. After its introduction |
| in PCRE1 and Python, this kind of recursion was subsequently introduced |
| into Perl at release 5.10. |
| |
| A special item that consists of (? followed by a number greater than |
| zero and a closing parenthesis is a recursive subroutine call of the |
| capture group of the given number, provided that it occurs inside that |
| group. (If not, it is a non-recursive subroutine call, which is de- |
| scribed in the next section.) The special item (?R) or (?0) is a recur- |
| sive call of the entire regular expression. |
| |
| This PCRE2 pattern solves the nested parentheses problem (assume the |
| PCRE2_EXTENDED option is set so that white space is ignored): |
| |
| \( ( [^()]++ | (?R) )* \) |
| |
| First it matches an opening parenthesis. Then it matches any number of |
| substrings which can either be a sequence of non-parentheses, or a re- |
| cursive match of the pattern itself (that is, a correctly parenthesized |
| substring). Finally there is a closing parenthesis. Note the use of a |
| possessive quantifier to avoid backtracking into sequences of non- |
| parentheses. |
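| |
| To make the control flow concrete, the pattern can be mirrored by a |
| small hand-written recursive function. This is not how PCRE2 works |
| internally, and Python's re module has no (?R); the function name |
| match_parens is invented. Each commented step corresponds to one item |
| of the pattern, matched at a given starting offset. |
| |
```python
def match_parens(subject, pos=0):
    """Return the end offset of a balanced group starting at pos, or None."""
    if pos >= len(subject) or subject[pos] != "(":        # \(
        return None
    pos += 1
    while pos < len(subject):                              # ( ... )*
        if subject[pos] == "(":                            # (?R)
            end = match_parens(subject, pos)
            if end is None:
                return None
            pos = end
        elif subject[pos] != ")":                          # [^()]++
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
        else:
            break
    if pos < len(subject) and subject[pos] == ")":         # \)
        return pos + 1
    return None


print(match_parens("(ab(cd)ef)"))
print(match_parens("(a(b)"))
```
| |
| Because the two alternatives start with different characters, the |
| function never needs to revisit a run of non-parentheses, which is the |
| effect the possessive quantifier has in the pattern. |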
| |
| If this were part of a larger pattern, you would not want to recurse |
| the entire pattern, so instead you could use this: |
| |
| ( \( ( [^()]++ | (?1) )* \) ) |
| |
| We have put the pattern into parentheses, and caused the recursion to |
| refer to them instead of the whole pattern. |
| |
| In a larger pattern, keeping track of parenthesis numbers can be |
| tricky. This is made easier by the use of relative references. Instead |
| of (?1) in the pattern above you can write (?-2) to refer to the second |
| most recently opened parentheses preceding the recursion. In other |
| words, a negative number counts capturing parentheses leftwards from |
| the point at which it is encountered. |
| |
| Be aware, however, that if duplicate capture group numbers are in use, |
| relative references refer to the earliest group with the appropriate |
| number. Consider, for example: |
| |
| (?|(a)|(b)) (c) (?-2) |
| |
| The first two capture groups (a) and (b) are both numbered 1, and group |
| (c) is number 2. When the reference (?-2) is encountered, the second |
| most recently opened parentheses has the number 1, but it is the first |
| such group (the (a) group) to which the recursion refers. This would be |
| the same if an absolute reference (?1) was used. In other words, rela- |
| tive references are just a shorthand for computing a group number. |
| |
| It is also possible to refer to subsequent capture groups, by writing |
| references such as (?+2). However, these cannot be recursive because |
| the reference is not inside the parentheses that are referenced. They |
| are always non-recursive subroutine calls, as described in the next |
| section. |
| |
| An alternative approach is to use named parentheses. The Perl syntax |
| for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup- |
| ported. We could rewrite the above example as follows: |
| |
| (?<pn> \( ( [^()]++ | (?&pn) )* \) ) |
| |
| If there is more than one group with the same name, the earliest one is |
| used. |
| |
| The example pattern that we have been looking at contains nested unlim- |
| ited repeats, and so the use of a possessive quantifier for matching |
| strings of non-parentheses is important when applying the pattern to |
| strings that do not match. For example, when this pattern is applied to |
| |
| (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
| |
| it yields "no match" quickly. However, if a possessive quantifier is |
| not used, the match runs for a very long time indeed because there are |
| so many different ways the + and * repeats can carve up the subject, |
| and all have to be tested before failure can be reported. |
| |
| At the end of a match, the values of capturing parentheses are those |
| from the outermost level. If you want to obtain intermediate values, a |
| callout function can be used (see below and the pcre2callout documenta- |
| tion). If the pattern above is matched against |
| |
| (ab(cd)ef) |
| |
| the value for the inner capturing parentheses (numbered 2) is "ef", |
| which is the last value taken on at the top level. If a capture group |
| is not matched at the top level, its final captured value is unset, |
| even if it was (temporarily) set at a deeper level during the matching |
| process. |
| |
| Do not confuse the (?R) item with the condition (R), which tests for |
| recursion. Consider this pattern, which matches text in angle brack- |
| ets, allowing for arbitrary nesting. Only digits are allowed in nested |
| brackets (that is, when recursing), whereas any characters are permit- |
| ted at the outer level. |
| |
| < (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
| |
| In this pattern, (?(R) is the start of a conditional group, with two |
| different alternatives for the recursive and non-recursive cases. The |
| (?R) item is the actual recursive call. |
| |
| Differences in recursion processing between PCRE2 and Perl |
| |
| Some former differences between PCRE2 and Perl no longer exist. |
| |
| Before release 10.30, recursion processing in PCRE2 differed from Perl |
| in that a recursive subroutine call was always treated as an atomic |
| group. That is, once it had matched some of the subject string, it was |
| never re-entered, even if it contained untried alternatives and there |
| was a subsequent matching failure. (Historical note: PCRE implemented |
| recursion before Perl did.) |
| |
| Starting with release 10.30, recursive subroutine calls are no longer |
| treated as atomic. That is, they can be re-entered to try unused alter- |
| natives if there is a matching failure later in the pattern. This is |
| now compatible with the way Perl works. If you want a subroutine call |
| to be atomic, you must explicitly enclose it in an atomic group. |
| |
| Supporting backtracking into recursions simplifies certain types of re- |
| cursive pattern. For example, this pattern matches palindromic strings: |
| |
| ^((.)(?1)\2|.?)$ |
| |
| The second branch in the group matches a single central character in |
| the palindrome when there are an odd number of characters, or nothing |
| when there are an even number of characters, but in order to work it |
| has to be able to try the second case when the rest of the pattern |
| match fails. If you want to match typical palindromic phrases, the pat- |
| tern has to ignore all non-word characters, which can be done like |
| this: |
| |
| ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$ |
| |
| If run with the PCRE2_CASELESS option, this pattern matches phrases |
| such as "A man, a plan, a canal: Panama!". Note the use of the posses- |
| sive quantifier *+ to avoid backtracking into sequences of non-word |
| characters. Without this, PCRE2 takes a great deal longer (ten times or |
| more) to match typical phrases, and Perl takes so long that you think |
| it has gone into a loop. |
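| |
| Why the first palindrome pattern, ^((.)(?1)\2|.?)$, depends on |
| backtracking into the recursion can be seen in the following hand- |
| written simulation. It is a sketch, not PCRE2's algorithm: the invented |
| generator group1 yields, in order, every offset at which group 1 could |
| end, and the caller moves on to the next one when the rest of the |
| pattern fails. The newline behaviour of "." is ignored. |
| |
```python
def group1(s, pos):
    # First branch: (.)(?1)\2
    if pos < len(s):
        first = s[pos]
        for end in group1(s, pos + 1):       # each way (?1) can match
            if end < len(s) and s[end] == first:
                yield end + 1
    # Second branch: .?  (greedy: one character first, then none)
    if pos < len(s):
        yield pos + 1
    yield pos


def is_palindrome(s):
    # ^ and $: group 1 must start at offset 0 and end at len(s).
    return any(end == len(s) for end in group1(s, 0))


for word in ["abcba", "abba", "abca"]:
    print(word, is_palindrome(word))
```
| |
| If each recursive call were atomic, only the first offset produced by |
| group1 would ever be tried, which corresponds to the behaviour before |
| release 10.30. |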
| |
| Another way in which PCRE2 and Perl used to differ in their recursion |
| processing is in the handling of captured values. Formerly in Perl, |
| when a group was called recursively or as a subroutine (see the next |
| section), it had no access to any values that were captured outside the |
| recursion, whereas in PCRE2 these values can be referenced. Consider |
| this pattern: |
| |
| ^(.)(\1|a(?2)) |
| |
| This pattern matches "bab". The first capturing parentheses match "b", |
| then in the second group, when the backreference \1 fails to match "b", |
| the second alternative matches "a" and then recurses. In the recursion, |
| \1 does now match "b" and so the whole match succeeds. This match used |
| to fail in Perl, but in later versions (I tried 5.024) it now works. |
| |
| |
| GROUPS AS SUBROUTINES |
| |
| If the syntax for a recursive group call (either by number or by name) |
| is used outside the parentheses to which it refers, it operates a bit |
| like a subroutine in a programming language. More accurately, PCRE2 |
| treats the referenced group as an independent subpattern which it tries |
| to match at the current matching position. The called group may be de- |
| fined before or after the reference. A numbered reference can be abso- |
| lute or relative, as in these examples: |
| |
| (...(absolute)...)...(?2)... |
| (...(relative)...)...(?-1)... |
| (...(?+1)...(relative)... |
| |
| An earlier example pointed out that the pattern |
| |
| (sens|respons)e and \1ibility |
| |
| matches "sense and sensibility" and "response and responsibility", but |
| not "sense and responsibility". If instead the pattern |
| |
| (sens|respons)e and (?1)ibility |
| |
| is used, it does match "sense and responsibility" as well as the other |
| two strings. Another example is given in the discussion of DEFINE |
| above. |
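| |
| The difference between the backreference and the subroutine call can |
| be imitated in Python's re module, which supports \1 but not (?1). In |
| this sketch the subroutine call is approximated by repeating the |
| group's pattern text; STEM, BACKREF and SUBROUTINE_LIKE are invented |
| names. |
| |
```python
import re

STEM = r"(?:sens|respons)"

# Backreference: \1 must repeat the text that group 1 actually matched.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")

# Stand-in for (?1): the group's pattern is matched afresh.
SUBROUTINE_LIKE = re.compile(STEM + r"e and " + STEM + r"ibility")

subject = "sense and responsibility"
print(BACKREF.fullmatch(subject) is not None)
print(SUBROUTINE_LIKE.fullmatch(subject) is not None)
```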
| |
| Like recursions, subroutine calls used to be treated as atomic, but |
| this changed at PCRE2 release 10.30, so backtracking into subroutine |
| calls can now occur. However, any capturing parentheses that are set |
| during the subroutine call revert to their previous values afterwards. |
| |
| Processing options such as case-independence are fixed when a group is |
| defined, so if it is used as a subroutine, such options cannot be |
| changed for different calls. For example, consider this pattern: |
| |
| (abc)(?i:(?-1)) |
| |
| It matches "abcabc". It does not match "abcABC" because the change of |
| processing option does not affect the called group. |
| |
| The behaviour of backtracking control verbs in groups when called as |
| subroutines is described in the section entitled "Backtracking verbs in |
| subroutines" below. |
| |
| |
| ONIGURUMA SUBROUTINE SYNTAX |
| |
| For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
| name or a number enclosed either in angle brackets or single quotes, is |
| an alternative syntax for calling a group as a subroutine, possibly re- |
| cursively. Here are two of the examples used above, rewritten using |
| this syntax: |
| |
| (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
| (sens|respons)e and \g'1'ibility |
| |
| PCRE2 supports an extension to Oniguruma: if a number is preceded by a |
| plus or a minus sign it is taken as a relative reference. For example: |
| |
| (abc)(?i:\g<-1>) |
| |
| Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not |
| synonymous. The former is a backreference; the latter is a subroutine |
| call. |
| |
| |
| CALLOUTS |
| |
| Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
| Perl code to be obeyed in the middle of matching a regular expression. |
| This makes it possible, amongst other things, to extract different sub- |
| strings that match the same pair of parentheses when there is a repeti- |
| tion. |
| |
| PCRE2 provides a similar feature, but of course it cannot obey arbi- |
| trary Perl code. The feature is called "callout". The caller of PCRE2 |
| provides an external function by putting its entry point in a match |
| context using the function pcre2_set_callout(), and then passing that |
| context to pcre2_match() or pcre2_dfa_match(). If no match context is |
| passed, or if the callout entry point is set to NULL, callouts are dis- |
| abled. |
| |
| Within a regular expression, (?C<arg>) indicates a point at which the |
| external function is to be called. There are two kinds of callout: |
| those with a numerical argument and those with a string argument. (?C) |
| on its own with no argument is treated as (?C0). A numerical argument |
| allows the application to distinguish between different callouts. |
| String arguments were added for release 10.20 to make it possible for |
| script languages that use PCRE2 to embed short scripts within patterns |
| in a similar way to Perl. |
| |
| During matching, when PCRE2 reaches a callout point, the external func- |
| tion is called. It is provided with the number or string argument of |
| the callout, the position in the pattern, and one item of data that is |
| also set in the match block. The callout function may cause matching to |
| proceed, to backtrack, or to fail. |
| |
| By default, PCRE2 implements a number of optimizations at matching |
| time, and one side-effect is that sometimes callouts are skipped. If |
| you need all possible callouts to happen, you need to set options that |
| disable the relevant optimizations. More details, including a complete |
| description of the programming interface to the callout function, are |
| given in the pcre2callout documentation. |
| |
| Callouts with numerical arguments |
| |
| If you just want to have a means of identifying different callout |
| points, put a number less than 256 after the letter C. For example, |
| this pattern has two callout points: |
| |
| (?C1)abc(?C2)def |
| |
| If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical |
| callouts are automatically installed before each item in the pattern. |
| They are all numbered 255. If there is a conditional group in the pat- |
| tern whose condition is an assertion, an additional callout is inserted |
| just before the condition. An explicit callout may also be set at this |
| position, as in this example: |
| |
| (?(?C9)(?=a)abc|def) |
| |
| Note that this applies only to assertion conditions, not to other types |
| of condition. |
| |
| Callouts with string arguments |
| |
| A delimited string may be used instead of a number as a callout argu- |
| ment. The starting delimiter must be one of ` ' " ^ % # $ { and the |
| ending delimiter is the same as the start, except for {, where the end- |
| ing delimiter is }. If the ending delimiter is needed within the |
| string, it must be doubled. For example: |
| |
| (?C'ab ''c'' d')xyz(?C{any text})pqr |
| |
| The doubling is removed before the string is passed to the callout |
| function. |
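| |
| The delimiter rule can be sketched as a small reader function. This is |
| a hypothetical helper written for illustration, not PCRE2's own parser; |
| the names DELIMS and read_callout_string are invented. It takes text |
| beginning at the starting delimiter and returns the string with the |
| doubling removed. |
| |
```python
# Starting delimiter -> ending delimiter, as listed above.
DELIMS = {"`": "`", "'": "'", '"': '"', "^": "^",
          "%": "%", "#": "#", "$": "$", "{": "}"}


def read_callout_string(text):
    """Return (argument, offset just past the ending delimiter)."""
    if not text or text[0] not in DELIMS:
        raise ValueError("not a callout string delimiter")
    close = DELIMS[text[0]]
    out = []
    i = 1
    while i < len(text):
        if text[i] == close:
            if text[i + 1:i + 2] == close:   # doubled: one literal delimiter
                out.append(close)
                i += 2
                continue
            return "".join(out), i + 1       # single: end of the string
        out.append(text[i])
        i += 1
    raise ValueError("unterminated callout string")


print(read_callout_string("'ab ''c'' d')xyz")[0])
print(read_callout_string("{any text})pqr")[0])
```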
| |
| |
| BACKTRACKING CONTROL |
| |
| There are a number of special "Backtracking Control Verbs" (to use |
| Perl's terminology) that modify the behaviour of backtracking during |
| matching. They are generally of the form (*VERB) or (*VERB:NAME). Some |
| verbs take either form, and may behave differently depending on whether |
| or not a name argument is present. The names are not required to be |
| unique within the pattern. |
| |
| By default, for compatibility with Perl, a name is any sequence of |
| characters that does not include a closing parenthesis. The name is not |
| processed in any way, and it is not possible to include a closing |
| parenthesis in the name. This can be changed by setting the |
| PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati- |
| ble. |
| |
| When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to |
| verb names and only an unescaped closing parenthesis terminates the |
| name. However, the only backslash items that are permitted are \Q, \E, |
| and sequences such as \x{100} that define character code points. Char- |
| acter type escapes such as \d are faulted. |
| |
| A closing parenthesis can be included in a name either as \) or between |
| \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED |
| or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb |
| names is skipped, and #-comments are recognized, exactly as in the rest |
| of the pattern. PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect |
| verb names unless PCRE2_ALT_VERBNAMES is also set. |
| |
| The maximum length of a name is 255 in the 8-bit library and 65535 in |
| the 16-bit and 32-bit libraries. If the name is empty, that is, if the |
| closing parenthesis immediately follows the colon, the effect is as if |
| the colon were not there. Any number of these verbs may occur in a pat- |
| tern. Except for (*ACCEPT), they may not be quantified. |
| |
| Since these verbs are specifically related to backtracking, most of |
| them can be used only when the pattern is to be matched using the tra- |
| ditional matching function, because that uses a backtracking algorithm. |
| With the exception of (*FAIL), which behaves like a failing negative |
| assertion, the backtracking control verbs cause an error if encountered |
| by the DFA matching function. |
| |
| The behaviour of these verbs in repeated groups, assertions, and in |
| capture groups called as subroutines (whether or not recursively) is |
| documented below. |
| |
| Optimizations that affect backtracking verbs |
| |
| PCRE2 contains some optimizations that are used to speed up matching by |
| running some checks at the start of each match attempt. For example, it |
| may know the minimum length of matching subject, or that a particular |
| character must be present. When one of these optimizations bypasses the |
| running of a match, any included backtracking verbs will not, of |
| course, be processed. You can suppress the start-of-match optimizations |
| by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com- |
| pile(), or by starting the pattern with (*NO_START_OPT). There is more |
| discussion of this option in the section entitled "Compiling a pattern" |
| in the pcre2api documentation. |
| |
| Experiments with Perl suggest that it too has similar optimizations, |
| and that, as in PCRE2, turning them off can change the result of a |
| match. |
| |
| Verbs that act immediately |
| |
| The following verbs act as soon as they are encountered. |
| |
| (*ACCEPT) or (*ACCEPT:NAME) |
| |
| This verb causes the match to end successfully, skipping the remainder |
| of the pattern. However, when it is inside a capture group that is |
| called as a subroutine, only that group is ended successfully. Matching |
| then continues at the outer level. If (*ACCEPT) is triggered in a posi- |
| tive assertion, the assertion succeeds; in a negative assertion, the |
| assertion fails. |
| |
| If (*ACCEPT) is inside capturing parentheses, the data so far is cap- |
| tured. For example: |
| |
| A((?:A|B(*ACCEPT)|C)D) |
| |
| This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
| tured by the outer parentheses. |
| |
| (*ACCEPT) is the only backtracking verb that is allowed to be quanti- |
| fied because an ungreedy quantification with a minimum of zero acts |
| only when a backtrack happens. Consider, for example, |
| |
| (A(*ACCEPT)??B)C |
| |
| where A, B, and C may be complex expressions. After matching "A", the |
| matcher processes "BC"; if that fails, causing a backtrack, (*ACCEPT) |
| is triggered and the match succeeds. In both cases, all but C is cap- |
| tured. Whereas (*COMMIT) (see below) means "fail on backtrack", a re- |
| peated (*ACCEPT) of this type means "succeed on backtrack". |
| |
| Warning: (*ACCEPT) should not be used within a script run group, be- |
| cause it causes an immediate exit from the group, bypassing the script |
| run checking. |
| |
| (*FAIL) or (*FAIL:NAME) |
| |
| This verb causes a matching failure, forcing backtracking to occur. It |
| may be abbreviated to (*F). It is equivalent to (?!) but easier to |
| read. The Perl documentation notes that it is probably useful only when |
| combined with (?{}) or (??{}). Those are, of course, Perl features that |
| are not present in PCRE2. The nearest equivalent is the callout fea- |
| ture, as for example in this pattern: |
| |
| a+(?C)(*FAIL) |
| |
| A match with the string "aaaa" always fails, but the callout is taken |
| before each backtrack happens (in this example, 10 times). |
| |
| (*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*AC- |
| CEPT) and (*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is |
| recorded just before the verb acts. |
| |
| Recording which path was taken |
| |
| There is one verb whose main purpose is to track how a match was ar- |
| rived at, though it also has a secondary use in conjunction with ad- |
| vancing the match starting point (see (*SKIP) below). |
| |
| (*MARK:NAME) or (*:NAME) |
| |
| A name is always required with this verb. For all the other backtrack- |
| ing control verbs, a NAME argument is optional. |
| |
| When a match succeeds, the last-encountered mark name on |
| the matching path is passed back to the caller as described in the sec- |
| tion entitled "Other information about the match" in the pcre2api docu- |
| mentation. This applies to all instances of (*MARK) and other verbs, |
| including those inside assertions and atomic groups. However, there are |
| differences in those cases when (*MARK) is used in conjunction with |
| (*SKIP) as described below. |
| |
| The mark name that was last encountered on the matching path is passed |
| back. A verb without a NAME argument is ignored for this purpose. Here |
| is an example of pcre2test output, where the "mark" modifier requests |
| the retrieval and outputting of (*MARK) data: |
| |
| re> /X(*MARK:A)Y|X(*MARK:B)Z/mark |
| data> XY |
| 0: XY |
| MK: A |
| XZ |
| 0: XZ |
| MK: B |
| |
| The (*MARK) name is tagged with "MK:" in this output, and in this exam- |
| ple it indicates which of the two alternatives matched. This is a more |
| efficient way of obtaining this information than putting each alterna- |
| tive in its own capturing parentheses. |
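| |
| For comparison, the capture-group alternative mentioned above can be |
| sketched in Python's re module, which has no (*MARK). Each alternative |
| is given its own named group and the match object's lastgroup |
| attribute reports which one took part. WHICH and branch_taken are |
| invented names. |
| |
```python
import re

WHICH = re.compile(r"(?P<A>XY)|(?P<B>XZ)")


def branch_taken(subject):
    """Return the name of the alternative that matched, or None."""
    m = WHICH.search(subject)
    return m.lastgroup if m else None


for subject in ["XY", "XZ", "XP"]:
    print(subject, branch_taken(subject))
```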
| |
| If a verb with a name is encountered in a positive assertion that is |
| true, the name is recorded and passed back if it is the last-encoun- |
| tered. This does not happen for negative assertions or failing positive |
| assertions. |
| |
| After a partial match or a failed match, the last encountered name in |
| the entire match process is returned. For example: |
| |
| re> /X(*MARK:A)Y|X(*MARK:B)Z/mark |
| data> XP |
| No match, mark = B |
| |
| Note that in this unanchored example the mark is retained from the |
| match attempt that started at the letter "X" in the subject. Subsequent |
| match attempts starting at "P" and then with an empty string do not get |
| as far as the (*MARK) item, but nevertheless do not reset it. |
| |
| If you are interested in (*MARK) values after failed matches, you |
| should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to |
| ensure that the match is always attempted. |
| |
| Verbs that act after backtracking |
| |
| The following verbs do nothing when they are encountered. Matching con- |
| tinues with what follows, but if there is a subsequent match failure, |
| causing a backtrack to the verb, a failure is forced. That is, back- |
| tracking cannot pass to the left of the verb. However, when one of |
| these verbs appears inside an atomic group or in a lookaround assertion |
| that is true, its effect is confined to that group, because once the |
| group has been matched, there is never any backtracking into it. Back- |
| tracking from beyond an assertion or an atomic group ignores the entire |
| group, and seeks a preceding backtracking point. |
| |
| These verbs differ in exactly what kind of failure occurs when back- |
| tracking reaches them. The behaviour described below is what happens |
| when the verb is not in a subroutine or an assertion. Subsequent sec- |
| tions cover these special cases. |
| |
| (*COMMIT) or (*COMMIT:NAME) |
| |
| This verb causes the whole match to fail outright if there is a later |
| matching failure that causes backtracking to reach it. Even if the pat- |
| tern is unanchored, no further attempts to find a match by advancing |
| the starting point take place. If (*COMMIT) is the only backtracking |
| verb that is encountered, once it has been passed pcre2_match() is com- |
| mitted to finding a match at the current starting point, or not at all. |
| For example: |
| |
| a+(*COMMIT)b |
| |
| This matches "xxaab" but not "aacaab". It can be thought of as a kind |
| of dynamic anchor, or "I've started, so I must finish." |
| |
| The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM- |
| MIT). It is like (*MARK:NAME) in that the name is remembered for pass- |
| ing back to the caller. However, (*SKIP:NAME) searches only for names |
| that are set with (*MARK), ignoring those set by any of the other back- |
| tracking verbs. |
| |
| If there is more than one backtracking verb in a pattern, a different |
| one that follows (*COMMIT) may be triggered first, so merely passing |
| (*COMMIT) during a match does not always guarantee that a match must be |
| at this starting point. |
| |
| Note that (*COMMIT) at the start of a pattern is not the same as an an- |
| chor, unless PCRE2's start-of-match optimizations are turned off, as |
| shown in this output from pcre2test: |
| |
| re> /(*COMMIT)abc/ |
| data> xyzabc |
| 0: abc |
| data> |
| re> /(*COMMIT)abc/no_start_optimize |
| data> xyzabc |
| No match |
| |
| For the first pattern, PCRE2 knows that any match must start with "a", |
| so the optimization skips along the subject to "a" before applying the |
| pattern to the first set of data. The match attempt then succeeds. The |
| second pattern disables the optimization that skips along to the first |
| character. The pattern is now applied starting at "x", and so the |
| (*COMMIT) causes the match to fail without trying any other starting |
| points. |
| |
| (*PRUNE) or (*PRUNE:NAME) |
| |
| This verb causes the match to fail at the current starting position in |
| the subject if there is a later matching failure that causes backtrack- |
| ing to reach it. If the pattern is unanchored, the normal "bumpalong" |
| advance to the next starting character then happens. Backtracking can |
| occur as usual to the left of (*PRUNE), before it is reached, or when |
| matching to the right of (*PRUNE), but if there is no match to the |
| right, backtracking cannot cross (*PRUNE). In simple cases, the use of |
| (*PRUNE) is just an alternative to an atomic group or possessive quan- |
| tifier, but there are some uses of (*PRUNE) that cannot be expressed in |
| any other way. In an anchored pattern (*PRUNE) has the same effect as |
| (*COMMIT). |
| |
| The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). |
| It is like (*MARK:NAME) in that the name is remembered for passing back |
| to the caller. However, (*SKIP:NAME) searches only for names set with |
| (*MARK), ignoring those set by other backtracking verbs. |
| |
| (*SKIP) |
| |
| This verb, when given without a name, is like (*PRUNE), except that if |
| the pattern is unanchored, the "bumpalong" advance is not to the next |
| character, but to the position in the subject where (*SKIP) was encoun- |
| tered. (*SKIP) signifies that whatever text was matched leading up to |
| it cannot be part of a successful match if there is a later mismatch. |
| Consider: |
| |
| a+(*SKIP)b |
| |
| If the subject is "aaaac...", after the first match attempt fails |
| (starting at the first character in the string), the starting point |
| skips on to start the next attempt at "c". Note that a possessive quan- |
| tifier does not have the same effect as this example; although it would |
| suppress backtracking during the first match attempt, the second at- |
| tempt would start at the second character instead of skipping on to |
| "c". |
| |
| If (*SKIP) is used to specify a new starting position that is the same |
| as the starting position of the current match, or (by being inside a |
| lookbehind) earlier, the position specified by (*SKIP) is ignored, and |
| instead the normal "bumpalong" occurs. |
| |
| (*SKIP:NAME) |
| |
| When (*SKIP) has an associated name, its behaviour is modified. When |
| such a (*SKIP) is triggered, the previous path through the pattern is |
| searched for the most recent (*MARK) that has the same name. If one is |
| found, the "bumpalong" advance is to the subject position that corre- |
| sponds to that (*MARK) instead of to where (*SKIP) was encountered. If |
| no (*MARK) with a matching name is found, the (*SKIP) is ignored. |
| |
| The search for a (*MARK) name uses the normal backtracking mechanism, |
| which means that it does not see (*MARK) settings that are inside |
| atomic groups or assertions, because they are never re-entered by back- |
| tracking. Compare the following pcre2test examples: |
| |
| re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/ |
| data> abc |
| 0: a |
| 1: a |
| data> |
| re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/ |
| data> abc |
| 0: b |
| 1: b |
| |
| In the first example, the (*MARK) setting is in an atomic group, so it |
| is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored. |
| This allows the second branch of the pattern to be tried at the first |
| character position. In the second example, the (*MARK) setting is not |
| in an atomic group. This allows (*SKIP:X) to find the (*MARK) when it |
| backtracks, and this causes a new matching attempt to start at the sec- |
| ond character. This time, the (*MARK) is never seen because "a" does |
| not match "b", so the matcher immediately jumps to the second branch of |
| the pattern. |
| |
| Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It |
| ignores names that are set by other backtracking verbs. |
| |
| (*THEN) or (*THEN:NAME) |
| |
| This verb causes a skip to the next innermost alternative when back- |
| tracking reaches it. That is, it cancels any further backtracking |
| within the current alternative. Its name comes from the observation |
| that it can be used for a pattern-based if-then-else block: |
| |
| ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
| |
| If the COND1 pattern matches, FOO is tried (and possibly further items |
| after the end of the group if FOO succeeds); on failure, the matcher |
| skips to the second alternative and tries COND2, without backtracking |
| into COND1. If that succeeds and BAR fails, COND3 is tried. If subse- |
| quently BAZ fails, there are no more alternatives, so there is a back- |
| track to whatever came before the entire group. If (*THEN) is not in- |
| side an alternation, it acts like (*PRUNE). |
| |
| The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). |
| It is like (*MARK:NAME) in that the name is remembered for passing back |
| to the caller. However, (*SKIP:NAME) searches only for names set with |
| (*MARK), ignoring those set by other backtracking verbs. |
| |
| A group that does not contain a | character is just a part of the en- |
| closing alternative; it is not a nested alternation with only one al- |
| ternative. The effect of (*THEN) extends beyond such a group to the en- |
| closing alternative. Consider this pattern, where A, B, etc. are com- |
| plex pattern fragments that do not contain any | characters at this |
| level: |
| |
| A (B(*THEN)C) | D |
| |
| If A and B are matched, but there is a failure in C, matching does not |
| backtrack into A; instead it moves to the next alternative, that is, D. |
| However, if the group containing (*THEN) is given an alternative, it |
| behaves differently: |
| |
| A (B(*THEN)C | (*FAIL)) | D |
| |
| The effect of (*THEN) is now confined to the inner group. After a fail- |
| ure in C, matching moves to (*FAIL), which causes the whole group to |
| fail because there are no more alternatives to try. In this case, |
| matching does backtrack into A. |
| |
| Note that a conditional group is not considered as having two alterna- |
| tives, because only one is ever used. In other words, the | character |
| in a conditional group has a different meaning. Ignoring white space, |
| consider: |
| |
| ^.*? (?(?=a) a | b(*THEN)c ) |
| |
| If the subject is "ba", this pattern does not match. Because .*? is un- |
| greedy, it initially matches zero characters. The condition (?=a) then |
| fails, the character "b" is matched, but "c" is not. At this point, |
| matching does not backtrack to .*? as might perhaps be expected from |
| the presence of the | character. The conditional group is part of the |
| single alternative that comprises the whole pattern, and so the match |
| fails. (If there was a backtrack into .*?, allowing it to match "b", |
| the match would succeed.) |
| |
| The verbs just described provide four different "strengths" of control |
| when subsequent matching fails. (*THEN) is the weakest, carrying on the |
| match at the next alternative. (*PRUNE) comes next, failing the match |
| at the current starting position, but allowing an advance to the next |
| character (for an unanchored pattern). (*SKIP) is similar, except that |
| the advance may be more than one character. (*COMMIT) is the strongest, |
| causing the entire match to fail. |
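| |
| As an illustration only, the following Python sketch models how the |
| starting point moves for a pattern of the shape a+(VERB)b. It is not |
| PCRE2 code: the function start_points() and its arguments are invented |
| for this example, and the model assumes that no start-of-match |
| optimizations are applied. (*THEN) is left out because, with no |
| alternation, it would act like (*PRUNE). |
| |
```python
def start_points(subject, verb):
    """Model the pattern a+(VERB)b for verb in {"PRUNE", "SKIP", "COMMIT"}.

    Returns (span, tried): span is the (start, end) of the match, or None,
    and tried lists every starting offset at which an attempt was made.
    This is a hand-written model, not a call into PCRE2.
    """
    if verb not in ("PRUNE", "SKIP", "COMMIT"):
        raise ValueError("unknown verb: " + verb)
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1                  # greedy a+
        if end == start:
            start += 1                # verb never reached: normal bumpalong
            continue
        if end < len(subject) and subject[end] == "b":
            return (start, end + 1), tried
        # "b" failed, so backtracking reaches the verb and triggers it.
        if verb == "COMMIT":
            return None, tried        # no further starting points
        if verb == "SKIP":
            start = end               # resume where (*SKIP) was passed
        else:
            start += 1                # (*PRUNE): advance one character
    return None, tried


if __name__ == "__main__":
    for verb in ("PRUNE", "SKIP", "COMMIT"):
        print(verb, start_points("aaaac", verb))
```
| |
| For the subject "aaaac", the SKIP model attempts offsets 0, 4, and 5, |
| the PRUNE model attempts every offset from 0 to 5, and the COMMIT |
| model gives up after offset 0. |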
| |
| More than one backtracking verb |
| |
| If more than one backtracking verb is present in a pattern, the one |
| that is backtracked onto first acts. For example, consider this pat- |
| tern, where A, B, etc. are complex pattern fragments: |
| |
| (A(*COMMIT)B(*THEN)C|ABD) |
| |
| If A matches but B fails, the backtrack to (*COMMIT) causes the entire |
| match to fail. However, if A and B match, but C fails, the backtrack to |
| (*THEN) causes the next alternative (ABD) to be tried. This behaviour |
| is consistent, but is not always the same as Perl's. It means that if |
| two or more backtracking verbs appear in succession, all but the last |
| of them have no effect. Consider this example: |
| |
| ...(*COMMIT)(*PRUNE)... |
| |
| If there is a matching failure to the right, backtracking onto (*PRUNE) |
| causes it to be triggered, and its action is taken. There can never be |
| a backtrack onto (*COMMIT). |
| |
| Backtracking verbs in repeated groups |
| |
| PCRE2 sometimes differs from Perl in its handling of backtracking verbs |
| in repeated groups. For example, consider: |
| |
| /(a(*COMMIT)b)+ac/ |
| |
| If the subject is "abac", Perl matches unless its optimizations are |
| disabled, but PCRE2 always fails because the (*COMMIT) in the second |
| repeat of the group acts. |
| |
| Backtracking verbs in assertions |
| |
| (*FAIL) in any assertion has its normal effect: it forces an immediate |
| backtrack. The behaviour of the other backtracking verbs depends on |
| whether or not the assertion is standalone or acting as the condition |
| in a conditional group. |
| |
| (*ACCEPT) in a standalone positive assertion causes the assertion to |
| succeed without any further processing; captured strings and a mark |
| name (if set) are retained. In a standalone negative assertion, (*AC- |
| CEPT) causes the assertion to fail without any further processing; cap- |
| tured substrings and any mark name are discarded. |
| |
| If the assertion is a condition, (*ACCEPT) causes the condition to be |
| true for a positive assertion and false for a negative one; captured |
| substrings are retained in both cases. |
| |
| The remaining verbs act only when a later failure causes a backtrack to |
| reach them. This means that, for the Perl-compatible assertions, their |
| effect is confined to the assertion, because Perl lookaround assertions |
| are atomic. A backtrack that occurs after such an assertion is complete |
| does not jump back into the assertion. Note in particular that a |
| (*MARK) name that is set in an assertion is not "seen" by an instance |
| of (*SKIP:NAME) later in the pattern. |
| |
| PCRE2 now supports non-atomic positive assertions, as described in the |
| section entitled "Non-atomic assertions" above. These assertions must |
| be standalone (not used as conditions). They are not Perl-compatible. |
| For these assertions, a later backtrack does jump back into the asser- |
| tion, and therefore verbs such as (*COMMIT) can be triggered by back- |
| tracks from later in the pattern. |
| |
| The effect of (*THEN) is not allowed to escape beyond an assertion. If |
| there are no more branches to try, (*THEN) causes a positive assertion |
| to be false, and a negative assertion to be true. |
| |
| The other backtracking verbs are not treated specially if they appear |
| in a standalone positive assertion. In a conditional positive asser- |
| tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP), |
| or (*PRUNE) causes the condition to be false. However, for both stand- |
| alone and conditional negative assertions, backtracking into (*COMMIT), |
| (*SKIP), or (*PRUNE) causes the assertion to be true, without consider- |
| ing any further alternative branches. |
| |
| Backtracking verbs in subroutines |
| |
| These behaviours occur whether or not the group is called recursively. |
| |
| (*ACCEPT) in a group called as a subroutine causes the subroutine match |
| to succeed without any further processing. Matching then continues af- |
| ter the subroutine call. Perl documents this behaviour. Perl's treat- |
| ment of the other verbs in subroutines is different in some cases. |
| |
| (*FAIL) in a group called as a subroutine has its normal effect: it |
| forces an immediate backtrack. |
| |
| (*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail |
| when triggered by being backtracked to in a group called as a subrou- |
| tine. There is then a backtrack at the outer level. |
| |
| (*THEN), when triggered, skips to the next alternative in the innermost |
| enclosing group that has alternatives (its normal behaviour). However, |
| if there is no such group within the subroutine's group, the subroutine |
| match fails and there is a backtrack at the outer level. |
| |
| |
| SEE ALSO |
| |
| pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3), |
| pcre2(3). |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| Retired from University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 12 January 2022 |
| Copyright (c) 1997-2022 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2PERFORM(3) Library Functions Manual PCRE2PERFORM(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| PCRE2 PERFORMANCE |
| |
| Two aspects of performance are discussed below: memory usage and pro- |
| cessing time. The way you express your pattern as a regular expression |
| can affect both of them. |
| |
| |
| COMPILED PATTERN MEMORY USAGE |
| |
| Patterns are compiled by PCRE2 into a reasonably efficient interpretive |
| code, so that most simple patterns do not use much memory for storing |
| the compiled version. However, there is one case where the memory usage |
| of a compiled pattern can be unexpectedly large. If a parenthesized |
| group has a quantifier with a minimum greater than 1 and/or a limited |
| maximum, the whole group is repeated in the compiled code. For example, |
| the pattern |
| |
| (abc|def){2,4} |
| |
| is compiled as if it were |
| |
| (abc|def)(abc|def)((abc|def)(abc|def)?)? |
| |
| (Technical aside: It is done this way so that backtrack points within |
| each of the repetitions can be independently maintained.) |
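| |
| The "as if" form shown above can be generated mechanically. The Python |
| sketch below is only an illustration: expand_quantifier() is a name |
| invented here, and it produces a pattern string in the notation used |
| above, not the internal compiled code that PCRE2 actually builds. |
| |
```python
def expand_quantifier(group, minimum, maximum):
    """Return the 'as if' expansion of group{minimum,maximum}.

    The group is copied `minimum` times, followed by nested optional
    copies for the remaining (maximum - minimum) repetitions.
    """
    if not 0 <= minimum <= maximum:
        raise ValueError("need 0 <= minimum <= maximum")
    tail = ""
    for i in range(maximum - minimum):
        # Build the optional part from the innermost copy outwards.
        if i == 0:
            tail = group + "?"
        else:
            tail = "(" + group + tail + ")?"
    return group * minimum + tail


if __name__ == "__main__":
    print(expand_quantifier("(abc|def)", 2, 4))
```
| |
| Applying the same function to (ab){1,1000} yields 1000 copies of the |
| group, which suggests why large or nested counts make the compiled |
| pattern grow so quickly. |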
| |
| For regular expressions whose quantifiers use only small numbers, this |
| is not usually a problem. However, if the numbers are large, and par- |
| ticularly if such repetitions are nested, the memory usage can become |
| an embarrassment. For example, the very simple pattern |
| |
| ((ab){1,1000}c){1,3} |
| |
| uses over 50KiB when compiled using the 8-bit library. When PCRE2 is |
| compiled with its default internal pointer size of two bytes, the size |
| limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit |
| libraries, and this is reached with the above pattern if the outer rep- |
| etition is increased from 3 to 4. PCRE2 can be compiled to use larger |
| internal pointers and thus handle larger compiled patterns, but it is |
| better to try to rewrite your pattern to use less memory if you can. |
| |
| One way of reducing the memory usage for such patterns is to make use |
| of PCRE2's "subroutine" facility. Re-writing the above pattern as |
| |
| ((ab)(?2){0,999}c)(?1){0,2} |
| |
| reduces the memory requirements to around 16KiB, and indeed it remains |
| under 20KiB even with the outer repetition increased to 100. However, |
| this kind of pattern is not always exactly equivalent, because any cap- |
| tures within subroutine calls are lost when the subroutine completes. |
| If this is not a problem, this kind of rewriting will allow you to |
| process patterns that PCRE2 cannot otherwise handle. The matching per- |
| formance of the two different versions of the pattern is roughly the |
| same. (This applies from release 10.30 - things were different in ear- |
| lier releases.) |
| |
| |
| STACK AND HEAP USAGE AT RUN TIME |
| |
| From release 10.30, the interpretive (non-JIT) version of pcre2_match() |
| uses very little system stack at run time. In earlier releases recur- |
| sive function calls could use a great deal of stack, and this could |
| cause problems, but this usage has been eliminated. Backtracking posi- |
| tions are now explicitly remembered in memory frames controlled by the |
| code. |
| |
| The size of each frame depends on the size of pointer variables and the |
| number of capturing parenthesized groups in the pattern being matched. |
| On a 64-bit system the frame size for a pattern with no captures is 128 |
| bytes. For each capturing group the size increases by 16 bytes. |
| |
| Until release 10.41, an initial 20KiB frames vector was allocated on |
| the system stack, but this still caused some issues for multi-thread |
| applications where each thread has a very small stack. From release |
| 10.41 backtracking memory frames are always held in heap memory. An |
| initial heap allocation is obtained the first time any match data block |
| is passed to pcre2_match(). This is remembered with the match data |
| block and re-used if that block is used for another match. It is freed |
| when the match data block itself is freed. |
| |
| The size of the initial block is the larger of 20KiB or ten times the |
| pattern's frame size, unless the heap limit is less than this, in which |
| case the heap limit is used. If the initial block proves to be too |
| small during matching, it is replaced by a larger block, subject to the |
| heap limit. The heap limit is checked only when a new block is to be |
| allocated. Reducing the heap limit between calls to pcre2_match() with |
| the same match data block does not affect the saved block. |
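| |
| The arithmetic in the preceding paragraphs can be sketched in Python as |
| follows. The function names are invented for this illustration, the |
| figures are the 64-bit ones quoted above, and the heap limit is taken |
| in bytes purely for simplicity (an assumption of this sketch, not a |
| statement about the units used by the library). |
| |
```python
KIB = 1024


def frame_size_64bit(capture_groups):
    """Frame size in bytes on a 64-bit system: 128 plus 16 per capture."""
    if capture_groups < 0:
        raise ValueError("capture_groups must not be negative")
    return 128 + 16 * capture_groups


def initial_block_size(frame_size, heap_limit=None):
    """Size in bytes of the initial heap block for backtracking frames.

    heap_limit is a hypothetical byte count; None means no effective limit.
    """
    size = max(20 * KIB, 10 * frame_size)
    if heap_limit is not None and heap_limit < size:
        size = heap_limit             # the limit caps the initial block
    return size


if __name__ == "__main__":
    for groups in (0, 10, 200):
        frame = frame_size_64bit(groups)
        print(groups, frame, initial_block_size(frame))
```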
| |
| In contrast to pcre2_match(), pcre2_dfa_match() does use recursive |
| function calls, but only for processing atomic groups, lookaround as- |
| sertions, and recursion within the pattern. The original version of the |
| code used to allocate quite large internal workspace vectors on the |
| stack, which caused some problems for some patterns in environments |
| with small stacks. From release 10.32 the code for pcre2_dfa_match() |
| has been re-factored to use heap memory when necessary for internal |
| workspace when recursing, though recursive function calls are still |
| used. |
| |
| The "match depth" parameter can be used to limit the depth of function |
| recursion, and the "match heap" parameter to limit heap memory in |
| pcre2_dfa_match(). |
| |
| |
| PROCESSING TIME |
| |
| Certain items in regular expression patterns are processed more effi- |
| ciently than others. It is more efficient to use a character class like |
| [aeiou] than a set of single-character alternatives such as |
| (a|e|i|o|u). In general, the simplest construction that provides the |
| required behaviour is usually the most efficient. Jeffrey Friedl's book |
| contains a lot of useful general discussion about optimizing regular |
| expressions for efficient performance. This document contains a few ob- |
| servations about PCRE2. |
| |
| Using Unicode character properties (the \p, \P, and \X escapes) is |
| slow, because PCRE2 has to use a multi-stage table lookup whenever it |
| needs a character's property. If you can find an alternative pattern |
| that does not use character properties, it will probably be faster. |
| |
| By default, the escape sequences \b, \d, \s, and \w, and the POSIX |
| character classes such as [:alpha:] do not use Unicode properties, |
| partly for backwards compatibility, and partly for performance reasons. |
| However, you can set the PCRE2_UCP option or start the pattern with |
| (*UCP) if you want Unicode character properties to be used. This can |
| double the matching time for items such as \d, when matched with |
| pcre2_match(); the performance loss is less with a DFA matching func- |
| tion, and in both cases there is not much difference for \b. |
| |
| When a pattern begins with .* not in atomic parentheses, nor in paren- |
| theses that are the subject of a backreference, and the PCRE2_DOTALL |
| option is set, the pattern is implicitly anchored by PCRE2, since it |
| can match only at the start of a subject string. If the pattern has |
| multiple top-level branches, they must all be anchorable. The optimiza- |
| tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is au- |
| tomatically disabled if the pattern contains (*PRUNE) or (*SKIP). |
| |
| If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, be- |
| cause the dot metacharacter does not then match a newline, and if the |
| subject string contains newlines, the pattern may match from the char- |
| acter immediately following one of them instead of from the very start. |
| For example, the pattern |
| |
| .*second |
| |
| matches the subject "first\nand second" (where \n stands for a newline |
| character), with the match starting at the seventh character. In order |
| to do this, PCRE2 has to retry the match starting after every newline |
| in the subject. |
| |
| If you are using such a pattern with subject strings that do not con- |
| tain newlines, the best performance is obtained by setting |
| PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate ex- |
| plicit anchoring. That saves PCRE2 from having to scan along the sub- |
| ject looking for a newline to restart at. |
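| |
| The ".*second" example can be reproduced with Python's re module, whose |
| dot metacharacter treats newlines in the same way for this case. This |
| is a different regular expression engine, so the sketch shows only |
| where the match starts, not PCRE2's implicit anchoring; match_start() |
| is a helper invented for the illustration. |
| |
```python
import re

SUBJECT = "first\nand second"


def match_start(flags=0):
    """Return the offset at which .*second matches SUBJECT, or None."""
    found = re.search(r".*second", SUBJECT, flags)
    return found.start() if found else None


if __name__ == "__main__":
    print(match_start())            # dot does not match the newline
    print(match_start(re.DOTALL))   # dot matches any character
```
| |
| Without the DOTALL flag the match starts at offset 6, which is the |
| seventh character; with it, the match starts at offset 0. |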
| |
| Beware of patterns that contain nested indefinite repeats. These can |
| take a long time to run when applied to a string that does not match. |
| Consider the pattern fragment |
| |
| ^(a+)* |
| |
| This can match "aaaa" in 16 different ways, and this number increases |
| very rapidly as the string gets longer. (The * repeat can match 0, 1, |
| 2, 3, or 4 times, and for each of those cases other than 0 or 4, the + |
| repeats can match different numbers of times.) When the remainder of |
| the pattern is such that the entire match is going to fail, PCRE2 has |
| in principle to try every possible variation, and this can take an ex- |
| tremely long time, even for relatively short strings. |
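| |
| The figure of 16 can be checked by counting, as in this Python sketch. |
| The function ways() is invented for the illustration; it counts the |
| distinct ways in which ^(a+)* can consume some prefix of a run of "a" |
| characters, and does not call any regular expression engine. |
| |
```python
from functools import lru_cache


@lru_cache(maxsize=None)
def ways(remaining):
    """Count the ways ^(a+)* can proceed with `remaining` "a"s left."""
    total = 1                         # the * repeat stops here
    for taken in range(1, remaining + 1):
        # One more iteration of the group, with a+ taking `taken` characters.
        total += ways(remaining - taken)
    return total


if __name__ == "__main__":
    for length in (4, 10, 20):
        print(length, ways(length))
```
| |
| The count doubles with each additional character, so a subject of 20 |
| characters already gives more than a million possibilities. |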
| |
| An optimization catches some of the simpler cases such as |
| |
| (a+)*b |
| |
| where a literal character follows. Before embarking on the standard |
| matching procedure, PCRE2 checks that there is a "b" later in the sub- |
| ject string, and if there is not, it fails the match immediately. How- |
| ever, when there is no following literal this optimization cannot be |
| used. You can see the difference by comparing the behaviour of |
| |
| (a+)*\d |
| |
| with the pattern above. The pattern that ends with the literal "b" |
| fails almost instantly when applied to a whole line of "a" characters, |
| whereas the one that ends with \d takes an appreciable time with |
| strings longer than about 20 characters. |
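| |
| The idea behind the literal check can be sketched in Python. PCRE2 |
| works out the required character for itself; in this illustration the |
| caller supplies it by hand, the helper name is invented, and the |
| matching is done by Python's re module rather than by PCRE2. |
| |
```python
import re


def search_with_precheck(pattern, required_literal, subject):
    """Search only if a literal that every match must contain is present."""
    if required_literal not in subject:
        return None                   # fail at once, no backtracking at all
    return re.search(pattern, subject)


if __name__ == "__main__":
    print(search_with_precheck(r"(a+)*b", "b", "a" * 64))
    print(search_with_precheck(r"(a+)*b", "b", "aaab"))
```
| |
| Without such a check, a backtracking matcher may have to explore a very |
| large number of possibilities before it can report the failure. |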
| |
| In many cases, the solution to this kind of performance issue is to use |
| an atomic group or a possessive quantifier. This can often reduce mem- |
| ory requirements as well. As another example, consider this pattern: |
| |
| ([^<]|<(?!inet))+ |
| |
| It matches from wherever it starts until it encounters "<inet" or the |
| end of the data, and is the kind of pattern that might be used when |
| processing an XML file. Each iteration of the outer parentheses matches |
| either one character that is not "<" or a "<" that is not followed by |
| "inet". However, each time a parenthesis is processed, a backtracking |
| position is passed, so this formulation uses a memory frame for each |
| matched character. For a long string, a lot of memory is required. Con- |
| sider now this rewritten pattern, which matches exactly the same |
| strings: |
| |
| ([^<]++|<(?!inet))+ |
| |
| This runs much faster, because sequences of characters that do not con- |
| tain "<" are "swallowed" in one item inside the parentheses, and a pos- |
| sessive quantifier is used to stop any backtracking into the runs of |
| non-"<" characters. This version also uses a lot less memory because |
| entry to a new set of parentheses happens only when a "<" character |
| that is not followed by "inet" is encountered (and we assume this is |
| relatively rare). |
| |
| This example shows that one way of optimizing performance when matching |
| long subject strings is to write repeated parenthesized subpatterns to |
| match more than one character whenever possible. |
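| |
| To give a rough feel for the difference, this Python sketch counts how |
| many times the outer group is entered by each formulation. It is a |
| hand-written model of the two patterns, not PCRE2 itself, and the |
| function name and its arguments are invented for the illustration. |
| |
```python
def group_iterations(subject, swallow_runs):
    """Model ([^<]|<(?!inet))+ or, if swallow_runs, ([^<]++|<(?!inet))+.

    Returns (iterations, end): the number of times the outer group is
    entered and the offset at which matching stops.
    """
    pos = 0
    iterations = 0
    while pos < len(subject):
        if subject[pos] != "<":
            pos += 1
            if swallow_runs:
                # The possessive [^<]++ takes the whole run in one go.
                while pos < len(subject) and subject[pos] != "<":
                    pos += 1
        elif subject.startswith("inet", pos + 1):
            break                     # "<inet" ends the match
        else:
            pos += 1                  # a "<" that is not followed by "inet"
        iterations += 1
    return iterations, pos


if __name__ == "__main__":
    text = "abc<b>def<inet>"
    print(group_iterations(text, False))
    print(group_iterations(text, True))
```
| |
| Both models stop at the same offset, but the second enters the group |
| far less often when the subject contains long runs without "<". |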
| |
| SETTING RESOURCE LIMITS |
| |
| You can set limits on the amount of processing that takes place when |
| matching, and on the amount of heap memory that is used. The default |
| values of the limits are very large, and unlikely ever to operate. They |
| can be changed when PCRE2 is built, and they can also be set when |
| pcre2_match() or pcre2_dfa_match() is called. For details of these in- |
| terfaces, see the pcre2build documentation and the section entitled |
| "The match context" in the pcre2api documentation. |
| |
| The pcre2test test program has a modifier called "find_limits" which, |
| if applied to a subject line, causes it to find the smallest limits |
| that allow a pattern to match. This is done by repeatedly matching with |
| different limits. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| Retired from University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 27 July 2022 |
| Copyright (c) 1997-2022 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2POSIX(3) Library Functions Manual PCRE2POSIX(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| SYNOPSIS |
| |
| #include <pcre2posix.h> |
| |
| int pcre2_regcomp(regex_t *preg, const char *pattern, |
| int cflags); |
| |
| int pcre2_regexec(const regex_t *preg, const char *string, |
| size_t nmatch, regmatch_t pmatch[], int eflags); |
| |
| size_t pcre2_regerror(int errcode, const regex_t *preg, |
| char *errbuf, size_t errbuf_size); |
| |
| void pcre2_regfree(regex_t *preg); |
| |
| |
| DESCRIPTION |
| |
| This set of functions provides a POSIX-style API for the PCRE2 regular |
| expression 8-bit library. There are no POSIX-style wrappers for PCRE2's |
| 16-bit and 32-bit libraries. See the pcre2api documentation for a de- |
| scription of PCRE2's native API, which contains much additional func- |
| tionality. |
| |
| The functions described here are wrapper functions that ultimately call |
| the PCRE2 native API. Their prototypes are defined in the pcre2posix.h |
| header file, and they all have unique names starting with pcre2_. How- |
| ever, the pcre2posix.h header also contains macro definitions that con- |
| vert the standard POSIX names such as regcomp() into pcre2_regcomp() |
| etc. This means that a program can use the usual POSIX names without |
| running the risk of accidentally linking with POSIX functions from a |
| different library. |
| |
| On Unix-like systems the PCRE2 POSIX library is called libpcre2-posix, |
| so can be accessed by adding -lpcre2-posix to the command for linking |
| an application. Because the POSIX functions call the native ones, it is |
| also necessary to add -lpcre2-8. |
| |
| Although they were not defined as prototypes in pcre2posix.h, releases |
| 10.33 to 10.36 of the library contained functions with the POSIX names |
| regcomp() etc. These simply passed their arguments to the PCRE2 func- |
| tions. These functions were provided for backwards compatibility with |
| earlier versions of PCRE2, which had only POSIX names. However, this |
| has proved troublesome in situations where a program links with several |
| libraries, some of which use PCRE2's POSIX interface while others use |
| the real POSIX functions. For this reason, the POSIX names have been |
| removed since release 10.37. |
| |
| Calling the header file pcre2posix.h avoids any conflict with other |
| POSIX libraries. It can, of course, be renamed or aliased as regex.h, |
| which is the "correct" name, if there is no clash. It provides two |
| structure types, regex_t for compiled internal forms, and regmatch_t |
| for returning captured substrings. It also defines some constants whose |
| names start with "REG_"; these are used for setting options and identi- |
| fying error codes. |
| |
| |
| USING THE POSIX FUNCTIONS |
| |
| Those POSIX option bits that can reasonably be mapped to PCRE2 native |
| options have been implemented. In addition, the option REG_EXTENDED is |
| defined with the value zero. This has no effect, but since programs |
| that are written to the POSIX interface often use it, this makes it |
| easier to slot in PCRE2 as a replacement library. Other POSIX options |
| are not even defined. |
| |
| There are also some options that are not defined by POSIX. These have |
| been added at the request of users who want to make use of certain |
| PCRE2-specific features via the POSIX calling interface or to add BSD |
| or GNU functionality. |
| |
| When PCRE2 is called via these functions, it is only the API that is |
| POSIX-like in style. The syntax and semantics of the regular expres- |
| sions themselves are still those of Perl, subject to the setting of |
| various PCRE2 options, as described below. "POSIX-like in style" means |
| that the API approximates to the POSIX definition; it is not fully |
| POSIX-compatible, and in multi-unit encoding domains it is probably |
| even less compatible. |
| |
| The descriptions below use the actual names of the functions, but, as |
| described above, the standard POSIX names (without the pcre2_ prefix) |
| may also be used. |
| |
| |
| COMPILING A PATTERN |
| |
| The function pcre2_regcomp() is called to compile a pattern into an in- |
| ternal form. By default, the pattern is a C string terminated by a bi- |
| nary zero (but see REG_PEND below). The preg argument is a pointer to a |
| regex_t structure that is used as a base for storing information about |
| the compiled regular expression. (It is also used for input when |
| REG_PEND is set.) |
| |
| The argument cflags is either zero, or contains one or more of the bits |
| defined by the following macros: |
| |
| REG_DOTALL |
| |
| The PCRE2_DOTALL option is set when the regular expression is passed |
| for compilation to the native function. Note that REG_DOTALL is not |
| part of the POSIX standard. |
| |
| REG_ICASE |
| |
| The PCRE2_CASELESS option is set when the regular expression is passed |
| for compilation to the native function. |
| |
| REG_NEWLINE |
| |
| The PCRE2_MULTILINE option is set when the regular expression is passed |
| for compilation to the native function. Note that this does not mimic |
| the defined POSIX behaviour for REG_NEWLINE (see the following sec- |
| tion). |
| |
| REG_NOSPEC |
| |
| The PCRE2_LITERAL option is set when the regular expression is passed |
| for compilation to the native function. This disables all meta charac- |
| ters in the pattern, causing it to be treated as a literal string. The |
| only other options that are allowed with REG_NOSPEC are REG_ICASE, |
| REG_NOSUB, REG_PEND, and REG_UTF. Note that REG_NOSPEC is not part of |
| the POSIX standard. |
| |
| REG_NOSUB |
| |
| When a pattern that is compiled with this flag is passed to |
| pcre2_regexec() for matching, the nmatch and pmatch arguments are ig- |
| nored, and no captured strings are returned. Versions of the PCRE2 li- |
| brary prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile op- |
| tion, but this no longer happens because it disables the use of back- |
| references. |
| |
| REG_PEND |
| |
| If this option is set, the re_endp field in the preg structure (which |
| has the type const char *) must be set to point to the character beyond |
| the end of the pattern before calling pcre2_regcomp(). The pattern it- |
| self may now contain binary zeros, which are treated as data charac- |
| ters. Without REG_PEND, a binary zero terminates the pattern and the |
| re_endp field is ignored. This is a GNU extension to the POSIX standard |
| and should be used with caution in software intended to be portable to |
| other systems. |
| |
| REG_UCP |
| |
| The PCRE2_UCP option is set when the regular expression is passed for |
| compilation to the native function. This causes PCRE2 to use Unicode |
| properties when matching \d, \w, etc., instead of just recognizing |
| ASCII values. Note that REG_UCP is not part of the POSIX standard. |
| |
| REG_UNGREEDY |
| |
| The PCRE2_UNGREEDY option is set when the regular expression is passed |
| for compilation to the native function. Note that REG_UNGREEDY is not |
| part of the POSIX standard. |
| |
| REG_UTF |
| |
| The PCRE2_UTF option is set when the regular expression is passed for |
| compilation to the native function. This causes the pattern itself and |
| all data strings used for matching it to be treated as UTF-8 strings. |
| Note that REG_UTF is not part of the POSIX standard. |
| |
| In the absence of these flags, no options are passed to the native |
| function. This means that the regex is compiled with PCRE2 default se- |
| mantics. In particular, the way it handles newline characters in the |
| subject string is the Perl way, not the POSIX way. Note that setting |
| PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE. |
| It does not affect the way newlines are matched by the dot metacharac- |
| ter (they are not) or by a negative class such as [^a] (they are). |
| |
| The yield of pcre2_regcomp() is zero on success, and non-zero other- |
| wise. The preg structure is filled in on success, and one other member |
| of the structure (as well as re_endp) is public: re_nsub contains the |
| number of capturing subpatterns in the regular expression. Various er- |
| ror codes are defined in the header file. |
| |
| NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt |
| to use the contents of the preg structure. If, for example, you pass it |
| to pcre2_regexec(), the result is undefined and your program is likely |
| to crash. |
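| |
| As a minimal sketch of the checks described above, the helper below |
| compiles a pattern caselessly, hands back the error code on failure, |
| and reads re_nsub only on success. The name compile_caseless is made |
| up for this illustration. The sketch uses the standard POSIX names and |
| includes <regex.h>, so it also builds against a system regex library; |
| to use PCRE2, include pcre2posix.h instead and link with PCRE2's POSIX |
| wrapper library. |
| |
```c
#include <assert.h>
#include <regex.h>   /* assumption: swap for <pcre2posix.h> to use PCRE2 */
#include <stddef.h>

/* Hypothetical helper (not part of any library).
 * Compiles `pattern` with REG_ICASE and stores the number of capturing
 * subpatterns in *ngroups.
 * Returns 0 on success, or the non-zero regcomp() error code. After a
 * non-zero return, *re must not be passed to regexec(). */
int compile_caseless(regex_t *re, const char *pattern, size_t *ngroups)
{
    assert(re != NULL && pattern != NULL && ngroups != NULL);

    /* REG_EXTENDED selects extended syntax in a system regex library;
     * the PCRE2 header defines it only for source compatibility. */
    int rc = regcomp(re, pattern, REG_EXTENDED | REG_ICASE);
    if (rc != 0)
        return rc;              /* leave *re and *ngroups untouched */

    *ngroups = re->re_nsub;     /* public member, valid only on success */
    return 0;
}
```
| |
| A caller might write compile_caseless(&re, "(c)(at)", &n), check for a |
| zero return, and only then pass re to regexec() and, when finished, to |
| regfree(). |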
| |
| |
| MATCHING NEWLINE CHARACTERS |
| |
| This area is not simple, because POSIX and Perl take different views of |
| things. It is not possible to get PCRE2 to obey POSIX semantics, but |
| then PCRE2 was never intended to be a POSIX engine. The following table |
| lists the different possibilities for matching newline characters in |
| Perl and PCRE2: |
| |
|                           Default   Change with |
| |
| . matches newline          no        PCRE2_DOTALL |
| newline matches [^a]       yes       not changeable |
| $ matches \n at end        yes       PCRE2_DOLLAR_ENDONLY |
| $ matches \n in middle     no        PCRE2_MULTILINE |
| ^ matches \n in middle     no        PCRE2_MULTILINE |
| |
| This is the equivalent table for a POSIX-compatible pattern matcher: |
| |
|                           Default   Change with |
| |
| . matches newline          yes       REG_NEWLINE |
| newline matches [^a]       yes       REG_NEWLINE |
| $ matches \n at end        no        REG_NEWLINE |
| $ matches \n in middle     no        REG_NEWLINE |
| ^ matches \n in middle     no        REG_NEWLINE |
| |
| This behaviour is not what happens when PCRE2 is called via its POSIX |
| API. By default, PCRE2's behaviour is the same as Perl's, except that |
| there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 |
| and Perl, there is no way to stop newline from matching [^a]. |
| |
| Default POSIX newline handling can be obtained by setting PCRE2_DOTALL |
| and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but |
| there is no way to make PCRE2 behave exactly as for the REG_NEWLINE ac- |
| tion. When using the POSIX API, passing REG_NEWLINE to PCRE2's |
| pcre2_regcomp() function causes PCRE2_MULTILINE to be passed to |
| pcre2_compile(), and REG_DOTALL passes PCRE2_DOTALL. There is no way to |
| pass PCRE2_DOLLAR_ENDONLY. |
| |
| |
| MATCHING A PATTERN |
| |
| The function pcre2_regexec() is called to match a compiled pattern preg |
| against a given string, which is by default terminated by a zero byte |
| (but see REG_STARTEND below), subject to the options in eflags. These |
| can be: |
| |
| REG_NOTBOL |
| |
| The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match- |
| ing function. |
| |
| REG_NOTEMPTY |
| |
| The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2 |
| matching function. Note that REG_NOTEMPTY is not part of the POSIX |
| standard. However, setting this option can give more POSIX-like behav- |
| iour in some situations. |
| |
| REG_NOTEOL |
| |
| The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match- |
| ing function. |
| |
| REG_STARTEND |
| |
| When this option is set, the subject string starts at string + |
| pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should |
| point to the first character beyond the string. There may be binary ze- |
| ros within the subject string, and indeed, using REG_STARTEND is the |
| only way to pass a subject string that contains a binary zero. |
| |
| Whatever the value of pmatch[0].rm_so, the offsets of the matched |
| string and any captured substrings are still given relative to the |
| start of string itself. (Before PCRE2 release 10.30 these were given |
| relative to string + pmatch[0].rm_so, but this differs from other im- |
| plementations.) |
| |
| This is a BSD extension, compatible with but not specified by IEEE |
| Standard 1003.2 (POSIX.2), and should be used with caution in software |
| intended to be portable to other systems. Note that a non-zero rm_so |
| does not imply REG_NOTBOL; REG_STARTEND affects only the location and |
| length of the string, not how it is matched. Setting REG_STARTEND and |
| passing pmatch as NULL are mutually exclusive; the error REG_INVARG is |
| returned. |
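| |
| The REG_STARTEND convention can be sketched as follows. The helper |
| name match_window is hypothetical. The code uses the standard POSIX |
| names and includes <regex.h>; with PCRE2, include pcre2posix.h in- |
| stead. Because REG_STARTEND is an extension, some system headers may |
| not define it, in which case this sketch will not compile there. |
| |
```c
#include <assert.h>
#include <regex.h>   /* assumption: swap for <pcre2posix.h> to use PCRE2 */
#include <stddef.h>

/* Hypothetical helper (not part of any library).
 * Matches `re` against the bytes buf[start] .. buf[end - 1] only.
 * m must point to at least one regmatch_t; on success m[0] holds the
 * match offsets, measured from buf itself rather than from buf + start.
 * Returns the regexec() result (0 on a match). */
int match_window(const regex_t *re, const char *buf,
                 size_t start, size_t end, regmatch_t *m)
{
    assert(re != NULL && buf != NULL && m != NULL && start <= end);

    /* With REG_STARTEND, pmatch[0] is input as well as output. */
    m[0].rm_so = (regoff_t)start;
    m[0].rm_eo = (regoff_t)end;
    return regexec(re, buf, 1, m, REG_STARTEND);
}
```
| |
| Since the end of the subject is given by rm_eo rather than by a ter- |
| minating zero, the same call can be used for buffers that contain bi- |
| nary zeros. |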
| |
| If the pattern was compiled with the REG_NOSUB flag, no data about any |
| matched strings is returned. The nmatch and pmatch arguments of |
| pcre2_regexec() are ignored (except possibly as input for REG_STAR- |
| TEND). |
| |
| The value of nmatch may be zero, and the value of pmatch may be NULL |
| (unless REG_STARTEND is set); in both these cases no data about any |
| matched strings is returned. |
| |
| Otherwise, the portion of the string that was matched, and also any |
| captured substrings, are returned via the pmatch argument, which points |
| to an array of nmatch structures of type regmatch_t, containing the |
| members rm_so and rm_eo. These contain the byte offset to the first |
| character of each substring and the offset to the first character after |
| the end of each substring, respectively. The 0th element of the vector |
| relates to the entire portion of string that was matched; subsequent |
| elements relate to the capturing subpatterns of the regular expression. |
| Unused entries in the array have both structure members set to -1. |
| |
| A successful match yields a zero return; various error codes are de- |
| fined in the header file, of which REG_NOMATCH is the "expected" fail- |
| ure code. |
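| |
| The following sketch shows one way to read the pmatch vector after a |
| successful match, including the check for unused entries. The helper |
| name copy_group is hypothetical, and the code uses the standard POSIX |
| names with <regex.h>; with PCRE2, include pcre2posix.h instead. |
| |
```c
#include <assert.h>
#include <regex.h>   /* assumption: swap for <pcre2posix.h> to use PCRE2 */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of any library).
 * Copies the text of entry `group` of the match vector m (which has
 * nmatch entries, filled in by a successful regexec() on `subject`)
 * into out as a zero-terminated string.
 * Returns the substring length, or -1 if the entry is out of range,
 * unused (offsets of -1), or too long for the outsz-byte buffer. */
long copy_group(const char *subject, const regmatch_t *m, size_t nmatch,
                size_t group, char *out, size_t outsz)
{
    assert(subject != NULL && m != NULL && out != NULL);

    if (group >= nmatch || m[group].rm_so == -1)
        return -1;

    /* rm_eo is the offset of the first character after the substring. */
    size_t len = (size_t)(m[group].rm_eo - m[group].rm_so);
    if (len + 1 > outsz)
        return -1;

    memcpy(out, subject + m[group].rm_so, len);
    out[len] = '\0';
    return (long)len;
}
```
| |
| For the pattern (cat)|(dog) matched against "the dog sat", entry 0 and |
| entry 2 both describe "dog", while entry 1 is unused because the first |
| group did not take part in the match. |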
| |
| |
| ERROR MESSAGES |
| |
| The pcre2_regerror() function maps a non-zero errorcode from either |
| pcre2_regcomp() or pcre2_regexec() to a printable message. If preg is |
| not NULL, the error should have arisen from the use of that structure. |
| A message terminated by a binary zero is placed in errbuf. If the buf- |
| fer is too short, only the first errbuf_size - 1 characters of the er- |
| ror message are used. The yield of the function is the size of buffer |
| needed to hold the whole message, including the terminating zero. This |
| value is greater than errbuf_size if the message was truncated. |
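| |
| Because the yield is the full size needed, a caller can ask for the |
| size first and then fetch the whole message. A minimal sketch, in |
| which the helper name regex_error_string is hypothetical and the stan- |
| dard POSIX names are used with <regex.h> (include pcre2posix.h instead |
| to use PCRE2): |
| |
```c
#include <assert.h>
#include <regex.h>   /* assumption: swap for <pcre2posix.h> to use PCRE2 */
#include <stdlib.h>

/* Hypothetical helper (not part of any library).
 * Returns a malloc()ed, zero-terminated copy of the complete message
 * for a non-zero error code, or NULL if memory cannot be obtained.
 * preg may be NULL; the caller frees the result with free(). */
char *regex_error_string(int errcode, const regex_t *preg)
{
    assert(errcode != 0);

    /* First call: a one-byte buffer truncates the message, but the
     * return value is still the size needed for all of it. */
    char probe[1];
    size_t needed = regerror(errcode, preg, probe, sizeof probe);

    char *msg = malloc(needed);
    if (msg != NULL)
        (void)regerror(errcode, preg, msg, needed);  /* now it fits */
    return msg;
}
```
| |
| The alternative of a fixed-size buffer is simpler; comparing the yield |
| with errbuf_size then tells the caller whether the text was cut short. |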
| |
| |
| MEMORY USAGE |
| |
| Compiling a regular expression causes memory to be allocated and asso- |
| ciated with the preg structure. The function pcre2_regfree() frees all |
| such memory, after which preg may no longer be used as a compiled ex- |
| pression. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 26 April 2021 |
| Copyright (c) 1997-2021 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2SAMPLE(3) Library Functions Manual PCRE2SAMPLE(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| PCRE2 SAMPLE PROGRAM |
| |
| A simple, complete demonstration program to get you started with using |
| PCRE2 is supplied in the file pcre2demo.c in the src directory in the |
| PCRE2 distribution. A listing of this program is given in the pcre2demo |
| documentation. If you do not have a copy of the PCRE2 distribution, you |
| can save this listing to re-create the contents of pcre2demo.c. |
| |
| The demonstration program compiles the regular expression that is its |
| first argument, and matches it against the subject string in its second |
| argument. No PCRE2 options are set, and default character tables are |
| used. If matching succeeds, the program outputs the portion of the sub- |
| ject that matched, together with the contents of any captured sub- |
| strings. |
| |
| If the -g option is given on the command line, the program then goes on |
| to check for further matches of the same regular expression in the same |
| subject string. The logic is a little bit tricky because of the possi- |
| bility of matching an empty string. Comments in the code explain what |
| is going on. |
| |
| The code in pcre2demo.c is an 8-bit program that uses the PCRE2 8-bit |
| library. It handles strings and characters that are stored in 8-bit |
| code units. By default, one character corresponds to one code unit, |
| but if the pattern starts with "(*UTF)", both it and the subject are |
| treated as UTF-8 strings, where characters may occupy multiple code |
| units. |
| |
| If PCRE2 is installed in the standard include and library directories |
| for your operating system, you should be able to compile the demonstra- |
| tion program using a command like this: |
| |
| cc -o pcre2demo pcre2demo.c -lpcre2-8 |
| |
| If PCRE2 is installed elsewhere, you may need to add additional options |
| to the command line. For example, on a Unix-like system that has PCRE2 |
| installed in /usr/local, you can compile the demonstration program us- |
| ing a command like this: |
| |
| cc -o pcre2demo -I/usr/local/include pcre2demo.c \ |
| -L/usr/local/lib -lpcre2-8 |
| |
| Once you have built the demonstration program, you can run simple tests |
| like this: |
| |
| ./pcre2demo 'cat|dog' 'the cat sat on the mat' |
| ./pcre2demo -g 'cat|dog' 'the dog sat on the cat' |
| |
| Note that there is a much more comprehensive test program, called |
| pcre2test, which supports many more facilities for testing regular ex- |
| pressions using all three PCRE2 libraries (8-bit, 16-bit, and 32-bit, |
| though not all three need be installed). The pcre2demo program is pro- |
| vided as a relatively simple coding example. |
| |
| If you try to run pcre2demo when PCRE2 is not installed in the standard |
| library directory, you may get an error like this on some operating |
| systems (e.g. Solaris): |
| |
| ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file |
| or directory |
| |
| This is caused by the way shared library support works on those sys- |
| tems. You need to add |
| |
| -R/usr/local/lib |
| |
| (for example) to the compile command to get round this problem. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 02 February 2016 |
| Copyright (c) 1997-2016 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| PCRE2SERIALIZE(3) Library Functions Manual PCRE2SERIALIZE(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS |
| |
| int32_t pcre2_serialize_decode(pcre2_code **codes, |
| int32_t number_of_codes, const uint8_t *bytes, |
| pcre2_general_context *gcontext); |
| |
| int32_t pcre2_serialize_encode(const pcre2_code **codes, |
| int32_t number_of_codes, uint8_t **serialized_bytes, |
| PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext); |
| |
| void pcre2_serialize_free(uint8_t *bytes); |
| |
| int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes); |
| |
| If you are running an application that uses a large number of regular |
| expression patterns, it may be useful to store them in a precompiled |
| form instead of having to compile them every time the application is |
| run. However, if you are using the just-in-time optimization feature, |
| it is not possible to save and reload the JIT data, because it is posi- |
| tion-dependent. The host on which the patterns are reloaded must be |
| running the same version of PCRE2, with the same code unit width, and |
| must also have the same endianness, pointer width and PCRE2_SIZE type. |
| For example, patterns compiled on a 32-bit system using PCRE2's 16-bit |
| library cannot be reloaded on a 64-bit system, nor can they be reloaded |
| using the 8-bit library. |
| |
| Note that "serialization" in PCRE2 does not convert compiled patterns |
| to an abstract format like Java or .NET serialization. The serialized |
| output is really just a bytecode dump, which is why it can only be |
| reloaded in the same environment as the one that created it. Hence the |
| restrictions mentioned above. Applications that are not statically |
| linked with a fixed version of PCRE2 must be prepared to recompile pat- |
| terns from their sources, in order to be immune to PCRE2 upgrades. |
| |
| |
| SECURITY CONCERNS |
| |
| The facility for saving and restoring compiled patterns is intended for |
| use within individual applications. As such, the data supplied to |
| pcre2_serialize_decode() is expected to be trusted data, not data from |
| arbitrary external sources. There is only some simple consistency |
| checking, not complete validation of what is being re-loaded. Corrupted |
| data may cause undefined results. For example, if the length field of a |
| pattern in the serialized data is corrupted, the deserializing code may |
| read beyond the end of the byte stream that is passed to it. |
| |
| |
| SAVING COMPILED PATTERNS |
| |
| Before compiled patterns can be saved they must be serialized, which in |
| PCRE2 means converting the pattern to a stream of bytes. A single byte |
| stream may contain any number of compiled patterns, but they must all |
| use the same character tables. A single copy of the tables is included |
| in the byte stream (its size is 1088 bytes). For more details of char- |
| acter tables, see the section on locale support in the pcre2api docu- |
| mentation. |
| |
| The function pcre2_serialize_encode() creates a serialized byte stream |
| from a list of compiled patterns. Its first two arguments specify the |
| list, being a pointer to a vector of pointers to compiled patterns, and |
| the length of the vector. The third and fourth arguments point to vari- |
| ables which are set to point to the created byte stream and its length, |
| respectively. The final argument is a pointer to a general context, |
| which can be used to specify custom memory management functions. If |
| this argument is NULL, malloc() is used to obtain memory for the byte |
| stream. The yield of the function is the number of serialized patterns, |
| or one of the following negative error codes: |
| |
| PCRE2_ERROR_BADDATA the number of patterns is zero or less |
| PCRE2_ERROR_BADMAGIC mismatch of id bytes in one of the patterns |
| PCRE2_ERROR_NOMEMORY memory allocation failed |
| PCRE2_ERROR_MIXEDTABLES the patterns do not all use the same tables |
| PCRE2_ERROR_NULL the 1st, 3rd, or 4th argument is NULL |
| |
| PCRE2_ERROR_BADMAGIC means either that a pattern's code has been cor- |
| rupted, or that a slot in the vector does not point to a compiled pat- |
| tern. |
| |
| Once a set of patterns has been serialized you can save the data in any |
| appropriate manner. Here is sample code that compiles two patterns and |
| writes them to a file. It assumes that the variable fd refers to a file |
| that is open for output. The error checking that should be present in a |
| real application has been omitted for simplicity. |
| |
| int errorcode; |
| uint8_t *bytes; |
| PCRE2_SIZE erroroffset; |
| PCRE2_SIZE bytescount; |
| pcre2_code *list_of_codes[2]; |
| /* The 8-bit library takes patterns as PCRE2_SPTR, hence the casts. */ |
| list_of_codes[0] = pcre2_compile((PCRE2_SPTR)"first pattern", |
| PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL); |
| list_of_codes[1] = pcre2_compile((PCRE2_SPTR)"second pattern", |
| PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL); |
| /* The encoder expects a pointer to const pcre2_code pointers. */ |
| errorcode = pcre2_serialize_encode((const pcre2_code **)list_of_codes, |
| 2, &bytes, &bytescount, NULL); |
| /* fd is a FILE * that is open for binary output. */ |
| errorcode = fwrite(bytes, 1, bytescount, fd); |
| |
| Note that the serialized data is binary data that may contain any of |
| the 256 possible byte values. On systems that make a distinction be- |
| tween binary and non-binary data, be sure that the file is opened for |
| binary output. |
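| |
| As a sketch of the file-handling side only, the helper below writes a |
| byte stream to a file opened in binary mode and reports short writes. |
| The name save_serialized is hypothetical; the bytes and size arguments |
| stand for the values set by pcre2_serialize_encode(). Only the stan- |
| dard C library is used. |
| |
```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not part of PCRE2).
 * Writes size bytes to the file at path, creating or truncating it.
 * Returns 0 if every byte was written and the file closed cleanly,
 * otherwise -1. */
int save_serialized(const char *path, const uint8_t *bytes, size_t size)
{
    assert(path != NULL && bytes != NULL);

    /* "wb": binary mode, so no byte value is translated on output. */
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;

    size_t written = fwrite(bytes, 1, size, f);
    int close_rc = fclose(f);   /* flushes; may report a late error */

    return (written == size && close_rc == 0) ? 0 : -1;
}
```
| |
| The byte stream itself still belongs to the caller after this call and |
| is released separately. |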
| |
| Serializing a set of patterns leaves the original data untouched, so |
| they can still be used for matching. Their memory must eventually be |
| freed in the usual way by calling pcre2_code_free(). When you have fin- |
| ished with the byte stream, it too must be freed by calling pcre2_seri- |
| alize_free(). If this function is called with a NULL argument, it re- |
| turns immediately without doing anything. |
| |
| |
| RE-USING PRECOMPILED PATTERNS |
| |
| In order to re-use a set of saved patterns you must first make the se- |
| rialized byte stream available in main memory (for example, by reading |
| from a file). The management of this memory block is up to the applica- |
| tion. You can use the pcre2_serialize_get_number_of_codes() function to |
| find out how many compiled patterns are in the serialized data without |
| actually decoding the patterns: |
| |
| uint8_t *bytes = <serialized data>; |
| int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes); |
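| |
| One way to make a saved byte stream available in main memory is to |
| read the whole file into a malloc()ed block, as in this sketch. The |
| name load_serialized is hypothetical and only the standard C library |
| is used; the returned block is what would be passed as the bytes argu- |
| ment of the functions described here. |
| |
```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of PCRE2).
 * Reads the whole file at path (opened in binary mode) into a
 * malloc()ed block and stores its length in *size_out.
 * Returns the block, which the caller frees with free(), or NULL if
 * the file cannot be opened or read, or memory runs out. */
uint8_t *load_serialized(const char *path, size_t *size_out)
{
    assert(path != NULL && size_out != NULL);

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    uint8_t *buf = NULL;
    size_t used = 0, cap = 0;
    for (;;) {
        if (used == cap) {                  /* buffer full: grow it */
            size_t newcap = cap ? cap * 2 : 4096;
            uint8_t *tmp = realloc(buf, newcap);
            if (tmp == NULL) {
                free(buf);
                fclose(f);
                return NULL;
            }
            buf = tmp;
            cap = newcap;
        }
        size_t n = fread(buf + used, 1, cap - used, f);
        used += n;
        if (n == 0)                         /* end of file or error */
            break;
    }

    int failed = ferror(f);
    fclose(f);
    if (failed) {
        free(buf);
        return NULL;
    }
    *size_out = used;
    return buf;
}
```
| |
| The helper does nothing to validate the contents, so it is suitable |
| only for files that the application itself wrote and trusts. |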
| |
| The pcre2_serialize_decode() function reads a byte stream and recreates |
| the compiled patterns in new memory blocks, setting pointers to them in |
| a vector. The first two arguments are a pointer to a suitable vector |
| and its length, and the third argument points to a byte stream. The fi- |
| nal argument is a pointer to a general context, which can be used to |
| specify custom memory management functions for the decoded patterns. |
| If this argument is NULL, malloc() and free() are used. After deserial- |
| ization, the byte stream is no longer needed and can be discarded. |
| |
| pcre2_code *list_of_codes[2]; |
| uint8_t *bytes = <serialized data>; |
| int32_t number_of_codes = |
| pcre2_serialize_decode(list_of_codes, 2, bytes, NULL); |
| |
| If the vector is not large enough for all the patterns in the byte |
| stream, it is filled with those that fit, and the remainder are ig- |
| nored. The yield of the function is the number of decoded patterns, or |
| one of the following negative error codes: |
| |
| PCRE2_ERROR_BADDATA second argument is zero or less |
| PCRE2_ERROR_BADMAGIC mismatch of id bytes in the data |
| PCRE2_ERROR_BADMODE mismatch of code unit size or PCRE2 version |
| PCRE2_ERROR_BADSERIALIZEDDATA other sanity check failure |
| PCRE2_ERROR_NOMEMORY memory allocation failed |
| PCRE2_ERROR_NULL first or third argument is NULL |
| |
| PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was |
| compiled on a system with different endianness. |
| |
| Decoded patterns can be used for matching in the usual way, and must be |
| freed by calling pcre2_code_free(). However, be aware that there is a |
| potential race issue if you are using multiple patterns that were de- |
| coded from a single byte stream in a multithreaded application. A sin- |
| gle copy of the character tables is used by all the decoded patterns |
| and a reference count is used to arrange for its memory to be automati- |
| cally freed when the last pattern is freed, but there is no locking on |
| this reference count. Therefore, if you want to call pcre2_code_free() |
| for these patterns in different threads, you must arrange your own |
| locking, and ensure that pcre2_code_free() cannot be called by two |
| threads at the same time. |
| |
| If a pattern was processed by pcre2_jit_compile() before being serial- |
| ized, the JIT data is discarded and so is no longer available after a |
| save/restore cycle. You can, however, process a restored pattern with |
| pcre2_jit_compile() if you wish. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 27 June 2018 |
| Copyright (c) 1997-2018 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2SYNTAX(3) Library Functions Manual PCRE2SYNTAX(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY |
| |
| The full syntax and semantics of the regular expressions that are sup- |
| ported by PCRE2 are described in the pcre2pattern documentation. This |
| document contains a quick-reference summary of the syntax. |
| |
| |
| QUOTING |
| |
| \x where x is non-alphanumeric is a literal x |
| \Q...\E treat enclosed characters as literal |
| |
| |
| ESCAPED CHARACTERS |
| |
| This table applies to ASCII and Unicode environments. An unrecognized |
| escape sequence causes an error. |
| |
| \a alarm, that is, the BEL character (hex 07) |
| \cx "control-x", where x is any ASCII printing character |
| \e escape (hex 1B) |
| \f form feed (hex 0C) |
| \n newline (hex 0A) |
| \r carriage return (hex 0D) |
| \t tab (hex 09) |
| \0dd character with octal code 0dd |
| \ddd character with octal code ddd, or backreference |
| \o{ddd..} character with octal code ddd.. |
| \N{U+hh..} character with Unicode code point hh.. (Unicode mode only) |
| \xhh character with hex code hh |
| \x{hh..} character with hex code hh.. |
| |
| If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the |
| following are also recognized: |
| |
| \U the character "U" |
| \uhhhh character with hex code hhhh |
| \u{hh..} character with hex code hh.. but only for EXTRA_ALT_BSUX |
| |
| When \x is not followed by {, from zero to two hexadecimal digits are |
| read, but in ALT_BSUX mode \x must be followed by two hexadecimal dig- |
| its to be recognized as a hexadecimal escape; otherwise it matches a |
| literal "x". Likewise, if \u (in ALT_BSUX mode) is not followed by |
| four hexadecimal digits or (in EXTRA_ALT_BSUX mode) a sequence of hex |
| digits in curly brackets, it matches a literal "u". |
| |
| Note that \0dd is always an octal code. The treatment of backslash fol- |
| lowed by a non-zero digit is complicated; for details see the section |
| "Non-printing characters" in the pcre2pattern documentation, where de- |
| tails of escape processing in EBCDIC environments are also given. |
| \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in |
| EBCDIC environments. Note that \N not followed by an opening curly |
| bracket has a different meaning (see below). |
| |
| |
| CHARACTER TYPES |
| |
| . any character except newline; |
| in dotall mode, any character whatsoever |
| \C one code unit, even in UTF mode (best avoided) |
| \d a decimal digit |
| \D a character that is not a decimal digit |
| \h a horizontal white space character |
| \H a character that is not a horizontal white space character |
| \N a character that is not a newline |
| \p{xx} a character with the xx property |
| \P{xx} a character without the xx property |
| \R a newline sequence |
| \s a white space character |
| \S a character that is not a white space character |
| \v a vertical white space character |
| \V a character that is not a vertical white space character |
| \w a "word" character |
| \W a "non-word" character |
| \X a Unicode extended grapheme cluster |
| |
| \C is dangerous because it may leave the current matching point in the |
| middle of a UTF-8 or UTF-16 character. The application can lock out the |
| use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also |
| possible to build PCRE2 with the use of \C permanently disabled. |
| |
| By default, \d, \s, and \w match only ASCII characters, even in UTF-8 |
| mode or in the 16-bit and 32-bit libraries. However, if locale-specific |
| matching is happening, \s and \w may also match characters with code |
| points in the range 128-255. If the PCRE2_UCP option is set, the behav- |
| iour of these escape sequences is changed to use Unicode properties and |
| they match many more characters. |
| |
| Property descriptions in \p and \P are matched caselessly; hyphens, un- |
| derscores, and white space are ignored, in accordance with Unicode's |
| "loose matching" rules. |
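| |
| The comparison rule just described can be illustrated with a small |
| stand-alone function. This is a sketch of the rule as stated here, not |
| PCRE2's own code, and the names skip_ignored and loose_name_equal are |
| hypothetical. It handles ASCII names only. |
| |
```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Advance past characters that the loose comparison ignores:
 * hyphens, underscores, and white space. */
static const char *skip_ignored(const char *s)
{
    while (*s == '-' || *s == '_' || isspace((unsigned char)*s))
        s++;
    return s;
}

/* Hypothetical helper (not part of PCRE2).
 * Returns true if the two names are equal when compared caselessly
 * with hyphens, underscores, and white space ignored. */
bool loose_name_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    for (;;) {
        a = skip_ignored(a);
        b = skip_ignored(b);
        if (*a == '\0' || *b == '\0')
            return *a == *b;    /* equal only if both names ended */
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return false;
        a++;
        b++;
    }
}
```
| |
| Under this rule "Bidi_Class" and "bidi class" compare equal, whereas |
| "Latin" and "Latn" do not, because no letters are ever skipped. |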
| |
| |
| GENERAL CATEGORY PROPERTIES FOR \p and \P |
| |
| C Other |
| Cc Control |
| Cf Format |
| Cn Unassigned |
| Co Private use |
| Cs Surrogate |
| |
| L Letter |
| Ll Lower case letter |
| Lm Modifier letter |
| Lo Other letter |
| Lt Title case letter |
| Lu Upper case letter |
| Lc Ll, Lu, or Lt |
| L& Ll, Lu, or Lt |
| |
| M Mark |
| Mc Spacing mark |
| Me Enclosing mark |
| Mn Non-spacing mark |
| |
| N Number |
| Nd Decimal number |
| Nl Letter number |
| No Other number |
| |
| P Punctuation |
| Pc Connector punctuation |
| Pd Dash punctuation |
| Pe Close punctuation |
| Pf Final punctuation |
| Pi Initial punctuation |
| Po Other punctuation |
| Ps Open punctuation |
| |
| S Symbol |
| Sc Currency symbol |
| Sk Modifier symbol |
| Sm Mathematical symbol |
| So Other symbol |
| |
| Z Separator |
| Zl Line separator |
| Zp Paragraph separator |
| Zs Space separator |
| |
| |
| PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P |
| |
| Xan Alphanumeric: union of properties L and N |
| Xps POSIX space: property Z or tab, NL, VT, FF, CR |
| Xsp Perl space: property Z or tab, NL, VT, FF, CR |
| Xuc Universally-named character: one that can be |
| represented by a Universal Character Name |
| Xwd Perl word: property Xan or underscore |
| |
| Perl and POSIX space are now the same. Perl added VT to its space char- |
| acter set at release 5.18. |
| |
| |
| BINARY PROPERTIES FOR \p AND \P |
| |
| Unicode defines a number of binary properties, that is, properties |
| whose only values are true or false. You can obtain a list of those |
| that are recognized by \p and \P, along with their abbreviations, by |
| running this command: |
| |
| pcre2test -LP |
| |
| |
| SCRIPT MATCHING WITH \p AND \P |
| |
| Many script names and their 4-letter abbreviations are recognized in |
| \p{sc:...} or \p{scx:...} items, or on their own with \p (and also \P |
| of course). You can obtain a list of these scripts by running this com- |
| mand: |
| |
| pcre2test -LS |
| |
| |
| THE BIDI_CLASS PROPERTY FOR \p AND \P |
| |
| \p{Bidi_Class:<class>} matches a character with the given class |
| \p{BC:<class>} matches a character with the given class |
| |
| The recognized classes are: |
| |
| AL Arabic letter |
| AN Arabic number |
| B paragraph separator |
| BN boundary neutral |
| CS common separator |
| EN European number |
| ES European separator |
| ET European terminator |
| FSI first strong isolate |
| L left-to-right |
| LRE left-to-right embedding |
| LRI left-to-right isolate |
| LRO left-to-right override |
| NSM non-spacing mark |
| ON other neutral |
| PDF pop directional format |
| PDI pop directional isolate |
| R right-to-left |
| RLE right-to-left embedding |
| RLI right-to-left isolate |
| RLO right-to-left override |
| S segment separator |
| WS white space |
| |
| |
| CHARACTER CLASSES |
| |
| [...] positive character class |
| [^...] negative character class |
| [x-y] range (can be used for hex characters) |
| [[:xxx:]] positive POSIX named set |
| [[:^xxx:]] negative POSIX named set |
| |
| alnum alphanumeric |
| alpha alphabetic |
| ascii 0-127 |
| blank space or tab |
| cntrl control character |
| digit decimal digit |
| graph printing, excluding space |
| lower lower case letter |
| print printing, including space |
| punct printing, excluding alphanumeric |
| space white space |
| upper upper case letter |
| word same as \w |
| xdigit hexadecimal digit |
| |
| In PCRE2, POSIX character set names recognize only ASCII characters by |
| default, but some of them use Unicode properties if PCRE2_UCP is set. |
| You can use \Q...\E inside a character class. |
| |
| |
| QUANTIFIERS |
| |
| ? 0 or 1, greedy |
| ?+ 0 or 1, possessive |
| ?? 0 or 1, lazy |
| * 0 or more, greedy |
| *+ 0 or more, possessive |
| *? 0 or more, lazy |
| + 1 or more, greedy |
| ++ 1 or more, possessive |
| +? 1 or more, lazy |
| {n} exactly n |
| {n,m} at least n, no more than m, greedy |
| {n,m}+ at least n, no more than m, possessive |
| {n,m}? at least n, no more than m, lazy |
| {n,} n or more, greedy |
| {n,}+ n or more, possessive |
| {n,}? n or more, lazy |
| |
| |
| ANCHORS AND SIMPLE ASSERTIONS |
| |
| \b word boundary |
| \B not a word boundary |
| ^ start of subject |
| also after an internal newline in multiline mode |
| (after any newline if PCRE2_ALT_CIRCUMFLEX is set) |
| \A start of subject |
| $ end of subject |
| also before newline at end of subject |
| also before internal newline in multiline mode |
| \Z end of subject |
| also before newline at end of subject |
| \z end of subject |
| \G first matching position in subject |
| |
| |
| REPORTED MATCH POINT SETTING |
| |
| \K set reported start of match |
| |
| From release 10.38 \K is not permitted by default in lookaround asser- |
| tions, for compatibility with Perl. However, if the PCRE2_EXTRA_AL- |
| LOW_LOOKAROUND_BSK option is set, the previous behaviour is re-enabled. |
| When this option is set, \K is honoured in positive assertions, but ig- |
| nored in negative ones. |
| |
| |
| ALTERNATION |
| |
| expr|expr|expr... |
| |
| |
| CAPTURING |
| |
| (...) capture group |
| (?<name>...) named capture group (Perl) |
| (?'name'...) named capture group (Perl) |
| (?P<name>...) named capture group (Python) |
| (?:...) non-capture group |
| (?|...) non-capture group; reset group numbers for |
| capture groups in each alternative |
| |
| In non-UTF modes, names may contain underscores and ASCII letters and |
| digits; in UTF modes, any Unicode letters and Unicode decimal digits |
| are permitted. In both cases, a name must not start with a digit. |
| |
| |
| ATOMIC GROUPS |
| |
| (?>...) atomic non-capture group |
| (*atomic:...) atomic non-capture group |
| |
| |
| COMMENT |
| |
| (?#....) comment (not nestable) |
| |
| |
| OPTION SETTING |
| Changes of these options within a group are automatically cancelled at |
| the end of the group. |
| |
| (?i) caseless |
| (?J) allow duplicate named groups |
| (?m) multiline |
| (?n) no auto capture |
| (?s) single line (dotall) |
| (?U) default ungreedy (lazy) |
| (?x) extended: ignore white space except in classes |
| (?xx) as (?x) but also ignore space and tab in classes |
| (?-...) unset option(s) |
| (?^) unset imnsx options |
| |
| Unsetting x or xx unsets both. Several options may be set at once, and |
| a mixture of setting and unsetting such as (?i-x) is allowed, but there |
| may be only one hyphen. Setting (but not unsetting) is allowed after |
| (?^, for example (?^in). An option setting may appear at the start of a |
| non-capture group, for example (?i:...). |
| |
| The following are recognized only at the very start of a pattern or af- |
| ter one of the newline or \R options with similar syntax. More than one |
| of them may appear. For the first three, d is a decimal number. |
| |
| (*LIMIT_DEPTH=d) set the backtracking limit to d |
| (*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes |
| (*LIMIT_MATCH=d) set the match limit to d |
| (*NOTEMPTY) set PCRE2_NOTEMPTY when matching |
| (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching |
| (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS) |
| (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR) |
| (*NO_JIT) disable JIT optimization |
| (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE) |
| (*UTF) set appropriate UTF mode for the library in use |
| (*UCP) set PCRE2_UCP (use Unicode properties for \d etc) |
| |
| Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the |
| value of the limits set by the caller of pcre2_match() or |
| pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete |
| synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF) |
| and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, |
| respectively, at compile time. |
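| |
| The "reduce only" rule amounts to taking the smaller of the caller's |
| value and the pattern's value. A minimal sketch of that arithmetic, |
| using hypothetical helper names that are not part of the PCRE2 API: |
| |

```c
#include <assert.h>
#include <stdint.h>

/* A limit given in the pattern can lower the caller's limit, but a larger
   value in the pattern has no effect. */
uint32_t effective_limit(uint32_t caller_limit, uint32_t pattern_limit)
{
    return pattern_limit < caller_limit ? pattern_limit : caller_limit;
}

/* (*LIMIT_HEAP=d) is expressed in units of 1024 bytes. */
uint64_t heap_limit_bytes(uint32_t d)
{
    return (uint64_t)d * 1024u;
}
```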
| |
| |
| NEWLINE CONVENTION |
| |
| These are recognized only at the very start of the pattern or after op- |
| tion settings with a similar syntax. |
| |
| (*CR) carriage return only |
| (*LF) linefeed only |
| (*CRLF) carriage return followed by linefeed |
| (*ANYCRLF) all three of the above |
| (*ANY) any Unicode newline sequence |
| (*NUL) the NUL character (binary zero) |
| |
| |
| WHAT \R MATCHES |
| |
| These are recognized only at the very start of the pattern or after |
| option settings with a similar syntax. |
| |
| (*BSR_ANYCRLF) CR, LF, or CRLF |
| (*BSR_UNICODE) any Unicode newline sequence |
| |
| |
| LOOKAHEAD AND LOOKBEHIND ASSERTIONS |
| |
| (?=...) ) |
| (*pla:...) ) positive lookahead |
| (*positive_lookahead:...) ) |
| |
| (?!...) ) |
| (*nla:...) ) negative lookahead |
| (*negative_lookahead:...) ) |
| |
| (?<=...) ) |
| (*plb:...) ) positive lookbehind |
| (*positive_lookbehind:...) ) |
| |
| (?<!...) ) |
| (*nlb:...) ) negative lookbehind |
| (*negative_lookbehind:...) ) |
| |
| Each top-level branch of a lookbehind must be of a fixed length. |
| |
| |
| NON-ATOMIC LOOKAROUND ASSERTIONS |
| |
| These assertions are specific to PCRE2 and are not Perl-compatible. |
| |
| (?*...) ) |
| (*napla:...) ) synonyms |
| (*non_atomic_positive_lookahead:...) ) |
| |
| (?<*...) ) |
| (*naplb:...) ) synonyms |
| (*non_atomic_positive_lookbehind:...) ) |
| |
| |
| SCRIPT RUNS |
| |
| (*script_run:...) ) script run, can be backtracked into |
| (*sr:...) ) |
| |
| (*atomic_script_run:...) ) atomic script run |
| (*asr:...) ) |
| |
| |
| BACKREFERENCES |
| |
| \n reference by number (can be ambiguous) |
| \gn reference by number |
| \g{n} reference by number |
| \g+n relative reference by number (PCRE2 extension) |
| \g-n relative reference by number |
| \g{+n} relative reference by number (PCRE2 extension) |
| \g{-n} relative reference by number |
| \k<name> reference by name (Perl) |
| \k'name' reference by name (Perl) |
| \g{name} reference by name (Perl) |
| \k{name} reference by name (.NET) |
| (?P=name) reference by name (Python) |
| |
| |
| SUBROUTINE REFERENCES (POSSIBLY RECURSIVE) |
| |
| (?R) recurse whole pattern |
| (?n) call subroutine by absolute number |
| (?+n) call subroutine by relative number |
| (?-n) call subroutine by relative number |
| (?&name) call subroutine by name (Perl) |
| (?P>name) call subroutine by name (Python) |
| \g<name> call subroutine by name (Oniguruma) |
| \g'name' call subroutine by name (Oniguruma) |
| \g<n> call subroutine by absolute number (Oniguruma) |
| \g'n' call subroutine by absolute number (Oniguruma) |
| \g<+n> call subroutine by relative number (PCRE2 extension) |
| \g'+n' call subroutine by relative number (PCRE2 extension) |
| \g<-n> call subroutine by relative number (PCRE2 extension) |
| \g'-n' call subroutine by relative number (PCRE2 extension) |
| |
| |
| CONDITIONAL PATTERNS |
| |
| (?(condition)yes-pattern) |
| (?(condition)yes-pattern|no-pattern) |
| |
| (?(n) absolute reference condition |
| (?(+n) relative reference condition |
| (?(-n) relative reference condition |
| (?(<name>) named reference condition (Perl) |
| (?('name') named reference condition (Perl) |
| (?(name) named reference condition (PCRE2, deprecated) |
| (?(R) overall recursion condition |
| (?(Rn) specific numbered group recursion condition |
| (?(R&name) specific named group recursion condition |
| (?(DEFINE) define groups for reference |
| (?(VERSION[>]=n.m) test PCRE2 version |
| (?(assert) assertion condition |
| |
| Note the ambiguity of (?(R) and (?(Rn) which might be named reference |
| conditions or recursion tests. Such a condition is interpreted as a |
| reference condition if the relevant named group exists. |
| |
| |
| BACKTRACKING CONTROL |
| |
| All backtracking control verbs may be in the form (*VERB:NAME). For |
| (*MARK) the name is mandatory, for the others it is optional. (*SKIP) |
| changes its behaviour if :NAME is present. The others just set a name |
| for passing back to the caller, but this is not a name that (*SKIP) can |
| see. The following act as soon as they are reached: |
| |
| (*ACCEPT) force successful match |
| (*FAIL) force backtrack; synonym (*F) |
| (*MARK:NAME) set name to be passed back; synonym (*:NAME) |
| |
| The following act only when a subsequent match failure causes a back- |
| track to reach them. They all force a match failure, but they differ in |
| what happens afterwards. Those that advance the start-of-match point do |
| so only if the pattern is not anchored. |
| |
| (*COMMIT) overall failure, no advance of starting point |
| (*PRUNE) advance to next starting character |
| (*SKIP) advance to current matching position |
| (*SKIP:NAME) advance to position corresponding to an earlier |
| (*MARK:NAME); if not found, the (*SKIP) is ignored |
| (*THEN) local failure, backtrack to next alternation |
| |
| The effect of one of these verbs in a group called as a subroutine is |
| confined to the subroutine call. |
| |
| |
| CALLOUTS |
| |
| (?C) callout (assumed number 0) |
| (?Cn) callout with numerical data n |
| (?C"text") callout with string data |
| |
| The allowed string delimiters are ` ' " ^ % # $ (which are the same for |
| the start and the end), and the starting delimiter { matched with the |
| ending delimiter }. To encode the ending delimiter within the string, |
| double it. |
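| |
| The doubling rule is easy to get wrong when a callout string is built |
| by a program. The following sketch assembles a (?C...) item from |
| arbitrary text; make_callout() is a hypothetical helper, not a PCRE2 |
| function, and returning 0 on failure is a choice made for this example. |
| |

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Writes (?C<opener>text<closer>) into out, doubling each occurrence of the
   ending delimiter inside text. Returns the number of characters written,
   excluding the terminating NUL, or 0 if the delimiter is not one of the
   allowed ones or out is too small. */
size_t make_callout(const char *text, char opener, char *out, size_t outsize)
{
    assert(text != NULL && out != NULL);
    if (opener == '\0' || strchr("`'\"^%#${", opener) == NULL)
        return 0;                          /* not an allowed delimiter */
    char closer = (opener == '{') ? '}' : opener;

    /* "(?C" + opener + text with closers doubled + closer + ")" */
    size_t needed = 3 + 1 + 1 + 1;
    for (const char *p = text; *p != '\0'; p++)
        needed += (*p == closer) ? 2 : 1;
    if (needed + 1 > outsize)
        return 0;                          /* no room for result plus NUL */

    size_t n = 3;
    memcpy(out, "(?C", 3);
    out[n++] = opener;
    for (const char *p = text; *p != '\0'; p++) {
        if (*p == closer)
            out[n++] = closer;             /* double the ending delimiter */
        out[n++] = *p;
    }
    out[n++] = closer;
    out[n++] = ')';
    out[n] = '\0';
    return n;
}
```

| With the { delimiter only the closing brace is doubled, because } is |
| the ending delimiter. |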
| |
| |
| SEE ALSO |
| |
| pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3), |
| pcre2(3). |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| Retired from University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 12 January 2022 |
| Copyright (c) 1997-2022 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2UNICODE(3) Library Functions Manual PCRE2UNICODE(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| UNICODE AND UTF SUPPORT |
| |
| PCRE2 is normally built with Unicode support, though if you do not need |
| it, you can build it without, in which case the library will be |
| smaller. With Unicode support, PCRE2 has knowledge of Unicode character |
| properties and can process strings of text in UTF-8, UTF-16, and UTF-32 |
| format (depending on the code unit width), but this is not the default. |
| Unless specifically requested, PCRE2 treats each code unit in a string |
| as one character. |
| |
| There are two ways of telling PCRE2 to switch to UTF mode, where char- |
| acters may consist of more than one code unit and the range of values |
| is constrained. The program can call pcre2_compile() with the PCRE2_UTF |
| option, or the pattern may start with the sequence (*UTF). However, |
| the latter facility can be locked out by the PCRE2_NEVER_UTF option. |
| That is, the programmer can prevent the supplier of the pattern from |
| switching to UTF mode. |
| |
| Note that the PCRE2_MATCH_INVALID_UTF option (see below) forces |
| PCRE2_UTF to be set. |
| |
| In UTF mode, both the pattern and any subject strings that are matched |
| against it are treated as UTF strings instead of strings of individual |
| one-code-unit characters. There are also some other changes to the way |
| characters are handled, as documented below. |
| |
| |
| UNICODE PROPERTY SUPPORT |
| |
| When PCRE2 is built with Unicode support, the escape sequences \p{..}, |
| \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set- |
| ting. The Unicode properties that can be tested are a subset of those |
| that Perl supports. Currently they are limited to the general category |
| properties such as Lu for an upper case letter or Nd for a decimal num- |
| ber, the Unicode script names such as Arabic or Han, Bidi_Class, |
| Bidi_Control, and the derived properties Any and LC (synonym L&). Full |
| lists are given in the pcre2pattern and pcre2syntax documentation. In |
| general, only the short names for properties are supported. For exam- |
| ple, \p{L} matches a letter. Its longer synonym, \p{Letter}, is not |
| supported. Furthermore, in Perl, many properties may optionally be pre- |
| fixed by "Is", for compatibility with Perl 5.6. PCRE2 does not support |
| this. |
| |
| |
| WIDE CHARACTERS AND UTF MODES |
| |
| Code points less than 256 can be specified in patterns by either braced |
| or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3). |
| Larger values have to use braced sequences. Unbraced octal code points |
| up to \777 are also recognized; larger ones can be coded using \o{...}. |
| |
| The escape sequence \N{U+<hex digits>} is recognized as another way of |
| specifying a Unicode character by code point in a UTF mode. It is not |
| allowed in non-UTF mode. |
| |
| In UTF mode, repeat quantifiers apply to complete UTF characters, not |
| to individual code units. |
| |
| In UTF mode, the dot metacharacter matches one UTF character instead of |
| a single code unit. |
| |
| In UTF mode, capture group names are not restricted to ASCII, and may |
| contain any Unicode letters and decimal digits, as well as underscore. |
| |
| The escape sequence \C can be used to match a single code unit in UTF |
| mode, but its use can lead to some strange effects because it breaks up |
| multi-unit characters (see the description of \C in the pcre2pattern |
| documentation). For this reason, there is a build-time option that dis- |
| ables support for \C completely. There is also a less draconian com- |
| pile-time option for locking out the use of \C when a pattern is com- |
| piled. |
| |
| The use of \C is not supported by the alternative matching function |
| pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac- |
| ter may consist of more than one code unit. The use of \C in these |
| modes provokes a match-time error. Also, the JIT optimization does not |
| support \C in these modes. If JIT optimization is requested for a UTF-8 |
| or UTF-16 pattern that contains \C, it will not succeed, and so when |
| pcre2_match() is called, the matching will be carried out by the inter- |
| pretive function. |
| |
| The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test |
| characters of any code value, but, by default, the characters that |
| PCRE2 recognizes as digits, spaces, or word characters remain the same |
| set as in non-UTF mode, all with code points less than 256. This re- |
| mains true even when PCRE2 is built to include Unicode support, because |
| to do otherwise would slow down matching in many common cases. Note |
| that this also applies to \b and \B, because they are defined in terms |
| of \w and \W. If you want to test for a wider sense of, say, "digit", |
| you can use explicit Unicode property tests such as \p{Nd}. Alterna- |
| tively, if you set the PCRE2_UCP option, the way that the character es- |
| capes work is changed so that Unicode properties are used to determine |
| which characters match. There are more details in the section on |
| generic character types in the pcre2pattern documentation. |
| |
| Similarly, characters that match the POSIX named character classes are |
| all low-valued characters, unless the PCRE2_UCP option is set. |
| |
| However, the special horizontal and vertical white space matching es- |
| capes (\h, \H, \v, and \V) do match all the appropriate Unicode charac- |
| ters, whether or not PCRE2_UCP is set. |
| |
| |
| UNICODE CASE-EQUIVALENCE |
| |
| If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing |
| makes use of Unicode properties except for characters whose code points |
| are less than 128 and that have at most two case-equivalent values. For |
| these, a direct table lookup is used for speed. A few Unicode charac- |
| ters such as Greek sigma have more than two code points that are case- |
| equivalent, and these are treated specially. Setting PCRE2_UCP without |
| PCRE2_UTF allows Unicode-style case processing for non-UTF character |
| encodings such as UCS-2. |
| |
| |
| SCRIPT RUNS |
| |
| The pattern constructs (*script_run:...) and (*atomic_script_run:...), |
| with synonyms (*sr:...) and (*asr:...), verify that the string matched |
| within the parentheses is a script run. In concept, a script run is a |
| sequence of characters that are all from the same Unicode script. How- |
| ever, because some scripts are commonly used together, and because some |
| diacritical and other marks are used with multiple scripts, it is not |
| that simple. |
| |
| Every Unicode character has a Script property, mostly with a value cor- |
| responding to the name of a script, such as Latin, Greek, or Cyrillic. |
| There are also three special values: |
| |
| "Unknown" is used for code points that have not been assigned, and also |
| for the surrogate code points. In the PCRE2 32-bit library, characters |
| whose code points are greater than the Unicode maximum (U+10FFFF), |
| which are accessible only in non-UTF mode, are assigned the Unknown |
| script. |
| |
| "Common" is used for characters that are used with many scripts. These |
| include punctuation, emoji, mathematical, musical, and currency sym- |
| bols, and the ASCII digits 0 to 9. |
| |
| "Inherited" is used for characters such as diacritical marks that mod- |
| ify a previous character. These are considered to take on the script of |
| the character that they modify. |
| |
| Some Inherited characters are used with many scripts, but many of them |
| are only normally used with a small number of scripts. For example, |
| U+102E0 (Coptic Epact thousands mark) is used only with Arabic and Cop- |
| tic. In order to make it possible to check this, a Unicode property |
| called Script Extension exists. Its value is a list of scripts that ap- |
| ply to the character. For the majority of characters, the list contains |
| just one script, the same one as the Script property. However, for |
| characters such as U+102E0 more than one Script is listed. There are |
| also some Common characters that have a single, non-Common script in |
| their Script Extension list. |
| |
| The next section describes the basic rules for deciding whether a given |
| string of characters is a script run. Note, however, that there are |
| some special cases involving the Chinese Han script, and an additional |
| constraint for decimal digits. These are covered in subsequent sec- |
| tions. |
| |
| Basic script run rules |
| |
| A string that is less than two characters long is a script run. This is |
| the only case in which an Unknown character can be part of a script |
| run. Longer strings are checked using only the Script Extensions prop- |
| erty, not the basic Script property. |
| |
| If a character's Script Extension property is the single value "Inher- |
| ited", it is always accepted as part of a script run. This is also true |
| for the property "Common", subject to the checking of decimal digits |
| described below. All the remaining characters in a script run must have |
| at least one script in common in their Script Extension lists. In set- |
| theoretic terminology, the intersection of all the sets of scripts must |
| not be empty. |
| |
| A simple example is an Internet name such as "google.com". The letters |
| are all in the Latin script, and the dot is Common, so this string is a |
| script run. However, the Cyrillic letter "o" looks exactly the same as |
| the Latin "o"; a string that looks the same, but with Cyrillic "o"s is |
| not a script run. |
| |
| More interesting examples involve characters with more than one script |
| in their Script Extension. Consider the following characters: |
| |
| U+060C Arabic comma |
| U+06D4 Arabic full stop |
| |
| The first has the Script Extension list Arabic, Hanifi Rohingya, Syr- |
| iac, and Thaana; the second has just Arabic and Hanifi Rohingya. Both |
| of them could appear in script runs of either Arabic or Hanifi Ro- |
| hingya. The first could also appear in Syriac or Thaana script runs, |
| but the second could not. |
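| |
| The intersection rule can be sketched with one bit per script. This is |
| an illustration only: the SCX_ names and is_script_run() are |
| hypothetical, the caller must supply each character's Script Extension |
| set (no Unicode tables are included), and the Han and decimal digit |
| rules described in the following sections are not applied. |
| |

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit flags for a handful of scripts. */
enum {
    SCX_ARABIC          = 1u << 0,
    SCX_HANIFI_ROHINGYA = 1u << 1,
    SCX_SYRIAC          = 1u << 2,
    SCX_THAANA          = 1u << 3,
    SCX_LATIN           = 1u << 4,
    SCX_CYRILLIC        = 1u << 5
};

/* Common and Inherited characters are always accepted, so they are modelled
   as "every script": they never shrink the intersection. */
#define SCX_ANY (~0u)

/* scx[i] is the Script Extension set of the i-th character. */
bool is_script_run(const unsigned *scx, size_t n)
{
    if (n < 2)
        return true;              /* shorter than two characters */
    unsigned common = ~0u;
    for (size_t i = 0; i < n; i++)
        common &= scx[i];         /* set intersection */
    return common != 0;           /* the intersection must not be empty */
}
```

| With these sets, U+060C can sit between Syriac letters, U+06D4 cannot, |
| and a Latin name containing one Cyrillic letter is rejected. |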
| |
| The Chinese Han script |
| |
| The Chinese Han script is commonly used in conjunction with other |
| scripts for writing certain languages. Japanese uses the Hiragana and |
| Katakana scripts together with Han; Korean uses Hangul and Han; Tai- |
| wanese Mandarin uses Bopomofo and Han. These three combinations are |
| treated as special cases when checking script runs and are, in effect, |
| "virtual scripts". Thus, a script run may contain a mixture of Hira- |
| gana, Katakana, and Han, or a mixture of Hangul and Han, or a mixture |
| of Bopomofo and Han, but not, for example, a mixture of Hangul and |
| Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical |
| Standard 39 ("Unicode Security Mechanisms", |
| http://unicode.org/reports/tr39/) in allowing such mixtures. |
| |
| Decimal digits |
| |
| Unicode contains many sets of 10 decimal digits in different scripts, |
| and some scripts (including the Common script) contain more than one |
| set. Some of these decimal digits are visually indistinguishable |
| from the common ASCII digits. In addition to the script checking de- |
| scribed above, if a script run contains any decimal digits, they must |
| all come from the same set of 10 adjacent characters. |
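| |
| The digit constraint can be sketched as follows. The table holds only |
| four of Unicode's decimal digit sets and is an illustrative subset; the |
| function names are hypothetical and this is not PCRE2's implementation. |
| |

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* First code point of a few sets of 10 adjacent decimal digits. */
static const uint32_t digit_set_starts[] = {
    0x0030,   /* ASCII digits */
    0x0660,   /* Arabic-Indic digits */
    0x06F0,   /* Extended Arabic-Indic digits */
    0x0966    /* Devanagari digits */
};

/* Returns the first code point of the digit set containing cp, or 0 if cp
   is not a digit in the table above. */
uint32_t digit_set_of(uint32_t cp)
{
    size_t count = sizeof digit_set_starts / sizeof digit_set_starts[0];
    for (size_t i = 0; i < count; i++)
        if (cp >= digit_set_starts[i] && cp <= digit_set_starts[i] + 9)
            return digit_set_starts[i];
    return 0;
}

/* True if every decimal digit in cps[0..n) comes from the same set. */
bool digits_from_one_set(const uint32_t *cps, size_t n)
{
    uint32_t seen = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t set = digit_set_of(cps[i]);
        if (set == 0)
            continue;             /* not a digit known to this sketch */
        if (seen == 0)
            seen = set;           /* first digit fixes the set */
        else if (set != seen)
            return false;         /* digits from two different sets */
    }
    return true;
}
```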
| |
| |
| VALIDITY OF UTF STRINGS |
| |
| When the PCRE2_UTF option is set, the strings passed as patterns and |
| subjects are (by default) checked for validity on entry to the relevant |
| functions. If an invalid UTF string is passed, a negative error code is |
| returned. The code unit offset to the offending character can be ex- |
| tracted from the match data block by calling pcre2_get_startchar(), |
| which is used for this purpose after a UTF error. |
| |
| In some situations, you may already know that your strings are valid, |
| and therefore want to skip these checks in order to improve perfor- |
| mance, for example in the case of a long subject string that is being |
| scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com- |
| pile time or at match time, PCRE2 assumes that the pattern or subject |
| it is given (respectively) contains only valid UTF code unit sequences. |
| |
| If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the |
| result is undefined and your program may crash or loop indefinitely or |
| give incorrect results. There is, however, one mode of matching that |
| can handle invalid UTF subject strings. This is enabled by passing |
| PCRE2_MATCH_INVALID_UTF to pcre2_compile() and is discussed in the next |
| section. The rest of this section covers the case when |
| PCRE2_MATCH_INVALID_UTF is not set. |
| |
| Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the UTF |
| check for the pattern; it does not also apply to subject strings. If |
| you want to disable the check for a subject string you must pass this |
| same option to pcre2_match() or pcre2_dfa_match(). |
| |
| UTF-16 and UTF-32 strings can indicate their endianness by a special |
| code known as a byte-order mark (BOM). The PCRE2 functions do not handle |
| this, expecting strings to be in host byte order. |
| |
| Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any |
| other processing takes place. In the case of pcre2_match() and |
| pcre2_dfa_match() calls with a non-zero starting offset, the check is |
| applied only to that part of the subject that could be inspected during |
| matching, and there is a check that the starting offset points to the |
| first code unit of a character or to the end of the subject. If there |
| are no lookbehind assertions in the pattern, the check starts at the |
| starting offset. Otherwise, it starts at the length of the longest |
| lookbehind before the starting offset, or at the start of the subject |
| if there are not that many characters before the starting offset. Note |
| that the sequences \b and \B are one-character lookbehinds. |
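| |
| For a UTF-8 subject, the two rules in this paragraph can be sketched as |
| below. The function names are hypothetical and the code is an |
| illustration of the rule, not a copy of PCRE2's source; max_lookbehind |
| stands for the length, in characters, of the pattern's longest |
| lookbehind. |
| |

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if offset is the end of the subject or does not point at a UTF-8
   continuation byte (bit pattern 10xxxxxx). */
bool at_char_boundary(const uint8_t *s, size_t len, size_t offset)
{
    assert(s != NULL && offset <= len);
    return offset == len || (s[offset] & 0xC0) != 0x80;
}

/* Returns the offset at which the validity check would begin: up to
   max_lookbehind characters before start_offset, or 0 if there are not
   that many characters before it. */
size_t utf_check_start(const uint8_t *s, size_t start_offset,
                       size_t max_lookbehind)
{
    size_t pos = start_offset;
    while (max_lookbehind-- > 0 && pos > 0) {
        pos--;                                 /* step back one byte */
        while (pos > 0 && (s[pos] & 0xC0) == 0x80)
            pos--;                             /* skip continuation bytes */
    }
    return pos;
}
```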
| |
| In addition to checking the format of the string, there is a check to |
| ensure that all code points lie in the range U+0 to U+10FFFF, excluding |
| the surrogate area. The so-called "non-character" code points are not |
| excluded because Unicode corrigendum #9 makes it clear that they should |
| not be. |
| |
| Characters in the "Surrogate Area" of Unicode are reserved for use by |
| UTF-16, where they are used in pairs to encode code points with values |
| greater than 0xFFFF. The code points that are encoded by UTF-16 pairs |
| are available independently in the UTF-8 and UTF-32 encodings. (In |
| other words, the whole surrogate thing is a fudge for UTF-16 which un- |
| fortunately messes up UTF-8 and UTF-32.) |
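| |
| The pairing arithmetic that UTF-16 uses is shown in the sketch below. |
| The function names are hypothetical; the constants are those of the |
| UTF-16 encoding itself. |
| |

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True for code points in the surrogate area, U+D800 to U+DFFF. */
bool is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Splits a code point greater than 0xFFFF into a high and a low surrogate. */
void to_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint32_t v = cp - 0x10000;                 /* 20 significant bits */
    *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}

/* Recombines a valid high/low pair into the code point it encodes. */
uint32_t from_surrogate_pair(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000 + (((uint32_t)(high - 0xD800) << 10) |
                      (uint32_t)(low - 0xDC00));
}
```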
| |
| Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error |
| that is given if an escape sequence for an invalid Unicode code point |
| is encountered in the pattern. If you want to allow escape sequences |
| such as \x{d800} (a surrogate code point) you can set the |
| PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is |
| possible only in UTF-8 and UTF-32 modes, because these values are not |
| representable in UTF-16. |
| |
| Errors in UTF-8 strings |
| |
| The following negative error codes are given for invalid UTF-8 strings: |
| |
| PCRE2_ERROR_UTF8_ERR1 |
| PCRE2_ERROR_UTF8_ERR2 |
| PCRE2_ERROR_UTF8_ERR3 |
| PCRE2_ERROR_UTF8_ERR4 |
| PCRE2_ERROR_UTF8_ERR5 |
| |
| The string ends with a truncated UTF-8 character; the code specifies |
| how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8 |
| characters to be no longer than 4 bytes, the encoding scheme (origi- |
| nally defined by RFC 2279) allows for up to 6 bytes, and this is |
| checked first; hence the possibility of 4 or 5 missing bytes. |
| |
| PCRE2_ERROR_UTF8_ERR6 |
| PCRE2_ERROR_UTF8_ERR7 |
| PCRE2_ERROR_UTF8_ERR8 |
| PCRE2_ERROR_UTF8_ERR9 |
| PCRE2_ERROR_UTF8_ERR10 |
| |
| The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of |
| the character do not have the binary value 0b10 (that is, either the |
| most significant bit is 0, or the next bit is 1). |
| |
| PCRE2_ERROR_UTF8_ERR11 |
| PCRE2_ERROR_UTF8_ERR12 |
| |
| A character that is valid by the RFC 2279 rules is either 5 or 6 bytes |
| long; these code points are excluded by RFC 3629. |
| |
| PCRE2_ERROR_UTF8_ERR13 |
| |
| A 4-byte character has a value greater than 0x10ffff; these code points |
| are excluded by RFC 3629. |
| |
| PCRE2_ERROR_UTF8_ERR14 |
| |
| A 3-byte character has a value in the range 0xd800 to 0xdfff; this |
| range of code points is reserved by RFC 3629 for use with UTF-16, and |
| so are excluded from UTF-8. |
| |
| PCRE2_ERROR_UTF8_ERR15 |
| PCRE2_ERROR_UTF8_ERR16 |
| PCRE2_ERROR_UTF8_ERR17 |
| PCRE2_ERROR_UTF8_ERR18 |
| PCRE2_ERROR_UTF8_ERR19 |
| |
| A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes |
| for a value that can be represented by fewer bytes, which is invalid. |
| For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor- |
| rect coding uses just one byte. |
| |
| PCRE2_ERROR_UTF8_ERR20 |
| |
| The two most significant bits of the first byte of a character have the |
| binary value 0b10 (that is, the most significant bit is 1 and the sec- |
| ond is 0). Such a byte can only validly occur as the second or subse- |
| quent byte of a multi-byte character. |
| |
| PCRE2_ERROR_UTF8_ERR21 |
| |
| The first byte of a character has the value 0xfe or 0xff. These values |
| can never occur in a valid UTF-8 string. |
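| |
| The categories above can be illustrated by a checker for one UTF-8 |
| character. This is a sketch under stated assumptions: the u8_status |
| names are hypothetical and group the PCRE2 codes by kind, the order of |
| the tests is a choice made here, and none of this is PCRE2's own code. |
| |

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    U8_OK,
    U8_TRUNCATED,            /* kind reported by ERR1 to ERR5 */
    U8_BAD_CONTINUATION,     /* ERR6 to ERR10 */
    U8_FIVE_OR_SIX_BYTES,    /* ERR11, ERR12 */
    U8_ABOVE_10FFFF,         /* ERR13 */
    U8_SURROGATE,            /* ERR14 */
    U8_OVERLONG,             /* ERR15 to ERR19 */
    U8_STRAY_CONTINUATION,   /* ERR20 */
    U8_FE_OR_FF              /* ERR21 */
} u8_status;

/* Checks the single character that starts at p[0]; len bytes are available. */
u8_status check_utf8_char(const uint8_t *p, size_t len)
{
    assert(p != NULL && len > 0);
    uint8_t c = p[0];
    if (c < 0x80) return U8_OK;                   /* one-byte character */
    if (c < 0xC0) return U8_STRAY_CONTINUATION;   /* first byte is 10xxxxxx */
    if (c >= 0xFE) return U8_FE_OR_FF;

    /* Total length under the original scheme of up to 6 bytes. */
    size_t total = c < 0xE0 ? 2 : c < 0xF0 ? 3 : c < 0xF8 ? 4 : c < 0xFC ? 5 : 6;
    if (len < total) return U8_TRUNCATED;

    /* Payload mask of the first byte and smallest value that needs this
       many bytes, both indexed by total length. */
    static const uint8_t lead_mask[] = {0, 0, 0x1F, 0x0F, 0x07, 0x03, 0x01};
    static const uint32_t min_value[] =
        {0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000};

    uint32_t value = c & lead_mask[total];
    for (size_t i = 1; i < total; i++) {
        if ((p[i] & 0xC0) != 0x80) return U8_BAD_CONTINUATION;
        value = (value << 6) | (uint32_t)(p[i] & 0x3F);
    }
    if (value < min_value[total]) return U8_OVERLONG;
    if (total > 4) return U8_FIVE_OR_SIX_BYTES;
    if (value > 0x10FFFF) return U8_ABOVE_10FFFF;
    if (value >= 0xD800 && value <= 0xDFFF) return U8_SURROGATE;
    return U8_OK;
}
```

| Applied to the two bytes 0xc0, 0xae from the example above, the |
| decoded value is 0x2e, which is below the 2-byte minimum of 0x80, so |
| the sketch reports an overlong character. |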
| |
| Errors in UTF-16 strings |
| |
| The following negative error codes are given for invalid UTF-16 |
| strings: |
| |
| PCRE2_ERROR_UTF16_ERR1 Missing low surrogate at end of string |
| PCRE2_ERROR_UTF16_ERR2 Invalid low surrogate follows high surrogate |
| PCRE2_ERROR_UTF16_ERR3 Isolated low surrogate |
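| |
| The three cases can be told apart with a single pass over the code |
| units. In this sketch check_utf16() is a hypothetical function whose |
| return values 1, 2, and 3 merely mirror the numbering of the codes |
| above; they are not the values of the PCRE2 constants. |
| |

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if s[0..len) is valid UTF-16. Otherwise returns 1, 2, or 3 and
   stores the code unit offset of the offending unit in *bad_offset. */
int check_utf16(const uint16_t *s, size_t len, size_t *bad_offset)
{
    assert(s != NULL && bad_offset != NULL);
    for (size_t i = 0; i < len; i++) {
        uint16_t u = s[i];
        if (u < 0xD800 || u > 0xDFFF)
            continue;                      /* not a surrogate */
        int err = 0;
        if (u >= 0xDC00)
            err = 3;                       /* isolated low surrogate */
        else if (i + 1 == len)
            err = 1;                       /* missing low surrogate at end */
        else if (s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
            err = 2;                       /* invalid low surrogate follows */
        if (err != 0) {
            *bad_offset = i;
            return err;
        }
        i++;                               /* skip low half of a valid pair */
    }
    return 0;
}
```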
| |
| |
| Errors in UTF-32 strings |
| |
| The following negative error codes are given for invalid UTF-32 |
| strings: |
| |
| PCRE2_ERROR_UTF32_ERR1 Surrogate character (0xd800 to 0xdfff) |
| PCRE2_ERROR_UTF32_ERR2 Code point is greater than 0x10ffff |
| |
| |
| MATCHING IN INVALID UTF STRINGS |
| |
| You can run pattern matches on subject strings that may contain invalid |
| UTF sequences if you call pcre2_compile() with the |
| PCRE2_MATCH_INVALID_UTF option. This is supported by pcre2_match(), |
| including JIT matching, but not by pcre2_dfa_match(). When |
| PCRE2_MATCH_INVALID_UTF is set, it forces PCRE2_UTF to be set as well. |
| Note, however, that the pattern itself must be a valid UTF string. |
| |
| Setting PCRE2_MATCH_INVALID_UTF does not affect what pcre2_compile() |
| generates, but if pcre2_jit_compile() is subsequently called, it does |
| generate different code. If JIT is not used, the option affects the |
| behaviour of the interpretive code in pcre2_match(). When |
| PCRE2_MATCH_INVALID_UTF is set at compile time, PCRE2_NO_UTF_CHECK is |
| ignored at match time. |
| |
| In this mode, an invalid code unit sequence in the subject never |
| matches any pattern item. It does not match dot, it does not match |
| \p{Any}, it does not even match negative items such as [^X]. A lookbe- |
| hind assertion fails if it encounters an invalid sequence while moving |
| the current point backwards. In other words, an invalid UTF code unit |
| sequence acts as a barrier which no match can cross. |
| |
| You can also think of this as the subject being split up into fragments |
| of valid UTF, delimited internally by invalid code unit sequences. The |
| pattern is matched fragment by fragment. The result of a successful |
| match, however, is given as code unit offsets in the entire subject |
| string in the usual way. There are a few points to consider: |
| |
| The internal boundaries are not interpreted as the beginnings or ends |
| of lines and so do not match circumflex or dollar characters in the |
| pattern. |
| |
| If pcre2_match() is called with an offset that points to an invalid |
| UTF sequence, that sequence is skipped, and the match starts at the |
| next valid UTF character, or the end of the subject. |
| |
| At internal fragment boundaries, \b and \B behave in the same way as at |
| the beginning and end of the subject. For example, a sequence such as |
| \bWORD\b would match an instance of WORD that is surrounded by invalid |
| UTF code units. |
| |
| Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbi- |
| trary data, knowing that any matched strings that are returned are |
| valid UTF. This can be useful when searching for UTF text in executable |
| or other binary files. |
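| |
| The idea of valid fragments separated by invalid sequences is easiest |
| to picture in the 32-bit case, where every code unit is either valid |
| or invalid on its own. The sketch below only illustrates the concept; |
| the function names are hypothetical and PCRE2 does not expose such an |
| interface. Offsets are reported relative to the whole subject. |
| |

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A UTF-32 code unit is valid if it is neither a surrogate nor greater
   than 0x10ffff. */
bool utf32_unit_ok(uint32_t u)
{
    return u <= 0x10FFFF && !(u >= 0xD800 && u <= 0xDFFF);
}

/* Finds the next fragment of valid code units at or after from. On success
   stores the half-open range [*start, *end) and returns true; returns
   false if no valid code unit remains. */
bool next_valid_fragment(const uint32_t *s, size_t len, size_t from,
                         size_t *start, size_t *end)
{
    assert(s != NULL && start != NULL && end != NULL);
    size_t i = from;
    while (i < len && !utf32_unit_ok(s[i]))
        i++;                          /* skip invalid code units */
    if (i >= len)
        return false;
    *start = i;
    while (i < len && utf32_unit_ok(s[i]))
        i++;                          /* extend over valid code units */
    *end = i;
    return true;
}
```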
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| Retired from University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 22 December 2021 |
| Copyright (c) 1997-2021 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |