Scan Substring in PCRE2
About
Scan substring (scs for short) is a new type of non-atomic assertion supported by PCRE2 from
version 10.45. It allows rematching the content of a capture block using a regular expression
pattern.
Syntax:
(*scan_substring:(CAPTURE_LIST)PATTERN)
(*scs:(CAPTURE_LIST)PATTERN)
Description:
Scan substring is an assertion, so it matches an empty string and catches the
(*ACCEPT) control verb: an (*ACCEPT) inside its PATTERN ends the assertion
successfully rather than the whole match. Since it is a non-atomic assertion, backtracking
into its PATTERN is possible after a successful match, similar to the
(*napla:PATTERN) assertion. Scan substring can be made atomic by using
an atomic block: (?>(*scs:(CAPTURE_LIST)PATTERN)).
Unlike most other assertions, scan substring takes an argument called the capture list. This argument
is enclosed in parentheses and must be placed immediately after the (*scan_substring:
or (*scs: prefix. The capture list is a comma-separated list of
capturing group references. These references can be absolute or relative numbers, or capturing
group names, e.g. (7,+5,-2) or (<NAME1>,'NAME2').
The capture references are checked in declaration order. The first capturing group that is
set (that is, it has already matched successfully) is used as the substring for the scan substring
assertion, even if the substring is an empty string. In the (<NAME1>,'NAME2') example, the NAME1
group is checked first, and if it is not set, then the NAME2 group is
checked. If no capturing group in the list is set, the scan substring assertion fails to match.
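The selection rule above can be sketched outside PCRE2. The following is a minimal illustration in Python's re module, which does not support scan substring itself; the helper name first_set_group is hypothetical, and relative references such as +5 or -2 are not emulated.

```python
import re

def first_set_group(match, refs):
    """Return the (start, end) span of the first group in refs that is set.

    refs holds group numbers or names, checked in the order given.
    Returns None when no listed group is set, which corresponds to the
    scan substring assertion failing to match.
    """
    for ref in refs:
        start, end = match.span(ref)
        if start != -1:  # Python reports (-1, -1) for an unset group
            return start, end
    return None

# NAME1 is unset here, while NAME2 is set but empty, so NAME2 is selected.
m = re.match(r"(?:(?P<NAME1>a+)|(?P<NAME2>b*))c", "c")
selected = first_set_group(m, ["NAME1", "NAME2"])  # (0, 0)
```

Note that a group which matched an empty string still counts as set, so it is selected ahead of any later group in the list.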
If the substring is successfully found, the sub-pattern represented by the PATTERN
is matched from the beginning of the substring. Furthermore, the end of the substring is used
as the end of the subject (input) string, so the PATTERN cannot match
any character beyond that. This limitation affects the PATTERN only.
When scan substring assertions are nested, they are independent from each other and they can
use different subject ends depending on their substring. Lookbehind assertions can check
the characters before the beginning of the substring.
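The boundary behaviour described above can be approximated with the pos and endpos arguments of Python's Pattern.match, which also treat endpos as the end of the string while still letting lookbehinds see earlier characters. This is only an analogy under that assumption, not PCRE2 itself; the helper name scan_substring and the sample subject are hypothetical, and other details (such as how ^ behaves) may differ.

```python
import re

def scan_substring(subject, span, pattern):
    """Match pattern at the start of span, treating the end of span as the subject end."""
    start, end = span
    return re.compile(pattern).match(subject, start, end)

subject = "x=abc;def"
span = (2, 5)  # the substring "abc"

inside = scan_substring(subject, span, r"\w+")           # matches "abc"
beyond = scan_substring(subject, span, r".*def")         # None: "def" lies past the end
ahead = scan_substring(subject, span, r"abc(?=;)")       # None: ";" is not visible
behind = scan_substring(subject, span, r"(?<==)abc$")    # matches: "=" before the start is visible
```

The last two calls show the asymmetry: characters after the substring are hidden from the pattern, while a lookbehind can still inspect what comes before it.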
Examples:
The following pattern searches for //... and /*...*/
comments in a text, and then checks that both the AA and BB strings are present in the comment:
/(?:\/\/(.*)|\/\*((?s).*?)\*\/)(*SKIP)(?s)(*scs:(1,2).*?AA)(*scs:(1,2).*?BB)/
The \/\/(.*) part searches for //... comments,
and the text inside the comment is captured by capturing group 1. Similarly,
\/\*((?s).*?)\*\/ searches for /*...*/
comments, and the text inside the comment is captured by capturing group 2.
The (*SKIP) control verb ensures that
/*../*..*/ comments are processed as a single comment,
not as multiple comments. The (*scs:(1,2).*?AA) assertion
searches for AA inside the comment text, regardless of which type of comment is found,
and (*scs:(1,2).*?BB) does the same for BB.
The order of AA and BB in the comment text does not matter, since each scan substring
assertion starts again from the beginning of the comment text.
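For comparison, the same check can be written as two explicit steps in a language without scan substring. The sketch below uses Python's re module and is an emulation, not the PCRE2 pattern itself; the function name comments_with_both and the sample text are hypothetical. Python's finditer continues after each whole comment, which plays roughly the role that (*SKIP) has in the PCRE2 pattern.

```python
import re

# Group 1 holds the text of a //... comment, group 2 the text of a /*...*/ comment.
COMMENT = re.compile(r"//(.*)|/\*((?s:.*?))\*/")

def comments_with_both(text, words=("AA", "BB")):
    """Return the comments whose inner text contains every word, in any order."""
    found = []
    for m in COMMENT.finditer(text):
        # Use the first set group, as the capture list (1,2) would.
        group = 1 if m.start(1) != -1 else 2
        start, end = m.span(group)
        # Each word is searched from the start of the comment text, and
        # endpos keeps the search from running past the end of that text.
        if all(re.compile(r"(?s).*?" + re.escape(w)).match(text, start, end)
               for w in words):
            found.append(m.group(0))
    return found

sample = "x // BB then AA\n/* AA only */ y /* BB\nAA */ // none"
result = comments_with_both(sample)
```

With this sample, only the first //... comment and the second /*...*/ comment contain both words; the other two are rejected.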
The next pattern searches for a "Password" string enclosed in
<table></table> tags. The enclosed
text may contain other, nested tags, including other table tags.
/(<(?=table>)((\w+)>([^<]*+<(?!\/)(?2))*[^<]*+<\/\3>))(*SKIP)(*scs:(1)(?s).*?Password)/
This pattern uses recursion. When a recursion returns, the capturing groups are
restored to the values they had before it, so extracting information from a
recursion is difficult. Instead, the scan substring assertion is used to search
the enclosed text, which is stored in a capturing group.
Last modification: 05.10.2024