Scan Substring in PCRE2

About

Scan substring (scs for short) is a new type of non-atomic assertion supported by PCRE2 since version 10.45. It rematches the content of a capturing group against a regular expression pattern.

Syntax:

(*scan_substring:(CAPTURE_LIST)PATTERN)
(*scs:(CAPTURE_LIST)PATTERN)

Description:

Scan substring is an assertion, so it matches an empty string, and an (*ACCEPT) control verb inside it completes only the assertion, not the whole match. Since it is a non-atomic assertion, backtracking into its PATTERN is possible after a successful match, similar to the (*napla:PATTERN) assertion. Scan substring can be made atomic by wrapping it in an atomic group: (?>(*scs:(CAPTURE_LIST)PATTERN)).

Unlike most other assertions, scan substring has an argument called the capture list. This argument is enclosed in parentheses and must be placed immediately after the (*scan_substring: or (*scs: prefix. The capture list is a comma-separated list of capturing group references. These references can be absolute or relative numbers, or capturing group names, e.g.: (7,+5,-2) or (<NAME1>,'NAME2'). The references are checked in declaration order. The first capturing group which is set (it was successfully matched before) is used as the substring for the scan substring assertion, even if the substring is an empty string. In the last example the NAME1 group is checked first, and if it is not set, then the NAME2 group is checked. If no capturing group in the list is set, the scan substring assertion fails to match.
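The selection rule for the capture list can be sketched in Python. This is only an illustration of the rule described above, not PCRE2 code: Python's re module has no scan substring assertion, the helper name first_set_group is hypothetical, and relative references such as +5 or -2 are not modeled.

```python
import re

def first_set_group(match, capture_list):
    """Return the span of the first group in capture_list that is set, or None.

    A group counts as set when it took part in the match, even if it
    captured an empty string. (Hypothetical helper, for illustration only.)
    """
    for ref in capture_list:
        if match.group(ref) is not None:
            return match.span(ref)
    return None  # no group is set: the assertion would fail

PAIR = re.compile(r"(?P<NAME1>a+)?(?P<NAME2>b*)")

# NAME1 is set, so it wins even though NAME2 is set as well.
both = first_set_group(PAIR.match("aab"), ["NAME1", "NAME2"])

# NAME1 is unset; NAME2 is set to an empty string and is still selected.
empty = first_set_group(PAIR.match(""), ["NAME1", "NAME2"])

# Neither optional group takes part in the match.
none_set = first_set_group(re.match(r"(a)?(b)?c", "c"), [1, 2])
```

Here both is the span of "aa", empty is a zero-length span, and none_set is None, which corresponds to the case where the assertion fails to match.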

If the substring is successfully found, the sub-pattern represented by PATTERN is matched from the beginning of the substring. Furthermore, the end of the substring is used as the end of the subject (input) string, so PATTERN cannot match any character beyond it. This limitation affects PATTERN only. When scan substring assertions are nested, they are independent of each other and can use different subject ends depending on their substring. Lookbehind assertions can check the characters before the beginning of the substring.
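A rough analogy for these boundary rules exists in Python's re module, where Pattern.match(string, pos, endpos) starts matching at pos, treats endpos as the end of the string, and still lets lookbehinds see the text before pos. The sketch below uses that analogy only; it is not PCRE2 code, and the helper name rematch and the sample subject are assumptions made for illustration.

```python
import re

def rematch(pattern, subject, span):
    """Match pattern at the start of span, with the end of span acting as
    the end of the subject. (Hypothetical helper, for illustration only.)"""
    start, end = span
    return re.compile(pattern).match(subject, start, end)

subject = "key=[alpha]beta"
span = re.search(r"\[(\w+)\]", subject).span(1)  # span of "alpha"

# "$" matches at the end of the substring, not at the end of the subject.
inside = rematch(r"\w+$", subject, span)

# The "]" right after the substring is out of reach for the pattern.
beyond = rematch(r"alpha\]", subject, span)

# A lookbehind can still inspect the "[" before the substring.
behind = rematch(r"(?<=\[)alpha", subject, span)
```

With this subject, inside and behind are match objects while beyond is None, which mirrors the three rules stated above: anchored start, truncated end, and visible preceding text.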

Examples:

The following pattern searches for //... and /*...*/ comments in a text, and then checks that both the AA and BB strings are present in the comment:
/(?:\/\/(.*)|\/\*((?s).*?)\*\/)(*SKIP)(?s)(*scs:(1,2).*?AA)(*scs:(1,2).*?BB)/
The \/\/(.*) part searches for //... comments, and the text inside the comment is captured by capturing group 1. Similarly, \/\*((?s).*?)\*\/ searches for /*...*/ comments, and the text inside the comment is captured by capturing group 2. The (*SKIP) control verb ensures that /*../*..*/ comments are processed as a single comment, not as multiple comments. The (*scs:(1,2).*?AA) assertion searches for AA inside the comment text, regardless of which type of comment is found, and (*scs:(1,2).*?BB) does the same for BB. The order of AA and BB in the comment text does not matter, since scan substring is used twice.
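The behavior of this pattern can be approximated step by step in Python. This is an emulation under stated assumptions, not the PCRE2 pattern itself: Python's re module supports neither (*scs:...) nor (*SKIP), so the two assertions become two bounded match calls, and the non-overlapping iteration of finditer stands in for (*SKIP). The function name comments_with_both is hypothetical.

```python
import re

# Group 1: text of a //... comment; group 2: text of a /*...*/ comment.
COMMENT = re.compile(r"//(.*)|/\*((?s:.*?))\*/")
HAS_AA = re.compile(r".*?AA", re.DOTALL)
HAS_BB = re.compile(r".*?BB", re.DOTALL)

def comments_with_both(text):
    """Return the comments whose text contains both AA and BB, in any order."""
    found = []
    for m in COMMENT.finditer(text):
        # Capture list (1,2): use the first group that is set.
        group = 1 if m.group(1) is not None else 2
        start, end = m.span(group)
        # Each check starts at the beginning of the comment text and
        # cannot read past its end, like the two scan substring assertions.
        if HAS_AA.match(text, start, end) and HAS_BB.match(text, start, end):
            found.append(m.group(0))
    return found

sample = "// BB then AA\n/* only AA */\n/* AA\nBB */\n// /* AA\nBB */"
result = comments_with_both(sample)
```

In the sample, the last line comment contains /* AA but not BB, and the text after it is never reconsidered as the start of a block comment. That is the situation the (*SKIP) verb handles in the original pattern.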

The next pattern searches for a "Password" string enclosed in <table></table> tags. The enclosed text may contain other, nested tags, including other table tags.
/(<(?=table>)((\w+)>([^<]*+<(?!\/)(?2))*[^<]*+<\/\3>))(*SKIP)(*scs:(1)(?s).*?Password)/
This pattern uses recursion. A recursion restores the capturing groups to their previous values when it returns, so extracting information from inside a recursion is difficult. Instead, the scan substring assertion is used to search the enclosed text, which is stored in a capturing group.
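The same two-phase idea (first delimit the balanced <table> element, then search only inside it) can be sketched in Python. Python's re module has no recursion and no scan substring, so the recursive part is replaced here by a small hand-written scanner. The function names element_end and tables_with_password are hypothetical, and the scanner only handles attribute-free tags of the form <name> and </name>, as the pattern above does.

```python
import re

OPEN_TAG = re.compile(r"<(\w+)>")

def element_end(text, pos):
    """Return the index just past the balanced element that starts at
    text[pos], or None if the element is not properly closed."""
    opened = OPEN_TAG.match(text, pos)
    if opened is None:
        return None
    name, pos = opened.group(1), opened.end()
    while True:
        lt = text.find("<", pos)
        if lt == -1:
            return None
        if text.startswith("</", lt):
            closing = "</" + name + ">"
            return lt + len(closing) if text.startswith(closing, lt) else None
        pos = element_end(text, lt)  # nested element, like the (?2) recursion
        if pos is None:
            return None

def tables_with_password(text):
    """Return each outermost balanced <table> element containing "Password"."""
    found, pos = [], 0
    while True:
        start = text.find("<table>", pos)
        if start == -1:
            return found
        end = element_end(text, start)
        if end is None:
            pos = start + 1  # not balanced: retry at the next position
            continue
        if "Password" in text[start:end]:  # search limited to the element
            found.append(text[start:end])
        pos = end  # continue after the element, similar to (*SKIP)

page = ("<table><tr><td>User</td></tr></table> "
        "<table><tr><td><table>Password</table></td></tr></table>")
hits = tables_with_password(page)
```

Because the search runs over the delimited text instead of relying on values captured during the nested scan, nothing has to be carried out of the recursion, which is the point made above.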

Last modification: 05.10.2024