JIT Compiler of PCRE2

Introduction

PCRE2 is a C library for matching regular expressions. This library translates patterns into byte code, and this byte code can be executed by its interpreter. The library also has a just-in-time (JIT) compiler which can produce machine code from the byte code to further improve the matching performance. Before going into the details of the JIT compiler, the reason of choosing byte code as an internal representation is explained first.

Regular expressions were originally designed for defining a set of words using special patterns, and engines only needed to decide whether a given word is in the set. Eventually patterns become a list of commands, and pattern matching engines become script languages. Today many constructs of structured programming are available in regular expressions (pcre2pattern / perlre) such as loops (called repetitions), if-else selection (called conditional blocks), function calls (called recursions), returns and breaks (called control verbs). The execution of each command is accurately defined which allows fine control over the entire matching process. Patterns can also store information in local variables (called capturing blocks and marks) and these variables are often available after the matching is completed. Byte code is used as an internal representation in many script languages, and modern regular expression engines are no different. There are several methods to improve the performance of byte code execution (including JIT compiling), but even the fastest script language can do very little to improve the speed of a slow bubble sort. Hence it is recommended to carefully design patterns to avoid slow matching speed.

Structure of the generated code

Regular expression patterns are a sequence of sub-patterns which are matched in concatenation order. When a sub-pattern matches successfully, the engine tries to match the next sub-pattern, and when its matching is failed, the engine falls back to the previous sub-pattern. If the previous sub-pattern has multiple options for matching, and another option does match, the engine will try to match the next sub-pattern again. This technique is called backtracking.

The code generated by PCRE2 JIT is specialized for this execution order. For each sub-pattern the compiler generates two code paths called matching and backtracking path. The order of matching paths in the generated code is the same as the original concatenation order since the next sub-pattern must be matched after a successful match. However the backtracking paths are generated in a reversed concatenation order because the engine must fall back to the previous sub-pattern after a failed match. Control flow is shown in the next figure where green and red arrows show the possible control transfers from matching and backtracking paths respectively. Due to the code generation order described before, no jump instructions are needed for control transfers represented by horizontal arrows.
Control flow
The next figure shows the structure of the machine code generated from the /a(?:\w\d)+d/ pattern:

     # Start of matching paths
   ENTER
     # Matching path of matching the "a" letter
   MATCH "a", IF FAILS GOTO L6
     # Matching path of (?:)+
   STACK_PUSH NULL
L1:
     # Matching path of matching the \w special character
   MATCH WORD_CHARACTER, IF FAILS GOTO L5
     # Matching path of matching the \d special character
   MATCH DIGIT_CHARACTER, IF FAILS GOTO L4
     # Continue the matching path of (?:)+
   STACK_PUSH STRING_POINTER
   GOTO L1
L2:
     # Matching path of matching the "d" letter
   MATCH "d", IF FAILS GOTO L3
   RETURN SUCCESS

     # Start of backtracking paths
     # Backtracking path of matching the "d" letter (empty)
L3:
     # Backtracking path of (?:)+ (empty)
     # Backtracking path of matching the dot special character (empty)
L4:
     # Backtracking path of matching the "b" letter (empty)
L5:
     # Continue backtracking path of (?:)+
   STRING_POINTER = STACK_POP
   IF STRING_POINTER != NULL GOTO L2
     # Backtracking path of matching the "a" letter (empty)
L6:
   RETURN FAIL
When a matching path has only one alternative (e.g. matching the letter "a"), the backtracking path is often empty, i.e. no instructions are generated at all. Some constructs may contain the code paths of other sub-patterns. In our example above, the code paths of \w\d sub-pattern is nested into the code paths of (?:)+. Furthermore the current position in the input is stored in the STRING_POINTER variable. This variable must have the correct value when the execution enters into a matching path, but it is undefined for backtracking paths. Hence the backtracking path must restore (or set) it before jumps to a matching path. This restoration can be seen in the backtracking path of (?:)+.

More about this method can be found in this research paper. In the followings some optimizations done by the compiler is shown.

Single character repetition optimizations

Sometimes multiple letters represent a single character, such as "sh" in English or "sch" in German. These sequences cannot be broken into smaller parts, so perl calls them atomic blocks. Another well known example is \r\n which represents a single newline. Although \r and \n are also newlines, the /x\R{2}y/ pattern does not match to x\r\ny, because there is only one newline between "x" and "y" letters. Atomic blocks can speed up matching, since they can eliminate many retries. For example /a+b/ and /a++b/ matches to the same strings since no "b" can be present in a sequence of "a"-s. Hence the engine automatically converts the repetitions to the latter form whenever it has no side effects.

The JIT compiler can speed up repetitions followed by a single character as well. An example is /\w+x/, where \w+ cannot be converted to \w++ because "x" is included in \w character class. The code generated for \w+ also records the first and last occurrence of "x" and backtracking is limited to this range. This optimization has the highest efficiency when at most one "x" is found during the repetition.

The (*SKIP) control verb allows moving forward in the input. For example, this verb can be used to search identifiers in a source code excluding comments and string literals. To do that, the pattern should also look for comments and string literals but it skips these blocks instead of returning with a match. Skip can be used to improve performance as well. The /\w+!/ and /\w+(*SKIP)!/ patterns match to the same strings because when \w+ is matched to a range of characters, but the match is failed afterwards, then any other match attempts within this range must fail as well. Therefore the search can be resumed from the end of the range which eliminates several match attempts. Besides adding (*SKIP) verbs to the pattern, the JIT compiler can also enhance iterators to remember the range of their last successful match, and any future matching attempts within this range triggers a fail.

Character class optimizations

Checking that a character is in a set of characters (called character classes) is a frequent operation in pattern matching. When caseless mode is enabled, even a simple pattern character may match to different characters especially when UTF is enabled.

The most basic optimization is caseless compare with a single compare instruction if there is only one bit difference between the binary representation of the lower and upper case of a character. For example, ascii x and X differs only in bit 6, so (chr|0x20)=='x' performs a caseless compare, and it is faster than doing two compares. This can be extended to ranges, so the [f-xF-X] range can be checked with this formula:((chr|0x20)-'f')<('x'-'f'). Optimizing these cases is no different from optimizing conditional expressions in static compilers such as GCC.

The JIT compiler also has some UTF related optimizations. For example, if a class is limited to characters less than 128 in UTF8 mode, reading only the next byte is enough and it is faster than decoding the UTF8 character. This optimization is not limited to the previous case, the compiler can generate specialized readers which do not decode characters outside of a given range. When invalid UTF8 parsing is enabled, the character reader first checks whether the input buffer has at least four remaining bytes. This is nearly always true, and then the character can be decoded without further boundary checks since four is the size of the longest UTF8 character.

The JIT compiler generates helper functions for several operations, such as decoding complex UTF cases. These functions are extremely lightweight: they don't set up a call frame or save any CPU registers, and can even modify the status of the caller. The effect is quite similar to inlining, except that the function is actually called. Must CPUs has return address predictors, so calling these functions are cheap operations.

Literal optimizations

Literals are fixed character sequences and they are often found in patterns. Comparing them to the input is a frequent task which normally requires the same number of steps as the length of the literal. However CPU registers are often large enough to hold multiple characters so the number of comparisons can be reduced as long as the CPU can read a whole register from any starting address. Such read is called unaligned access and requires hardware support.

Single Instruction Multiple Data (SIMD) operations can speed up literal searching. These operations can process the input efficiently since they use large registers which can hold several input characters. The JIT compiler uses SIMD for searching characters and character pairs. Searching a pair is based on an observation that the chance of finding a pair is often much lower than finding a single character. The only condition is that the two characters of a pair must be different since searching two spaces instead of one might not give better results. The compiler tries to detect the pair with the lowest match probability, so it will choose the "a" and "c" characters from the /a.(?i:b).c/ pattern.

Summary

PCRE2 is among the most feature rich regular expression libraries. It supports many regular expression constructs and it has many operating modes. Its JIT compiler tries to optimize many of these features with static compiler optimizations, and this text introduces some of these optimizations. Besides these optimizations, directed pattern transformations can also help to improve performance, but that is the focus of other projects such as repan.

Last modification: 08.11.2020