|
PCRE Performance Project
About
The aim of PCRE-sljit project is speeding up the pattern matching speed of
Perl Compatible Regular Expressions library
(ftp download).
The task is achieved by using sljit,
a just-in-time (JIT) compilation library to translate machine code from the internal
byte-code representation generated by pcre_compile(). PCRE-sljit offers similar
matching speed to DFA based engines (like re2)
on common patterns but still keep PERL compatibility (see
here).
This work has been released as part of PCRE 8.20 and above. The JIT was improved a lot
in 8.32, and a new so called native interface was introduced. See the results.
About performance optimizations
Always do profiling! PCRE-JIT can only help you, if matching regular expressions
takes at least 4-5% of the total runtime. Otherwise you might not see any performance
increase (or you will see performance drops) due to the changes of the binary layout.
It is unfortunately less known, that inserting nops can increase or decrease the runtime
of any program by up to 3%, due the the CPU cache layout, branch prediction mechanisms,
etc. In artificial cases, the runtime change can be even bigger (±50% for example).
When any function is modified, even if the change is small, it affects the entire binary
layout, since the entry offset of other functions will be changed as well (especially those,
which are placed after this function in the executable by the linker). Therefore when the
ratio of matching regular expressions is very low, you might experience a slight performance
drop when using PCRE-JIT.
Usage
Because of compatibility reasons the JIT compilation must be explicitly requested
by software. First of all PCRE must be compiled with --enable-jit. (Note:
JIT compiler availability can be checked runtime by passing PCRE_CONFIG_JIT
to pcre_config().) The next step is performing the JIT compilation by passing
PCRE_STUDY_JIT_COMPILE, PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE or
PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE to pcre_study(). Regardless of
the success of JIT compilation, the returned extra data can be passed to pcre_exec(),
which use the machine code when approprite, or fallback to the interpreted code path
(Note: a few patterns are not supported by the JIT compiler). Passing PCRE_INFO_JIT
to pcre_fullinfo() can be used to check whether the JIT compilation is successful.
If the JIT compilation is succesful pcre[16|32]_jit_exec can be used to reach
even better performance. It is also mandatory to use pcre_free_study()
to free the machine code when the JIT compilation is succesful. Otherwise
pcre_free_study() is optional for compatibility reasons, but we suggest
to use it all the time.
Detailed info about the JIT compiler can be found in
pcrejit.3, which is part of the PCRE documentation. It is advisable to read the
"CONTROLLING THE JIT STACK" section there.
Motivation
The importance of pattern matching has been significantly grown in the last decade.
The increased computation power allows running more complex regular expressions in
an acceptable time frame. However, this is not the only way to speed up regular
expressions. The developers of JavaScript engines realized that their just-in-time
compilation engines are able to considerably improve the runtime of pattern matching.
Thus, JIT based regular expression engines were developed like
irregexp
from Google and
YARR (Yet Another Regex Runtime) from Apple. However, these engines are heavily tied
to their JavaScript engine (V8 from Google and JavaScriptCore from Apple), and other
software cannot benefit from this shiny new technology, until now. The aim of this
work is to extend a widely used regular expression library (PCRE) with a JIT compiler
engine, which can be easly adopted by any software already using PCRE.
How does it work?
First, a regular expression is compiled into an internal representation by pcre_compile().
The internal representation (usually called MIR - Middle Level Representation) is a sequence
of byte-codes. Each byte-code is basically a command, which tells the next step for the
interpreter inside pcre_exec(). The JIT compiler translates MIR to machine code
when the appropriate flags are passed to pcre_study(). The returned pcre_extra data
contains a pointer to a machine executable function if the machine code generation was successful.
Why is it faster?
JIT compilers totally eliminate the continual reparsing of MIR. Even if the MIR code is much
simpler than the original pattern string, the execution engine is full of ifs and
switches, and executing them consumes considerable time. The compiled machine
code only contains those machine instructions whose are absolutely necessary for this
particular pattern, and nothing more.
This advantage is also a limitation since JIT compiler does not support the
changing of compile time flags. One such flag is the newline type (PCRE supports
many of them). PCRE allows overwriting the compile time newline flags when
pcre_exec() is called. This is not supported by the JIT compiler,
and the regular expression will be fallbacks to the interpreter.
Compile time overhead
Deciding when to use or not use JIT compiling is an important question. Since JIT is a heavyweight
optimization we should never forget about the compile time overhead. Thus, the total
runtime can be bigger if we use the compiled expression only once on a small input.
The following values measured on an Intel 2.67GHz with GCC 4.4.5 in 64 bit mode
Note: ns means nanosecond (10-9), and Int. means interpreter (pcre_exec())
Pattern | Compile time (ns) | JIT compile time (ns) | JIT compile / compile time |
abc | 264 | 7932 | 30.05 |
a+(b*)[c|d]??e+? | 925 | 17186 | 18.58 |
\A.*\K[^a-z!]{8,}\Z | 925 | 11501 | 12.43 |
\b((.)\w{3,}\2{0,3})$ | 1057 | 21747 | 20.57 |
(?P<cap>\p{N}+|(?(R)a|(?R)))\1 | 1322 | 30869 | 23.35 |
^(?![a-zA-Z]*?[[:alpha:]]++)\P{L&} | 1520 | 20425 | 13.44 |
Results
The following patterns are searched on a non-utf8 input
and an utf8 input. The utf8 input was shorter
so it was repeated 4 times (with a simple memcpy()).
The performace is measured on an Intel 2.67GHz with GCC 4.4.5 in both 32 and 64 bit mode. The unit of the
runtime values are milliseconds (ms) which is 10-3 second. The PCRE library was compiled with
-O3 and -static compiler options to get the best performance form the compiler.
Description of the colums
- Pattern: the pattern string. The light blue lines are caseless, the light yellow lines are caseful matches.
- Int. (ms): interpreter (normal pcre_exec(...)) runtime in milliseconds.
- JIT (ms): JIT compiled machine code runtime in milliseconds.
- Runtime ratio: Interpreter runtime / JIT runtime.
- Runtime save %: Runtime saved by the JIT compiler. Equals to 1 - (JIT runtime / Interpreter runtime)
Pattern | Int. (ms) | JIT (ms) | Runtime ratio | Runtime save % |
how to | 60 | 40 | 1.50 | 33.3% |
^how to | 150 | 20 | 7.50 | 86.7% |
how( to|ever) | 60 | 30 | 2.00 | 50.0% |
walk|get|she|make | 240 | 70 | 3.43 | 70.8% |
[Ww]alk|[Gg]et|[Ss]he|[Mm]ake | 250 | 90 | 2.78 | 64.0% |
(?:ba|lu|ro)+(?:r|ck)* | 210 | 70 | 3.00 | 66.7% |
(?:ba|lu|ro)+(r|ck)* | 210 | 70 | 3.00 | 66.7% |
\w+our\w* | 1680 | 410 | 4.10 | 75.6% |
(?:^\w+$)|\b(\W)\1+\b | 1630 | 310 | 5.26 | 81.0% |
"[A-Z][a-z]*(?:(?:[,\s]|\R)(\s|\R)*[a-z]+){0,4}?[.!]" | 90 | 40 | 2.25 | 55.6% |
\b(?:(?=\w+ro\w+)(?=\w+th\w+)\w+hr\w+)\b | 1200 | 300 | 4.00 | 75.0% |
\b(?(?=\w+sh)\w+a|\w+uo)\w+\b | 1870 | 370 | 5.05 | 80.2% |
\b\WX{0,3}(?:(?:V?I{1,3})|V|IV|IX)\W\b | 820 | 180 | 4.56 | 78.0% |
((\w)+?)\.\b(\s)*?\w | 4380 | 540 | 8.11 | 87.7% |
(.{1,3})(\w*?\1)(?2) | 6060 | 1530 | 3.96 | 74.8% |
-?-?-?-?-?-?-?-?-?-?-?-----------$ | 600 | 100 | 6.00 | 83.3% |
| | | | |
Average: | 1219 | 260 | 4.16 | 70.6% |
Non-utf8 input on a 32 bit x86 machine.
Pattern | Int. (ms) | JIT (ms) | Runtime ratio | Runtime save % |
how to | 60 | 40 | 1.50 | 33.3% |
^how to | 180 | 40 | 4.50 | 77.8% |
how( to|ever) | 60 | 20 | 3.00 | 66.7% |
walk|get|she|make | 300 | 80 | 3.75 | 73.3% |
[Ww]alk|[Gg]et|[Ss]he|[Mm]ake | 280 | 100 | 2.80 | 64.3% |
(?:ba|lu|ro)+(?:r|ck)* | 220 | 80 | 2.75 | 63.6% |
(?:ba|lu|ro)+(r|ck)* | 240 | 80 | 3.00 | 66.7% |
\w+our\w* | 2020 | 360 | 5.61 | 82.2% |
(?:^\w+$)|\b(\W)\1+\b | 2060 | 340 | 6.06 | 83.5% |
"[A-Z][a-z]*(?:(?:[,\s]|\R)(\s|\R)*[a-z]+){0,4}?[.!]" | 80 | 40 | 2.00 | 50.0% |
\b(?:(?=\w+ro\w+)(?=\w+th\w+)\w+hr\w+)\b | 1340 | 300 | 4.47 | 77.6% |
\b(?(?=\w+sh)\w+a|\w+uo)\w+\b | 2060 | 360 | 5.72 | 82.5% |
\b\WX{0,3}(?:(?:V?I{1,3})|V|IV|IX)\W\b | 940 | 160 | 5.88 | 83.0% |
((\w)+?)\.\b(\s)*?\w | 5660 | 520 | 10.88 | 90.8% |
(.{1,3})(\w*?\1)(?2) | 7200 | 1620 | 4.44 | 77.5% |
-?-?-?-?-?-?-?-?-?-?-?-----------$ | 680 | 100 | 6.80 | 85.3% |
| | | | |
Average: | 1461 | 265 | 4.57 | 72.4% |
Non-utf8 input on a 64 bit x86 machine.
Pattern | Int. (ms) | JIT (ms) | Runtime ratio | Runtime save % |
die der | 10 | 10 | 1.00 | 0.0% |
ist|der|die|und | 100 | 20 | 5.00 | 80.0% |
\b\w+\b | 220 | 140 | 1.57 | 36.4% |
(?:da|ge|om)+(?:n|me)* | 60 | 20 | 3.00 | 66.7% |
\b(?(?=\w+ro)\w+pa|\w+lle)\w+\b | 560 | 300 | 1.87 | 46.4% |
\b(\W)\1+\b|(^(?=.*kl)(?=.*no).{15,40}$) | 680 | 200 | 3.40 | 70.6% |
^.{4,32}(\P{N})\1{2,}.{4,32}(?<![nuk])$ | 240 | 60 | 4.00 | 75.0% |
^(\w{3,})(?!\1).*\h.*\1$ | 5000 | 2180 | 2.29 | 56.4% |
((\w{2,8},?(\P{Z}|\R)){1,2}\.\s?)$ | 4320 | 1380 | 3.13 | 68.1% |
\b\w*?((.){1,3}\w*\2)\w*?(?1) | 1720 | 700 | 2.46 | 59.3% |
\w*?(b{2,3})\w*?c | 1120 | 400 | 2.80 | 64.3% |
\P{Lu}\P{L&}{0,12}[\s\-]{1,4}..[\P{L}\P{N}]{4} | 300 | 100 | 3.00 | 66.7% |
\b(\B([c-h])\B|[a-z]+?(?1)[a-z]) | 900 | 280 | 3.21 | 68.9% |
| | | | |
Average: | 1172 | 446 | 2.83 | 58.4% |
Utf8 input on a 32 bit x86 machine.
Pattern | Int. (ms) | JIT (ms) | Runtime ratio | Runtime save % |
die der | 10 | 10 | 1.00 | 0.0% |
ist|der|die|und | 100 | 30 | 3.33 | 70.0% |
\b\w+\b | 220 | 110 | 2.00 | 50.0% |
(?:da|ge|om)+(?:n|me)* | 80 | 20 | 4.00 | 75.0% |
\b(?(?=\w+ro)\w+pa|\w+lle)\w+\b | 540 | 230 | 2.35 | 57.4% |
\b(\W)\1+\b|(^(?=.*kl)(?=.*no).{15,40}$) | 640 | 170 | 3.76 | 73.4% |
^.{4,32}(\P{N})\1{2,}.{4,32}(?<![nuk])$ | 200 | 50 | 4.00 | 75.0% |
^(\w{3,})(?!\1).*\h.*\1$ | 4140 | 1710 | 2.42 | 58.7% |
((\w{2,8},?(\P{Z}|\R)){1,2}\.\s?)$ | 4090 | 1080 | 3.79 | 73.6% |
\b\w*?((.){1,3}\w*\2)\w*?(?1) | 1490 | 550 | 2.71 | 63.1% |
\w*?(b{2,3})\w*?c | 1130 | 290 | 3.90 | 74.3% |
\P{Lu}\P{L&}{0,12}[\s\-]{1,4}..[\P{L}\P{N}]{4} | 270 | 80 | 3.38 | 70.4% |
\b(\B([c-h])\B|[a-z]+?(?1)[a-z]) | 750 | 240 | 3.12 | 68.0% |
| | | | |
Average: | 1050 | 351 | 3.06 | 62.2% |
Utf8 input on a 64 bit x86 machine.
Conclusion
JIT compiling is a powerful technology, which is able to speed up interpreted execution
including Java, JavaScript, ActionScript (with NanoJIT) or regular expression engines.
Contribution
I would like to encourage all of you to join the PCRE Performance Project, and
make PCRE faster than ever. My email address is hzmester (at) freemail (dot) hu.
Special thanks
First and foremost, the author thanks Philip Hazel for his continual help and support.
Sorted as familiy name in alphabetic order:
Last modification: 09.07.2014 |
| |