PCRE Performance Project

About

The aim of PCRE-sljit project is speeding up the pattern matching speed of Perl Compatible Regular Expressions library (ftp download). The task is achieved by using sljit, a just-in-time (JIT) compilation library to translate machine code from the internal byte-code representation generated by pcre_compile(). PCRE-sljit offers similar matching speed to DFA based engines (like re2) on common patterns but still keep PERL compatibility (see here).

This work has been released as part of PCRE 8.20 and above. The JIT was improved a lot in 8.32, and a new so called native interface was introduced. See the results.

About performance optimizations

Always do profiling! PCRE-JIT can only help you, if matching regular expressions takes at least 4-5% of the total runtime. Otherwise you might not see any performance increase (or you will see performance drops) due to the changes of the binary layout. It is unfortunately less known, that inserting nops can increase or decrease the runtime of any program by up to 3%, due the the CPU cache layout, branch prediction mechanisms, etc. In artificial cases, the runtime change can be even bigger (±50% for example). When any function is modified, even if the change is small, it affects the entire binary layout, since the entry offset of other functions will be changed as well (especially those, which are placed after this function in the executable by the linker). Therefore when the ratio of matching regular expressions is very low, you might experience a slight performance drop when using PCRE-JIT.

Usage

Because of compatibility reasons the JIT compilation must be explicitly requested by software. First of all PCRE must be compiled with --enable-jit. (Note: JIT compiler availability can be checked runtime by passing PCRE_CONFIG_JIT to pcre_config().) The next step is performing the JIT compilation by passing PCRE_STUDY_JIT_COMPILE, PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE or PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE to pcre_study(). Regardless of the success of JIT compilation, the returned extra data can be passed to pcre_exec(), which use the machine code when approprite, or fallback to the interpreted code path (Note: a few patterns are not supported by the JIT compiler). Passing PCRE_INFO_JIT to pcre_fullinfo() can be used to check whether the JIT compilation is successful. If the JIT compilation is succesful pcre[16|32]_jit_exec can be used to reach even better performance. It is also mandatory to use pcre_free_study() to free the machine code when the JIT compilation is succesful. Otherwise pcre_free_study() is optional for compatibility reasons, but we suggest to use it all the time.

Detailed info about the JIT compiler can be found in pcrejit.3, which is part of the PCRE documentation. It is advisable to read the "CONTROLLING THE JIT STACK" section there.

Motivation

The importance of pattern matching has been significantly grown in the last decade. The increased computation power allows running more complex regular expressions in an acceptable time frame. However, this is not the only way to speed up regular expressions. The developers of JavaScript engines realized that their just-in-time compilation engines are able to considerably improve the runtime of pattern matching. Thus, JIT based regular expression engines were developed like irregexp from Google and YARR (Yet Another Regex Runtime) from Apple. However, these engines are heavily tied to their JavaScript engine (V8 from Google and JavaScriptCore from Apple), and other software cannot benefit from this shiny new technology, until now. The aim of this work is to extend a widely used regular expression library (PCRE) with a JIT compiler engine, which can be easly adopted by any software already using PCRE.

How does it work?

First, a regular expression is compiled into an internal representation by pcre_compile(). The internal representation (usually called MIR - Middle Level Representation) is a sequence of byte-codes. Each byte-code is basically a command, which tells the next step for the interpreter inside pcre_exec(). The JIT compiler translates MIR to machine code when the appropriate flags are passed to pcre_study(). The returned pcre_extra data contains a pointer to a machine executable function if the machine code generation was successful.

Why is it faster?

JIT compilers totally eliminate the continual reparsing of MIR. Even if the MIR code is much simpler than the original pattern string, the execution engine is full of ifs and switches, and executing them consumes considerable time. The compiled machine code only contains those machine instructions whose are absolutely necessary for this particular pattern, and nothing more.

This advantage is also a limitation since JIT compiler does not support the changing of compile time flags. One such flag is the newline type (PCRE supports many of them). PCRE allows overwriting the compile time newline flags when pcre_exec() is called. This is not supported by the JIT compiler, and the regular expression will be fallbacks to the interpreter.

Compile time overhead

Deciding when to use or not use JIT compiling is an important question. Since JIT is a heavyweight optimization we should never forget about the compile time overhead. Thus, the total runtime can be bigger if we use the compiled expression only once on a small input.

The following values measured on an Intel 2.67GHz with GCC 4.4.5 in 64 bit mode
Note: ns means nanosecond (10^-9), and Int. means interpreter (pcre_exec())

Pattern	Compile time (ns)	JIT compile time (ns)	JIT compile / compile time
abc	264	7932	30.05
a+(b*)[c\|d]??e+?	925	17186	18.58
\A.*\K[^a-z!]{8,}\Z	925	11501	12.43
\b((.)\w{3,}\2{0,3})$	1057	21747	20.57
(?P<cap>\p{N}+\|(?(R)a\|(?R)))\1	1322	30869	23.35
^(?![a-zA-Z]*?[[:alpha:]]++)\P{L&}	1520	20425	13.44

Results

The following patterns are searched on a non-utf8 input and an utf8 input. The utf8 input was shorter so it was repeated 4 times (with a simple memcpy()).

The performace is measured on an Intel 2.67GHz with GCC 4.4.5 in both 32 and 64 bit mode. The unit of the runtime values are milliseconds (ms) which is 10^-3 second. The PCRE library was compiled with -O3 and -static compiler options to get the best performance form the compiler.

Description of the colums

Pattern: the pattern string. The light blue lines are caseless, the light yellow lines are caseful matches.
Int. (ms): interpreter (normal pcre_exec(...)) runtime in milliseconds.
JIT (ms): JIT compiled machine code runtime in milliseconds.
Runtime ratio: Interpreter runtime / JIT runtime.
Runtime save %: Runtime saved by the JIT compiler. Equals to 1 - (JIT runtime / Interpreter runtime)

Pattern	Int. (ms)	JIT (ms)	Runtime ratio	Runtime save %
how to	60	40	1.50	33.3%
^how to	150	20	7.50	86.7%
how( to\|ever)	60	30	2.00	50.0%
walk\|get\|she\|make	240	70	3.43	70.8%
[Ww]alk\|[Gg]et\|[Ss]he\|[Mm]ake	250	90	2.78	64.0%
(?:ba\|lu\|ro)+(?:r\|ck)*	210	70	3.00	66.7%
(?:ba\|lu\|ro)+(r\|ck)*	210	70	3.00	66.7%
\w+our\w*	1680	410	4.10	75.6%
(?:^\w+$)\|\b(\W)\1+\b	1630	310	5.26	81.0%
"[A-Z][a-z](?:(?:[,\s]\|\R)(\s\|\R)[a-z]+){0,4}?[.!]"	90	40	2.25	55.6%
\b(?:(?=\w+ro\w+)(?=\w+th\w+)\w+hr\w+)\b	1200	300	4.00	75.0%
\b(?(?=\w+sh)\w+a\|\w+uo)\w+\b	1870	370	5.05	80.2%
\b\WX{0,3}(?:(?:V?I{1,3})\|V\|IV\|IX)\W\b	820	180	4.56	78.0%
((\w)+?)\.\b(\s)*?\w	4380	540	8.11	87.7%
(.{1,3})(\w*?\1)(?2)	6060	1530	3.96	74.8%
-?-?-?-?-?-?-?-?-?-?-?-----------$	600	100	6.00	83.3%

Average:	1219	260	4.16	70.6%

Non-utf8 input on a 32 bit x86 machine.

Pattern	Int. (ms)	JIT (ms)	Runtime ratio	Runtime save %
how to	60	40	1.50	33.3%
^how to	180	40	4.50	77.8%
how( to\|ever)	60	20	3.00	66.7%
walk\|get\|she\|make	300	80	3.75	73.3%
[Ww]alk\|[Gg]et\|[Ss]he\|[Mm]ake	280	100	2.80	64.3%
(?:ba\|lu\|ro)+(?:r\|ck)*	220	80	2.75	63.6%
(?:ba\|lu\|ro)+(r\|ck)*	240	80	3.00	66.7%
\w+our\w*	2020	360	5.61	82.2%
(?:^\w+$)\|\b(\W)\1+\b	2060	340	6.06	83.5%
"[A-Z][a-z](?:(?:[,\s]\|\R)(\s\|\R)[a-z]+){0,4}?[.!]"	80	40	2.00	50.0%
\b(?:(?=\w+ro\w+)(?=\w+th\w+)\w+hr\w+)\b	1340	300	4.47	77.6%
\b(?(?=\w+sh)\w+a\|\w+uo)\w+\b	2060	360	5.72	82.5%
\b\WX{0,3}(?:(?:V?I{1,3})\|V\|IV\|IX)\W\b	940	160	5.88	83.0%
((\w)+?)\.\b(\s)*?\w	5660	520	10.88	90.8%
(.{1,3})(\w*?\1)(?2)	7200	1620	4.44	77.5%
-?-?-?-?-?-?-?-?-?-?-?-----------$	680	100	6.80	85.3%

Average:	1461	265	4.57	72.4%

Non-utf8 input on a 64 bit x86 machine.

Pattern	Int. (ms)	JIT (ms)	Runtime ratio	Runtime save %
die der	10	10	1.00	0.0%
ist\|der\|die\|und	100	20	5.00	80.0%
\b\w+\b	220	140	1.57	36.4%
(?:da\|ge\|om)+(?:n\|me)*	60	20	3.00	66.7%
\b(?(?=\w+ro)\w+pa\|\w+lle)\w+\b	560	300	1.87	46.4%
\b(\W)\1+\b\|(^(?=.kl)(?=.no).{15,40}$)	680	200	3.40	70.6%
^.{4,32}(\P{N})\1{2,}.{4,32}(?<![nuk])$	240	60	4.00	75.0%
^(\w{3,})(?!\1).\h.\1$	5000	2180	2.29	56.4%
((\w{2,8},?(\P{Z}\|\R)){1,2}\.\s?)$	4320	1380	3.13	68.1%
\b\w?((.){1,3}\w\2)\w*?(?1)	1720	700	2.46	59.3%
\w?(b{2,3})\w?c	1120	400	2.80	64.3%
\P{Lu}\P{L&}{0,12}[\s\-]{1,4}..[\P{L}\P{N}]{4}	300	100	3.00	66.7%
\b(\B([c-h])\B\|[a-z]+?(?1)[a-z])	900	280	3.21	68.9%

Average:	1172	446	2.83	58.4%

Utf8 input on a 32 bit x86 machine.

Pattern	Int. (ms)	JIT (ms)	Runtime ratio	Runtime save %
die der	10	10	1.00	0.0%
ist\|der\|die\|und	100	30	3.33	70.0%
\b\w+\b	220	110	2.00	50.0%
(?:da\|ge\|om)+(?:n\|me)*	80	20	4.00	75.0%
\b(?(?=\w+ro)\w+pa\|\w+lle)\w+\b	540	230	2.35	57.4%
\b(\W)\1+\b\|(^(?=.kl)(?=.no).{15,40}$)	640	170	3.76	73.4%
^.{4,32}(\P{N})\1{2,}.{4,32}(?<![nuk])$	200	50	4.00	75.0%
^(\w{3,})(?!\1).\h.\1$	4140	1710	2.42	58.7%
((\w{2,8},?(\P{Z}\|\R)){1,2}\.\s?)$	4090	1080	3.79	73.6%
\b\w?((.){1,3}\w\2)\w*?(?1)	1490	550	2.71	63.1%
\w?(b{2,3})\w?c	1130	290	3.90	74.3%
\P{Lu}\P{L&}{0,12}[\s\-]{1,4}..[\P{L}\P{N}]{4}	270	80	3.38	70.4%
\b(\B([c-h])\B\|[a-z]+?(?1)[a-z])	750	240	3.12	68.0%

Average:	1050	351	3.06	62.2%

Utf8 input on a 64 bit x86 machine.

Conclusion

JIT compiling is a powerful technology, which is able to speed up interpreted execution including Java, JavaScript, ActionScript (with NanoJIT) or regular expression engines.

Contribution

I would like to encourage all of you to join the PCRE Performance Project, and make PCRE faster than ever. My email address is hzmester (at) freemail (dot) hu.

Special thanks

First and foremost, the author thanks Philip Hazel for his continual help and support.

Sorted as familiy name in alphabetic order:

Giuseppe D'Angelo	-	QRegularExpression in Qt 5 and above.
Victor Julien	-	Suricata project (http://www.openinfosecfoundation.org/).
Sheri Pierce	-	Testing and feedback.
Petr Pisar

Last modification: 09.07.2014