Performance comparison of regular expression engines

Introduction

Processing text or raw byte-sequence are among the common tasks performed by most software tools. These tasks usually involve pattern matching algorithms, and the most popular tool for such purpose is regular expressions. Regular expressions has been evolved a lot since Kleene defined the regular sets in the 1950's. Today we have several widely used regular expression engines which have different features which makes any performance comparison a difficult task, since a faster engine is not necessary better. Depending on the use case it might be enough to know whether a POSIX compatible regular expression matches to a line, even the position of the match is unneeded (grep utility). On the contrary other use cases require the position of capturing brackets, unicode support, conditional and atomic block (handling a byte sequence as a single character, like 'sch' in German language) support just to name a few. The former case needs a less sophisticated algorithm, which is likely be much faster than the latter, but again, that does not mean the former is better. More about these engine types can be found here.

Participants

The following popular engines were choosen: Before anyone jump to any conclusions, I should note the followings:
  • The engines were not fine tuned (because of my lack of knowledge about their internal workings). I just compiled them with the default options. I know enabling or disabling some features can heavily affect the results. If you feel that you have a better configuration just drop me an e-mail and I will update the results (hzmester(at)freemail(dot)hu).
  • The regular expression engines are compiled with -O3 to allow the best performance.
  • This comparison page was inspired by the work of John Maddock (See his own regex comparison here). The input is also the same he used before: mtent12.zip. It is a text file (e-book) which size is about 20 Mbytes.
  • Only common patterns are selected, they are not pathological cases nor have any PERL specific features. The comparison was caseful.

Results

x86-64 bit mode on an Intel Xeon 2.67GHz (GCC 4.8.2, Linux)

Regular expressionPCREPCRE
-DFA
TREOnig-
uruma
RE2PCRE
-JIT
Twain5 ms20 ms486 ms16 ms3 ms16 ms
(?i)Twain79 ms124 ms598 ms160 ms73 ms16 ms
[a-z]shing564 ms997 ms724 ms15 ms113 ms14 ms
Huck[a-zA-Z]+|Saw[a-zA-Z]+30 ms32 ms711 ms43 ms59 ms3 ms
\b\w+nn\b837 ms1453 ms1186 ms1083 ms59 ms113 ms
[a-q][^u-z]{13}x746 ms2741 ms1928 ms61 ms3512 ms2 ms
Tom|Sawyer|Huckleberry|Finn40 ms42 ms1242 ms51 ms61 ms29 ms
(?i)Tom|Sawyer|Huckleberry|Finn424 ms489 ms1887 ms820 ms98 ms94 ms
.{0,2}(Tom|Sawyer|Huckleberry|Finn)5164 ms5742 ms3425 ms127 ms66 ms443 ms
.{2,4}(Tom|Sawyer|Huckleberry|Finn)5298 ms7097 ms5248 ms124 ms66 ms487 ms
Tom.{10,25}river|river.{10,25}Tom83 ms108 ms804 ms90 ms68 ms18 ms
[a-zA-Z]+ing1373 ms2224 ms954 ms1420 ms129 ms92 ms
\s[a-zA-Z]{0,12}ing\s592 ms1007 ms1329 ms78 ms82 ms140 ms
([A-Za-z]awyer|[A-Za-z]inn)\s1112 ms1489 ms1313 ms243 ms111 ms46 ms
["'][^"']{0,30}[?!\.]["']65 ms114 ms687 ms91 ms63 ms15 ms

x86-32 bit mode on an Intel Xeon 2.67GHz (GCC 4.8.2, Linux)

Regular expressionPCREPCRE
-DFA
TREOnig-
uruma
RE2PCRE
-JIT
Twain5 ms20 ms548 ms17 ms4 ms18 ms
(?i)Twain97 ms144 ms683 ms139 ms74 ms16 ms
[a-z]shing594 ms996 ms766 ms15 ms107 ms14 ms
Huck[a-zA-Z]+|Saw[a-zA-Z]+26 ms27 ms822 ms49 ms60 ms3 ms
\b\w+nn\b978 ms1494 ms1205 ms1151 ms59 ms114 ms
[a-q][^u-z]{13}x770 ms3112 ms2080 ms61 ms780 ms2 ms
Tom|Sawyer|Huckleberry|Finn38 ms37 ms1346 ms58 ms61 ms29 ms
(?i)Tom|Sawyer|Huckleberry|Finn464 ms481 ms1853 ms663 ms93 ms93 ms
.{0,2}(Tom|Sawyer|Huckleberry|Finn)6726 ms5636 ms3641 ms139 ms70 ms406 ms
.{2,4}(Tom|Sawyer|Huckleberry|Finn)7107 ms7252 ms5700 ms136 ms70 ms434 ms
Tom.{10,25}river|river.{10,25}Tom85 ms105 ms904 ms106 ms69 ms18 ms
[a-zA-Z]+ing1740 ms2369 ms892 ms1387 ms144 ms90 ms
\s[a-zA-Z]{0,12}ing\s708 ms968 ms1271 ms98 ms97 ms167 ms
([A-Za-z]awyer|[A-Za-z]inn)\s1300 ms1479 ms1583 ms255 ms104 ms45 ms
["'][^"']{0,30}[?!\.]["']73 ms118 ms765 ms100 ms65 ms14 ms

x86-64 bit mode on an Intel Core i5 3.2GHz (GCC 4.4.7, Windows)

Regular expressionPCREPCRE
-DFA
TREOnig-
uruma
RE2PCRE
-JIT
Twain7 ms12 ms315 ms20 ms8 ms16 ms
(?i)Twain47 ms90 ms414 ms110 ms131 ms16 ms
[a-z]shing330 ms704 ms485 ms19 ms131 ms15 ms
Huck[a-zA-Z]+|Saw[a-zA-Z]+20 ms22 ms518 ms38 ms130 ms2 ms
\b\w+nn\b521 ms987 ms1109 ms782 ms131 ms85 ms
[a-q][^u-z]{13}x428 ms1852 ms1168 ms43 ms3316 ms1 ms
Tom|Sawyer|Huckleberry|Finn25 ms29 ms863 ms44 ms132 ms23 ms
(?i)Tom|Sawyer|Huckleberry|Finn239 ms358 ms1310 ms424 ms132 ms68 ms
.{0,2}(Tom|Sawyer|Huckleberry|Finn)2970 ms3583 ms2170 ms85 ms133 ms268 ms
.{2,4}(Tom|Sawyer|Huckleberry|Finn)2965 ms4134 ms3298 ms84 ms136 ms296 ms
Tom.{10,25}river|river.{10,25}Tom54 ms81 ms655 ms80 ms138 ms14 ms
[a-zA-Z]+ing773 ms1686 ms567 ms922 ms181 ms70 ms
\s[a-zA-Z]{0,12}ing\s338 ms685 ms864 ms76 ms167 ms96 ms
([A-Za-z]awyer|[A-Za-z]inn)\s664 ms1087 ms875 ms182 ms139 ms35 ms
["'][^"']{0,30}[?!\.]["']44 ms78 ms456 ms79 ms142 ms10 ms

The testing environment can be downloaded from here. Just extract, type make. Download the mtent12.zip, rename the mtent12.txt inside to mark.txt and run runtest.

Last modification: 23.8.2015