|
Comparison of regular expression engine types
This page is under development, I plan to add more information in the future.
Introduction
Matching regular expressions is not considered a complex task and many people think
that this problem was solved a long time ago. The reason of this belief is a great deal
of misinformation which has been spread across the internet and the education. On the
contrary there is no best solution for matching a pattern at the moment, and choosing
the right regular expression engine is as difficult as choosing the right programming
language. This page aims for showing the advantages and disadvantages of different
engine types.
In general we have two engine types:
- Performance oriented engines: very fast in general, and very slow in special
(called pathological) cases.
- Balanced engines: no weaknesses, no strengths. Usually predictable worst case runtime.
Balanced engines are slower than performance oriented engines, but adding some
pathological patterns to the benchmark set can quickly turn the tide.
From technical point of view, we have several engine types:
Multiple choice engine with backtracking
Type:
|
Usually performance oriented engines
|
Advantages:
|
- Large feature set (assertions, conditional blocks, executing code blocks,
backtracking control).
- Submatch capture
- Lots of optimization opportunities (mostly backtracking
elimination).
|
Disadvantages:
|
- Large and complex code base
- May use a large amount of stack
- Examples for pathological cases:
- /(a*)*b/
- /(?:a?){N}a{N}/ (N is a positive integer number)
|
Execution modes:
|
Depth-first search algorithm
- Interpreted execution of Nondeterministic Finite Automaton (NFA).
Example: PCRE interpreter.
- Generating machine code from NFA.
Example: Irregexp engine.
- Generating machine code from Abstract Syntax Tree (AST).
Example: PCRE JIT compiler.
|
Single choice engine with pre-generated state machine
Type:
|
Usually performance oriented engines
|
Advantages:
|
- Simple and very fast matching algorithm
- Partial matching support is easy
- Multiple patterns can be matched at the same time (on a single core)
|
Disadvantages:
|
- Large memory consumption (can be limited at the expense of performance)
- Limited feature set (and this feature set is not fully supported, sometimes
requires fallback to other processing modes)
- Examples for pathological cases:
- /a[^b]{N}a/ (N is a positive integer number)
- /a[^b]{1,N}abcde/ (N is a positive integer number)
|
Execution modes:
|
Linear search time, exponential state machine build time (especially for combining multiple
patterns into a single state machine)
- Following the state transitions of a Deterministic Finite Automaton (DFA)
Example: RE2 engine.
- DFA based Multi Pattern Matching
Example: MPM library
|
Single choice engine with state tracking
Type:
|
Usually balanced engines
|
Advantages:
|
- No pathological cases
- Partial matching support is not difficult
|
Disadvantages:
|
- Matching speed is generally low due to the complex state update mechanism
- Limited feature set (due to syncronization issues, they have a similar
feature set as the engines with pre-generated state machine)
|
Execution modes:
|
Linear search time
- Interpreted execution of Deterministic Finite Automaton (DFA)
Example: PCRE-DFA interpreter
- Parallel matching of NFA
Example: TRE engine
|
A performance comparison of some engines can be found here.
Last modification: 22.8.2013 |
| |