Comparison of regular expression engine types

This page is under development, I plan to add more information in the future.

Introduction

Matching regular expressions is not considered a complex task and many people think that this problem was solved a long time ago. The reason of this belief is a great deal of misinformation which has been spread across the internet and the education. On the contrary there is no best solution for matching a pattern at the moment, and choosing the right regular expression engine is as difficult as choosing the right programming language. This page aims for showing the advantages and disadvantages of different engine types.

In general we have two engine types:
  • Performance oriented engines: very fast in general, and very slow in special (called pathological) cases.
  • Balanced engines: no weaknesses, no strengths. Usually predictable worst case runtime.
Balanced engines are slower than performance oriented engines, but adding some pathological patterns to the benchmark set can quickly turn the tide.

From technical point of view, we have several engine types:

Multiple choice engine with backtracking

Type: Usually performance oriented engines
Advantages:
  • Large feature set (assertions, conditional blocks, executing code blocks, backtracking control).
  • Submatch capture
  • Lots of optimization opportunities (mostly backtracking elimination).
Disadvantages:
  • Large and complex code base
  • May use a large amount of stack
  • Examples for pathological cases:
    • /(a*)*b/
    • /(?:a?){N}a{N}/ (N is a positive integer number)
Execution modes: Depth-first search algorithm
  • Interpreted execution of Nondeterministic Finite Automaton (NFA).
    Example: PCRE interpreter.
  • Generating machine code from NFA.
    Example: Irregexp engine.
  • Generating machine code from Abstract Syntax Tree (AST).
    Example: PCRE JIT compiler.

Single choice engine with pre-generated state machine

Type: Usually performance oriented engines
Advantages:
  • Simple and very fast matching algorithm
  • Partial matching support is easy
  • Multiple patterns can be matched at the same time (on a single core)
Disadvantages:
  • Large memory consumption (can be limited at the expense of performance)
  • Limited feature set (and this feature set is not fully supported, sometimes requires fallback to other processing modes)
  • Examples for pathological cases:
    • /a[^b]{N}a/ (N is a positive integer number)
    • /a[^b]{1,N}abcde/ (N is a positive integer number)
Execution modes: Linear search time, exponential state machine build time (especially for combining multiple patterns into a single state machine)
  • Following the state transitions of a Deterministic Finite Automaton (DFA)
    Example: RE2 engine.
  • DFA based Multi Pattern Matching
    Example: MPM library

Single choice engine with state tracking

Type: Usually balanced engines
Advantages:
  • No pathological cases
  • Partial matching support is not difficult
Disadvantages:
  • Matching speed is generally low due to the complex state update mechanism
  • Limited feature set (due to syncronization issues, they have a similar feature set as the engines with pre-generated state machine)
Execution modes: Linear search time
  • Interpreted execution of Deterministic Finite Automaton (DFA)
    Example: PCRE-DFA interpreter
  • Parallel matching of NFA
    Example: TRE engine

A performance comparison of some engines can be found here.
Last modification: 22.8.2013