This is a static archive of the domnit.org blog,
which Lenny Domnitser wrote between 2006 and 2009.

Regular Expression Security: Don’t Accept Arbitrary Regexes

Most regular expression implementations (PCRE, Perl, Python, PHP, Ruby, Java, etc.) can very easily be made to hold up an application indefinitely. A website that allows a user to input regular expressions is vulnerable to a denial of service attack. This fact is public knowledge, but Googling didn’t bring up any straightforward “don’t touch the stove” language about this topic, so here it is. If you expose these regex implementations to the public, you will get burned.

Some background: regular expressions originate in automata theory, the math side of computer science. (I’m putting off my automata homework right now!) They are a way to describe machines that take in an input string and answer whether that string is in the machine’s “language”. If this sounds familiar, it’s because that is what regular expressions are used for in practical programming. If you are validating or searching some input string against a regex, you are essentially asking: is this string in the language I’m describing?
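For instance, the regular expression (b*ab*a)*b* describes the strings over the alphabet {a, b} containing an even number of a’s, and the corresponding machine is a two-state automaton that answers membership in one left-to-right pass. Here is a toy illustration of that machine (my own sketch, not from any library):

```python
def even_as(s):
    """A two-state DFA for the language of the regex (b*ab*a)*b*:
    strings over {a, b} with an even number of a's.
    One pass over the input, constant memory."""
    state = 0                  # 0 = even number of a's seen so far, 1 = odd
    for ch in s:
        if ch == 'a':
            state = 1 - state  # flip on every 'a'; 'b' never changes the state
    return state == 0          # accept iff we end in the even state
```

So even_as('abab') accepts and even_as('abb') rejects, in time proportional to the length of the input.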

Modern regexes, as used in programming, are not really regular expressions, according to the mathematical definition. Given a real regular expression, one can programmatically construct one of the machines I described, which, skipping the details, can be damn fast. Modern regexes, when using some of their more powerful features, do not necessarily correspond to these machines, and can be very slow. The crux of the difference is that the simple, fast method can test all possible matches in parallel, while the potentially slow method must try every case sequentially.
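To make the difference concrete, here is a toy matcher for the pattern family (a?){n}a{n}, written in both styles. This is an illustrative sketch of the two ideas, not how any production engine is implemented:

```python
def closure(i, n):
    """From pattern position i, any remaining optional a's (positions < n)
    may be skipped for free, so all positions up to n are reachable."""
    return set(range(i, n + 1)) if i <= n else {i}

def match_parallel(n, s):
    """The fast method: track the set of pattern positions reachable so far.
    Positions 0..n-1 are the optional a's, n..2n-1 the required a's, 2n accepts.
    Work is bounded by len(s) * n, no matter the input."""
    states = closure(0, n)
    for ch in s:
        nxt = set()
        for i in states:
            if ch == 'a' and i < 2 * n:
                nxt |= closure(i + 1, n)
        states = nxt
    return 2 * n in states

def match_backtrack(n, s, i=0, pos=0):
    """The slow method: try one path at a time. Each a? is a consume-or-skip
    choice, so matching 'a' * n can explore up to 2**n paths."""
    if i == 2 * n:                      # whole pattern used up
        return pos == len(s)
    if i < n:                           # optional a?: try consuming, then skipping
        if pos < len(s) and s[pos] == 'a' and match_backtrack(n, s, i + 1, pos + 1):
            return True
        return match_backtrack(n, s, i + 1, pos)
    # required a
    return pos < len(s) and s[pos] == 'a' and match_backtrack(n, s, i + 1, pos + 1)
```

Both agree on every input, but match_backtrack(25, 'a' * 25) will grind through tens of millions of paths while match_parallel(25, 'a' * 25) answers instantly.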

All of this was brought to my attention by a recent article, Regular Expression Matching Can Be Simple And Fast, by Russ Cox. It’s a long read, but worth the time if you wish to understand how, as he puts it, “good theory leads to good programs.”

There’s no sense in (poorly) repeating Cox, so read his article if you are interested in the details. Here’s the takeaway: what’s wrong, and what to do about it.

I’ll shamelessly rip off Cox’s example of a dangerous regular expression: matching ^(a?){n}a{n}$ against aⁿ (a repeated n times). Plug in any number for n, so this could mean matching ^(a?){10}a{10}$ against aaaaaaaaaa. This seems to work for 10, but for n = 23, matching already took over a second on my computer, using the slow method. (I’m using plain old GNU egrep for testing the fast method, and I whipped up a quick grep.py for testing the slow method.) n = 26 takes 99% CPU for over ten seconds. Cox estimates that for n = 100, the slow method would take a quadrillion years (cf. the presumed age of the Universe). I decided not to try to test that, but I tried n = 100 with the fast method, and there was no noticeable delay. The delay for n = 400 was noticeable, but sub-second.
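If you want to watch the blow-up yourself, Python’s standard re module uses a backtracking engine, so a few lines reproduce the slow method (my own sketch, with n kept small so it finishes promptly; absolute times will vary by machine, but each step up in n roughly doubles the work):

```python
import re
import time

# Time a backtracking engine on the pathological pattern (a?){n}a{n}.
for n in (5, 10, 15, 18):
    pattern = re.compile(r'(a?){%d}a{%d}$' % (n, n))
    start = time.perf_counter()
    matched = pattern.match('a' * n) is not None
    elapsed = time.perf_counter() - start
    print('n = %2d  matched = %s  %.4f s' % (n, matched, elapsed))
```

Push n into the mid-twenties and you will see the seconds-long stalls described above; push much further and you should be prepared to kill the process.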

A couple quadrillion-year calculations could reasonably tie up an unprotected server. Some web servers automatically kill processes that last more than a few seconds, but why rely on some daemon? The application should ensure that it does not run that long. Two strategies arise:

  1. Use the fast method. You could try to find or implement the fast method for your language of choice. Or, you could use trusty old egrep. This will only help if you do not rely on advanced features like backreferences (things like (a|b)c\1).
  2. Stop execution after a second or so. I’m no expert at this, but the idea is to run the regex operation in a new process or thread, and kill it if it runs too long. This means the calculation might not finish, but then again, it probably wouldn’t finish within a quadrillion years anyway, given the likelihood in that timeframe of a catastrophic collision with a comet, or of somebody tripping over the power cable.
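The second strategy can be sketched with Python’s standard multiprocessing module: run the match in a child process and kill it if it overruns. The function names and the timeout value here are my own, and I use the Unix-only fork start method to keep the sketch short; treat this as an outline, not a hardened implementation:

```python
import multiprocessing
import re

def _search(pattern, text, conn):
    # Child process: do the possibly-explosive match and report the result.
    conn.send(re.search(pattern, text) is not None)

def safe_search(pattern, text, timeout=1.0):
    """Return True/False if the search finishes within `timeout` seconds,
    or None if the child process had to be killed."""
    ctx = multiprocessing.get_context('fork')   # Unix-only; keeps the sketch simple
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=_search, args=(pattern, text, child_conn))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()   # the regex engine offers no cooperative cancellation
        proc.join()
        return None
    return parent_conn.recv() if parent_conn.poll() else None
```

With this in place, safe_search(r'(a?){30}a{30}$', 'a' * 30) costs at most a second of CPU plus the overhead of a fork, instead of hours.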

A third strategy would be to offload the regex processing to the client, but this is not always feasible. Also, push core features of a website into JavaScript-land at your own peril.

Of course, the best advice is in the title: don’t accept arbitrary regexes. There are few scenarios where taking a regular expression as input is even useful. Most jobs of validating or extracting data from strings are done using regexes written by the programmer ahead of time. One scenario where regex input actually is very useful is search. Unfortunately, if you wish to provide this functionality, you can’t use the powerful fake/modern regular expressions, and you probably can’t use the standard regular expression library that comes with your language.

Cox is right that regular expression matching can be simple and fast, but realistically it often is not. Accepting arbitrary regexes is to be considered dangerous unless care is taken to fully understand and implement a safe solution. In the current state of libraries, safe is not the default.

Update: Marc Schoenefeld writes in with a presentation about Java security that includes exponential compile time for regexes.