Lenny Domnitser’s
domnit.org

This is a static archive of the domnit.org blog,
which Lenny Domnitser wrote between 2006 and 2009.

Stepic 0.2, Project Page

Stepic—the Python image steganography tool and library—has a release that actually, you know, works: 0.2. Info and downloads are available on the brand-new, as yet sparse, project page for Stepic.

Americanisms

It’s old news, but I just learned that in British English, through most of the twentieth century and before, numbers had different names from American English. I assumed that the prefix to a -llion number was always derived from (log₁₀ n ÷ 3) − 1, or the number of thousands of thousands, but Brits used the simpler log₁₀ n ÷ 6, or just the number of millions. A million is still a million in both systems, but, for example, an American septillion is a traditional British quadrillion, both of which equal 10²⁴.
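A quick Python sanity check of both formulas, for anyone who wants to play along:

    # Check the two number-naming formulas for powers of ten.
    from math import log10

    def short_scale_prefix(n):
        # American: prefix = (log10 n / 3) - 1
        return log10(n) / 3 - 1

    def long_scale_prefix(n):
        # traditional British: prefix = log10 n / 6
        return log10(n) / 6

    n = 10 ** 24
    print(short_scale_prefix(n))  # 7.0 -> sept-, an American septillion
    print(long_scale_prefix(n))   # 4.0 -> quad-, a British quadrillion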

The American “short scale” has finally become the standard for English, but other countries and languages use the long scale.

We live in a world where it’s not even safe to spell numbers.

Quote from My Automata Prof

“If you’re thinking about how to do that with a context-free grammar, it should be pretty hard because it’s impossible.”

Captcha

To prove that you are clever, please type the word below.

Update: It’s just an idea. Typing the right word(s) does nothing.

Stepic Redux

Stepic is my Python steganography tool. It’s lacking a real homepage, but I introduced it here. Some new stuff:

For now, the -bzr and -0.1 versions are the same, but 0.1 will always stay the same, while the “nightly” release will always have the latest and greatest development code.

That’s all.

String Spin Art

String Spin is fun and beautiful.

Regular Expression Security: Don’t Accept Arbitrary Regexes

Most regular expression implementations (PCRE, Perl, Python, PHP, Ruby, Java, etc.) can very easily be made to hold up an application indefinitely. A website that allows a user to input regular expressions is vulnerable to a denial of service attack. This fact is public knowledge, but Googling didn’t bring up any straightforward “don’t touch the stove” language about this topic, so here it is. If you expose these regex implementations to the public, you will get burned.

Some background: regular expressions originate in automata theory, the math side of computer science. (I’m putting off my automata homework right now!) They are a way to describe machines that take in an input string, and answer whether that string matches the machine’s “language”. If this sounds familiar, it’s because that is what regular expressions are used for in practical programming. If you are validating or searching some input string against a regex, you are essentially asking, is this string in the language I’m describing?
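In Python terms, for instance, that question looks like this (a trivial illustration):

    # "Is this string in the language the pattern describes?"
    import re

    language = re.compile(r"(ab)*")            # the language {"", "ab", "abab", ...}
    print(bool(language.fullmatch("ababab")))  # True: in the language
    print(bool(language.fullmatch("abba")))    # False: not in the language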

Modern regexes, as used in programming, are not really regular expressions, according to the mathematical definition. Given a real regular expression, one can programmatically construct one of the machines I described, which, skipping the details, can be damn fast. Modern regexes, when using some of their more powerful features, do not necessarily correspond to these machines, and can be very slow. The crux of the difference is that the simple, fast method can test all possible matches in parallel, while the potentially slow method must try every case sequentially.
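To make the parallel-versus-sequential point concrete, here is a toy sketch (mine, not a real regex engine) of the fast method: carry the whole set of reachable automaton states across the input, so every potential match advances in lockstep.

    # Toy sketch of the fast method: advance a *set* of NFA states per
    # input character, so all potential matches proceed in parallel.
    # Epsilon transitions are omitted to keep the sketch short.

    def nfa_match(delta, start, accept, text):
        states = {start}
        for ch in text:
            states = {t for s in states for t in delta.get((s, ch), ())}
            if not states:
                return False  # no live match can continue
        return accept in states

    # Hand-built machine for the language of a(a|b)*b:
    delta = {
        (0, "a"): {1},
        (1, "a"): {1},
        (1, "b"): {1, 2},
    }
    print(nfa_match(delta, 0, 2, "aab"))  # True
    print(nfa_match(delta, 0, 2, "aba"))  # False

The set can never hold more than the machine’s total number of states, so the work per character is bounded; that is the whole trick.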

All of this was brought to my attention by a recent article, Regular Expression Matching Can Be Simple And Fast, by Russ Cox. It’s a long read, but worth the time if you wish to understand how, as he puts it, “good theory leads to good programs.”

There’s no sense in (poorly) repeating Cox, so read his article if you are interested in details. Here’s the takeaway—what’s wrong, what to do:

I’ll shamelessly rip off Cox’s example of a dangerous regular expression: matching ^(a?){n}a{n}$ against aⁿ (a repeated n times). Plug in any number for n, so this could mean matching ^(a?){10}a{10}$ against aaaaaaaaaa. This seems to work for 10, but for n = 23, matching already took over a second on my computer, using the slow method. (I’m using plain old GNU egrep for testing the fast method, and I did up a quick grep.py for testing the slow method.) n = 26 takes 99% CPU for over ten seconds. Cox estimates that for n = 100, the slow method would take a quadrillion years (cf. presumed age of the Universe). I decided not to try to test that, but I tried n = 100 with the fast method, and there was no noticeable delay. The delay for n = 400 was noticeable, but sub-second.
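A minimal grep.py along those lines, using Python’s backtracking re module (a stand-in; the original quick script may have differed):

    #!/usr/bin/env python
    # Minimal grep in Python: filter stdin lines through the
    # backtracking re module -- the "slow method" described above.
    import re
    import sys

    pattern = re.compile(sys.argv[1])
    for line in sys.stdin:
        if pattern.search(line):
            sys.stdout.write(line)

Piping a line of 23 a’s through it with the pattern ^(a?){23}a{23}$, and then through egrep with the same pattern, shows the gap directly.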

A couple quadrillion-year calculations could reasonably tie up an unprotected server. Some web servers automatically kill processes that last more than a few seconds, but why rely on some daemon? The application should ensure that it does not run that long. Two strategies arise:

  1. Use the fast method. You could try to find or implement the fast method for your language of choice. Or, you could use trusty old egrep. This will only help if you do not rely on advanced features like backreferences (things like (a|b)c\1).
  2. Stop execution after a second or so. I’m no expert at this, but the idea is to run the regex operation in a new process or thread, and kill it if it runs too long (see the sketch after this list). This means the calculation might not finish, but then again, it probably wouldn’t finish within a quadrillion years anyway, given the likelihood in that timeframe of a catastrophic collision with a comet, or of somebody tripping over the power cable.
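Here is how strategy 2 might look in Python with the multiprocessing module (a sketch, not hardened code; a process rather than a thread, because a thread stuck inside the regex engine generally can’t be killed cleanly):

    # Sketch of strategy 2: run the match in a separate process and
    # kill it if it blows past a time budget.
    import re
    from multiprocessing import Process, Queue

    def _search(pattern, text, queue):
        queue.put(bool(re.search(pattern, text)))

    def safe_search(pattern, text, timeout=1.0):
        re.compile(pattern)  # fail fast in the parent on a bad pattern
        queue = Queue()
        proc = Process(target=_search, args=(pattern, text, queue))
        proc.start()
        proc.join(timeout)
        if proc.is_alive():
            proc.terminate()  # ran too long; give up on it
            proc.join()
            return None       # "no answer", as opposed to "no match"
        return queue.get()

    if __name__ == "__main__":
        print(safe_search(r"^(a?){26}a{26}$", "a" * 26))  # None (killed)
        print(safe_search(r"^a{26}$", "a" * 26))          # True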

A third strategy would be to offload the regex processing to the client, but this is not always feasible. Also, push core features of a website into Javascript-land at your own peril.

Of course, the best advice is in the title: don’t accept arbitrary regexes. There are few scenarios where taking input of a regular expression is even useful. Most validation and data extraction from strings is done using regexes written by the programmer ahead of time. One scenario where regex input actually is very useful is search. Unfortunately, if you wish to provide this functionality, you can’t use the powerful fake/modern regular expressions, and you probably can’t use the standard regular expression library that comes with your language.

Cox is right that regular expression matching can be simple and fast, but realistically it often is not. Accepting arbitrary regexes should be considered dangerous unless you take care to fully understand and implement a safe solution. In the current state of libraries, safe is not the default.

Update: Marc Schoenefeld writes in with a presentation about Java security that includes exponential compile time for regexes.