Hacker News
You Cannot Parse HTML with Regular Expressions (stackoverflow.com)
30 points by johnnyg on April 18, 2010 | hide | past | favorite | 7 comments


"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." Jamie Zawinski 1997.


No one should ever use regular expressions because there is a cute quote that says not to.


The problem is that regular expressions (in addition to not being powerful enough to properly handle HTML) are a bit of a maintenance headache. They look like line noise. Most regular expressions don't behave properly across locales (did you use \w, [a-zA-Z], or \p{Alpha}, and which of these did you really mean?). And you can't always be sure the regexp is implemented efficiently (by compiling to a finite automaton instead of backtracking search; see http://swtch.com/~rsc/regexp/regexp1.html ). Often a parser generator (for which a lexer, which is exactly as powerful as regular expressions, is usually the first step) is a much better solution.
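A small sketch of the \w vs [a-zA-Z] vs \p{Alpha} ambiguity mentioned above, using Python 3's re module (the sample string is hypothetical; Python's \w is Unicode-aware by default and has no \p{Alpha}, so re.ASCII stands in for the ASCII-only reading):

```python
import re

word = "café"

# In Python 3, \w matches Unicode word characters by default:
assert re.fullmatch(r"\w+", word) is not None

# [a-zA-Z] is ASCII-only, so the accented character breaks the match:
assert re.fullmatch(r"[a-zA-Z]+", word) is None

# The re.ASCII flag restricts \w to [a-zA-Z0-9_], silently changing the result:
assert re.fullmatch(r"\w+", word, flags=re.ASCII) is None
```

Three plausible spellings of "a word", three different answers; which one the author really meant is exactly the maintenance question raised above.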


Also, because regular expressions are sometimes simply not the right tool for the job.

("No one should ever use regular expressions" is hyperbolic and wrong.)


"Some people, when confronted with a problem, think 'I know, I'll quote Jamie Zawinski.' Now they have two problems."


Is it right that BeautifulSoup was (is?) implemented in terms of Python regular expressions?


Regular expressions can be (and are) used to tokenize the code, but cannot do the actual parsing.

What "REs can't parse HTML" means in theory is that you cannot design a regexp that tells HTML apart from non-HTML.

The fundamental reason is that regexps cannot detect arbitrarily-deep nested structures.

In practice it is possible, because most regexp engines support non-regular features (backreferences, recursive patterns) that go well beyond regular languages, but it would be crazy to do so.



