Hacker News
You Cannot Parse HTML with Regular Expressions (stackoverflow.com)
30 points by johnnyg on April 18, 2010 | hide | past | favorite | 7 comments


"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." Jamie Zawinski 1997.


No one should ever use regular expressions because there is a cute quote that says not to.


The problem is that regular expressions (in addition to not being powerful enough to properly handle HTML) are a bit of a maintenance headache. They look like line noise. Most regular expressions don't behave properly across locales (did you use \w, [a-zA-Z], or \p{Alpha}, and which of these did you really mean?). And you can't always be sure the regexp is implemented efficiently (by compiling to a finite automaton instead of backtracking search; see http://swtch.com/~rsc/regexp/regexp1.html ). Often a parser generator (for which a lexer, which is exactly as powerful as regular expressions, is usually the first step) is a much better solution.
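A small sketch of the \w vs [a-zA-Z] vs \p{Alpha} ambiguity mentioned above, using Python 3's re module (the sample string is hypothetical; Python's \w is Unicode-aware by default and has no \p{Alpha}, so re.ASCII stands in for the ASCII-only reading):

```python
import re

word = "café"

# In Python 3, \w matches Unicode word characters by default:
assert re.fullmatch(r"\w+", word) is not None

# [a-zA-Z] is ASCII-only, so the accented character breaks the match:
assert re.fullmatch(r"[a-zA-Z]+", word) is None

# The re.ASCII flag restricts \w to [a-zA-Z0-9_], silently changing the result:
assert re.fullmatch(r"\w+", word, flags=re.ASCII) is None
```

Three plausible spellings of "a word", three different answers; which one the author really meant is exactly the maintenance question raised above.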


Also, because regular expressions are sometimes simply not the right tool for the job.

("No one should ever use regular expressions" is hyperbolic and wrong.)


"Some people, when confronted with a problem, think 'I know, I'll quote Jamie Zawinski.' Now they have two problems."


Is it right that BeautifulSoup was (is?) implemented in terms of Python regular expressions?


Regular expressions can be (and are) used to tokenize the code, but cannot do the actual parsing.

What "REs can't parse HTML" means in theory is that you cannot design a regexp that tells HTML apart from non-HTML.

The fundamental reason is that regexps cannot detect arbitrarily-deep nested structures.

In practice it is possible, because most regexp engines support non-regular features (backreferences, recursive patterns) that go well beyond regular languages, but it would be crazy to do so.



