The problem is that regular expressions (in addition to not being powerful enough to handle HTML properly) are a bit of a maintenance headache. They look like line noise. Most regular expressions don't adapt properly to the locale (did you use \w, [a-zA-Z] or \p{Alpha}, and which of these did you really mean?). And you can't always be sure the regexp engine is implemented efficiently (by compiling to a finite automaton instead of using a backtracking search; see http://swtch.com/~rsc/regexp/regexp1.html ). Often a parser generator is a much better solution; its first stage is usually a lexer, which is itself as powerful as regular expressions.
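
To make the character-class point concrete, here is a small sketch in Python (my choice of language, not something the question implies). The helper name `matches_whole` and the sample strings are made up for illustration. Python's standard `re` module does not support \p{Alpha}, so only the other two forms are compared.

```python
import re

def matches_whole(pattern, text, flags=0):
    """Return True if the entire string matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text, flags) is not None

# Illustrative inputs: plain ASCII, an accented letter, and letters mixed
# with an underscore and digits.
samples = ["cafe", "caf\u00e9", "abc_123"]

# Each entry is (label, pattern, flags).
variants = [
    (r"\w+ (default)", r"\w+", 0),
    (r"\w+ (re.ASCII)", r"\w+", re.ASCII),
    (r"[a-zA-Z]+", r"[a-zA-Z]+", 0),
]

for label, pattern, flags in variants:
    results = {s: matches_whole(pattern, s, flags) for s in samples}
    print(label, results)
```

On Python 3 string patterns, \w is Unicode-aware by default, so it accepts the accented word, and it also accepts digits and underscores. [a-zA-Z] rejects both of those inputs, and \w with re.ASCII rejects the accented word but still accepts "abc_123". Three spellings that look interchangeable give three different answers, which is exactly the "which of these did you really mean" problem.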