
I suppose so, but for human languages it's much more difficult, since assigning a type to a token is context dependent.


And there are unknown words (words that are not covered by the lexicon).

But the top poster is right: the linked website does part-of-speech tagging, not parsing.

Providing a wide-coverage parser for the web is still hard. The number of possible parses for long sentences is enormous. Even if a sentence is not ambiguous to us, grammars allow all kinds of ambiguities.
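
To get a feel for how fast that blows up: just counting binary bracketings over a flat string of constituents already gives you the Catalan numbers. A toy Python illustration, not tied to any particular grammar:

    from math import comb

    def catalan(n):
        # number of binary parse trees over n+1 constituents
        return comb(2 * n, n) // (n + 1)

    for n in (5, 10, 15, 20):
        print(n + 1, "constituents:", catalan(n), "bracketings")
    # 6 constituents: 42
    # 11 constituents: 16796
    # 16 constituents: 9694845
    # 21 constituents: 6564120420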

There is a web demo for a Dutch system developed in our research group[1], but it uses heap-size and time limits to exit gracefully if parsing takes too much time or memory (and sentences with more than 20 words are ignored, since for those you really want to parse offline).

[1] http://www.let.rug.nl/~vannoord/bin/alpino


Statistical methods do quite a good job of pushing that ambiguity back under the rug (see the link to the Berkeley parser further up, or the work on statistical disambiguation in Alpino).
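
To make that concrete: a disambiguation component basically scores each candidate parse and keeps the best one. A toy log-linear sketch (features and weights here are invented; real models such as Alpino's are trained on treebanks):

    def score(parse, weights):
        # weighted sum of the parse's feature values
        return sum(weights.get(f, 0.0) * v for f, v in parse["features"].items())

    def best_parse(parses, weights):
        return max(parses, key=lambda p: score(p, weights))

    weights = {"pp_attach_verb": 0.8, "pp_attach_noun": -0.3}
    parses = [
        {"tree": "attach PP to verb", "features": {"pp_attach_verb": 1}},
        {"tree": "attach PP to noun", "features": {"pp_attach_noun": 1}},
    ]
    print(best_parse(parses, weights)["tree"])   # -> attach PP to verb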

As for time consumption and complexity, that's a known problem of unification grammars (or any formalism that is even slightly more expressive) - but see this paper by Matsuzaki et al. for efficient techniques that speed this up: http://www-tsujii.is.s.u-tokyo.ac.jp/~matuzaki/paper/ijcai20...


True, but you still have to build up a forest from which every parse can be extracted (as Alpino does). Of course, packing does reduce the cost of ambiguity. E.g., I implemented packing in the Alpino chart generator, and there are very many edges with the same 'semantics' that can be packed, especially since Alpino allows a lot of optional punctuation and there are often roots plus subcategorization frames that yield multiple inflections. With a beam search, unpacking becomes a lot more efficient.
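
Roughly, packing means that edges over the same span with the same signature are stored once, with their alternative derivations recorded for later unpacking. A minimal sketch (the data layout is invented for illustration, not Alpino's actual representation):

    class PackedEdge:
        def __init__(self, span, signature):
            self.span = span                # (start, end)
            self.signature = signature      # e.g. category + semantics
            self.derivations = []           # alternative ways to build this edge

    def pack(edges):
        # edges: iterable of (span, signature, derivation) triples
        chart = {}
        for span, signature, derivation in edges:
            key = (span, signature)
            if key not in chart:
                chart[key] = PackedEdge(span, signature)
            chart[key].derivations.append(derivation)
        return chart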

There are also many other possible optimizations. E.g., with a left-corner parser you can exclude left-corner spines that are unlikely to lead to a probable parse (Van Noord, Learning Efficient Parsing, 2009, and other work). And, of course, lexical ambiguity can be reduced by restricting frames with a part-of-speech tagger.
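
The frame restriction is easy to picture: only keep lexical entries whose coarse category agrees with what the tagger predicts for that token. A made-up sketch (lexicon, tag set and back-off are invented for illustration):

    lexicon = {
        "walks": [("verb", "intransitive"), ("noun", "plural")],
        "park":  [("noun", "common"), ("verb", "transitive")],
    }

    def restrict_frames(tokens, predicted_tags):
        pruned = []
        for token, tag in zip(tokens, predicted_tags):
            frames = [f for f in lexicon.get(token, []) if f[0] == tag]
            # back off to the full set if the tagger's choice has no frames
            pruned.append(frames or lexicon.get(token, []))
        return pruned

    print(restrict_frames(["walks", "park"], ["verb", "noun"]))
    # [[('verb', 'intransitive')], [('noun', 'common')]]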

Still, even in this optimized, 1-best scenario, real-time parsing of long sentences is hard. So when parsing large corpora we usually apply time and space limits (which is easy to do in SICStus Prolog, with good recovery).
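
Not SICStus Prolog, but the same idea in Python for the curious: run each parse in a separate process and give up cleanly when it exceeds a time budget (the parser call and the limit here are stand-ins, not how Alpino actually does it):

    import multiprocessing as mp

    def expensive_parse(sentence):
        # stand-in for the real parser call
        return {"tokens": sentence.split()}

    def worker(sentence, out):
        out.put(("ok", expensive_parse(sentence)))

    def parse_with_timeout(sentence, seconds=30):
        out = mp.Queue()
        proc = mp.Process(target=worker, args=(sentence, out))
        proc.start()
        proc.join(seconds)
        if proc.is_alive():
            proc.terminate()   # recover and move on to the next sentence
            return ("timeout", None)
        return out.get()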

Thanks for the link to the paper!



