Natural language parsing for the web (naturalparsing.com)
56 points by adammichaelc on July 31, 2010 | hide | past | favorite | 31 comments


So I was out to dinner tonight and got an email from my VPS saying my disk IO was high. At first I thought someone might have hacked my server, but it turns out that someone posted the link to my site on HN!

I've been working on this website for the past couple of weeks. Please email me at andrew [AT] naturalparsing [DOT] com if you have any questions or suggestions, or want to use the API.

FYI, right now the API is not using all the capabilities of the Stanford Parser, just the word tagging part. More features will be implemented soon. Let me know if you have any specific requests.

Andrew


Have you seen: http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/postagger/

It's a POS tagger whose output is extremely close (read: identical) to the Stanford POS tagger.

It's also much faster. I was able to tag a corpus of 300K sentences with this in 15 minutes. With the Stanford POS tagger it took the entire weekend.

Sadly, this tool's license does not allow commercial use, and it is not released under the GPL.


I hope it was a pleasant reason to be having trouble :) Thanks and keep up the good work!


Parser spins and spins with no output for me. Have we already overloaded it?


I'm waiting for the day when fruit flies finally get recognition as a species and fruit stops flying ;-)


I cannot recommend this comment highly enough! I urge you to waste no time to read this comment over and over again.


It seems like it's just a part-of-speech tagger, rather than a parser.


Wouldn't that make it a lexer then?


I suppose so, but for human languages it's much more difficult, since assigning a type to a token is context-dependent.
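
To make that concrete, here's the classic example (NLTK comes up later in this thread; any decent tagger shows the same effect, though the exact tags vary by model):

    import nltk  # needs the tokenizer and tagger models downloaded once

    # The same surface form gets a different tag depending on context:
    print(nltk.pos_tag(nltk.word_tokenize("Time flies like an arrow.")))
    # e.g. ... ('flies', 'VBZ') ...  -- verb reading
    print(nltk.pos_tag(nltk.word_tokenize("Fruit flies like a banana.")))
    # e.g. ... ('flies', 'NNS') ...  -- plural-noun reading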


And there are unknown words (words that are not covered by the lexicon).

But the top-poster is right: the linked website does part of speech tagging, not parsing.

Providing a wide-coverage parser for the web is still hard. The number of possible parses for long sentences is enormous. Even if a sentence is not ambiguous to us, grammars allow all kinds of ambiguities.
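
A quick back-of-the-envelope for "enormous": just counting binary-branching bracketings (the Catalan numbers), before any grammar-internal ambiguity even enters the picture:

    from math import comb  # Python 3.8+

    def catalan(n):
        # number of distinct binary trees over n+1 leaves
        return comb(2 * n, n) // (n + 1)

    for words in (5, 10, 20, 40):
        print(words, "words:", catalan(words - 1), "bracketings")
    # 20 words already allow 1,767,263,190 bracketings -- and a real
    # grammar multiplies this with category and attachment choices.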

There is a web demo for a Dutch system that is developed in our research group[1], but heap-size and time limits are used to exit gracefully if parsing takes too much time or memory (and sentences with more than 20 words are rejected, since for those you really want to parse offline).

[1] http://www.let.rug.nl/~vannoord/bin/alpino


Statistical methods do quite a good job of pushing that ambiguity back under the rug (see link to Berkeley parser further up, or the work on statistical disambiguation in Alpino).

As to the time consumption and complexity, that's a known problem of unification grammars (or just any grammar that does a little bit more) - but see this paper by Matsuzaki et al for efficient techniques to speed this up: http://www-tsujii.is.s.u-tokyo.ac.jp/~matuzaki/paper/ijcai20...


True, but you still have to build up a forest from which every parse can be extracted (as Alpino does). Of course, packing does reduce the cost of ambiguity. E.g., I implemented packing in the Alpino chart generator, and there are very many edges with the same 'semantics' that can be packed, especially since Alpino allows a lot of optional punctuation, and roots plus subcategorization frames often yield multiple inflections. With a beam search, unpacking is a lot more efficient.

There are also many other possible optimizations. E.g., with a left-corner parser you can exclude left-corner spines that are unlikely to lead to a probable parse (Van Noord, Learning Efficient Parsing, 2009, and other work). And, of course, reduction of lexical ambiguity by restricting frames using a part-of-speech tagger.

Still, even in this optimized, 1-best scenario, real-time parsing of long sentences is hard. So, when parsing large corpora we usually apply time and space limits (which is easy to do in SICStus Prolog, with good recovery).
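
The recovery pattern itself is easy to reproduce outside Prolog. A rough Python sketch, where parse_sentence is a hypothetical stand-in for the real parser:

    from multiprocessing import Pool, TimeoutError

    def parse_sentence(sentence):
        ...  # stand-in for the real, potentially very expensive parser

    def parse_with_limit(sentence, seconds=30):
        # Run the parser in a worker process; give up gracefully if it
        # exceeds the time budget, as Alpino does.
        with Pool(processes=1) as pool:
            result = pool.apply_async(parse_sentence, (sentence,))
            try:
                return result.get(timeout=seconds)
            except TimeoutError:
                return None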

Thanks for the link to the paper!


Nope. Parts of speech of words are ambiguous, and you need some statistical method of disambiguation. Plus, you need something to guess the possible tags for unknown words (e.g. speakers of English are able to guess that 'supercalifragilistiexpialigetic' is an adjective even when they see it for the first time, because they can generalize from other words ending in -etic).
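
That kind of generalization is easy to sketch. Real taggers learn suffix statistics from the training corpus; the hand-picked list below is purely for illustration:

    # Hand-picked suffixes for illustration only; real taggers learn
    # these distributions from data.
    SUFFIX_TAGS = [("etic", "JJ"), ("ness", "NN"), ("ize", "VB"), ("ly", "RB")]

    def guess_tag(word, default="NN"):
        for suffix, tag in SUFFIX_TAGS:
            if word.lower().endswith(suffix):
                return tag
        return default

    print(guess_tag("supercalifragilistiexpialigetic"))  # JJ, by analogy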

Still, I wouldn't call part-of-speech tagging parsing.

Here's a parsing demo: http://nlp.cs.berkeley.edu/Main.html#parsing


The lexer is the bit that breaks up the string into tokens (words in this case).
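
In compiler terms, the lexer's whole job fits in one line; the hard part (assigning categories) hasn't even started yet:

    import re

    def tokenize(text):
        # words, plus single punctuation marks as their own tokens
        return re.findall(r"\w+|[^\w\s]", text)

    print(tokenize("Pride and Prejudice is a good book."))
    # ['Pride', 'and', 'Prejudice', 'is', 'a', 'good', 'book', '.']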


This is exactly the sort of thing that makes a trip to HN worthwhile! Very cool. Thanks to high-school Latin I was able to make a bit of sense of the table of parts of speech. Shame I haven't got a use for it ATM.


Darn, now I'm tempted to write a pure JavaScript POS tagger just to show that you don't really need anything server-side (or maybe just little bits here and there so the web page doesn't need to load a 20MB model right away - the computational effort, in any case, is not so bad that you couldn't do it in JS).

Hmm. Maybe sometime.

EDIT: The Stanford POS tagger is more complex and quite a bit slower than anything you'd do on your own. To quantify this, there are methods that are 10x as fast while sacrificing 0.05%-0.2% accuracy. (Or the easy ones that are 100x as fast but about 1% less accurate - these would be fun to do in JS.)
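
The "100x as fast" end of that trade-off is essentially a most-frequent-tag (unigram) tagger. A sketch, assuming any tagged corpus for the counts:

    from collections import Counter, defaultdict

    def train(tagged_sentences):
        # tagged_sentences: iterable of [(word, tag), ...] lists
        counts = defaultdict(Counter)
        for sentence in tagged_sentences:
            for word, tag in sentence:
                counts[word.lower()][tag] += 1
        # most frequent tag per word; context is ignored entirely
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag(model, words, default="NN"):
        return [(w, model.get(w.lower(), default)) for w in words]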


Bold statements like that need to be backed up with code!


I don't have a JS version, but here's a functioning HMM tagger in under 300 lines of Python: http://gist.github.com/503784

(model loading doesn't work yet for some reason, but you see what it's doing in principle).

This uses a smoothed trigram HMM, so it should, in principle, be a bit better than NLTK's HMM tagger, but not as good as serious POS-tagging packages (e.g. hunpos or the Stanford POS tagger).
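
The decoding core is small enough to show inline. Here's a bigram reduction of it (the gist itself uses smoothed trigrams); trans and emit are assumed to hold log-probabilities:

    def viterbi(words, tags, trans, emit, start="<s>"):
        # best[t]: log-prob of the best tag path ending in tag t
        best = {start: 0.0}
        back = []  # backpointers per position
        for w in words:
            scores, pointers = {}, {}
            for t in tags:
                prev = max(best, key=lambda p: best[p] + trans[p].get(t, -1e9))
                scores[t] = (best[prev] + trans[prev].get(t, -1e9)
                             + emit[t].get(w, -1e9))
                pointers[t] = prev
            best = scores
            back.append(pointers)
        # follow backpointers from the best final tag
        t = max(best, key=best.get)
        path = [t]
        for pointers in reversed(back[1:]):
            t = pointers[t]
            path.append(t)
        path.reverse()
        return list(zip(words, path))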


To me this looks like a classic case of where it makes sense to keep this kind of processing on the server and make it available as a service - what functional advantages would having a POS tagger locally give you (apart from offline access)?


Once the model is loaded, you avoid the ~200ms round-trip latency of contacting the server. Obviously, there are lots of ways to hide these 200ms from the user.


Sure, it is possible, but what does it add? It's not as if a hidden Markov model tagger takes that many server resources. Or, if you are really worried, you could build an unweighted finite-state transducer (Roche & Schabes, 1995) ;).


Interesting, but I wonder whether commercial use of the API requires this license:

http://otlportal.stanford.edu/techfinder/technology/ID=24472


From http://nlp.stanford.edu/software/tagger.shtml

The tagger is licensed under the GNU General Public License (v2 or later).

The GPL, like other licenses meeting the 'open source definition', has no restrictions on use -- only on proprietary distribution under nonfree licenses.



It doesn't seem to be handling multi-word proper nouns as I thought it would:

"Pride and Prejudice is a good book." becomes "Pride/NNP and/CC Prejudice/NNP is/VBZ a/DT good/JJ book/NN ./." I would have thought "Pride and Prejudice" would be lumped together.


It's a little confusing but this looks like a front-end to Stanford's Part-of-Speech tagger. POS taggers do not group multi-word tokens. This would be the role of a chunker or a parser.
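
To sketch what that chunking layer looks like with NLTK (the NAME rule below is made up for this one example; a real system would use a trained chunker or named-entity recognizer):

    import nltk

    tagged = [("Pride", "NNP"), ("and", "CC"), ("Prejudice", "NNP"),
              ("is", "VBZ"), ("a", "DT"), ("good", "JJ"),
              ("book", "NN"), (".", ".")]

    # A run of NNPs, optionally joined by coordinations, becomes a chunk.
    chunker = nltk.RegexpParser("NAME: {<NNP>+(<CC><NNP>+)*}")
    print(chunker.parse(tagged))
    # (S (NAME Pride/NNP and/CC Prejudice/NNP) is/VBZ ... book/NN ./.)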


Did you develop the underlying POS tagger or the web interface?


Shouldn't 'fuck' be a verb in this context?

go/VB fuck/NN yourself/PRP


Using my own tagger[1], trained using the Brown corpus:

go/VB fuck/VB yourself/PPL

It very much comes down to the amount and kind of training data and the features used (assuming that the methodology is sound).

[1] http://github.com/langkit/citar
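
For comparison, roughly the same experiment takes a few lines with NLTK's off-the-shelf pieces; whether 'fuck' comes out as a verb here again depends entirely on its distribution in the training data:

    import nltk  # needs: nltk.download('brown')
    from nltk.corpus import brown

    # Unigram tagger trained on Brown, falling back to NN for unknowns.
    tagger = nltk.UnigramTagger(brown.tagged_sents(),
                                backoff=nltk.DefaultTagger("NN"))
    print(tagger.tag(["go", "fuck", "yourself"]))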


Pretty much. If you replace it with unambiguous words,

go/VB lemon/NN yourself/PRP

doesn't sound good, whereas

go/VB shave/VB yourself/PRP

is fine. IMO, go should also be a VBP, not a VB.


Imperative, I think.



