So I was out to dinner tonight and got an email from my VPS saying my disk IO was high. At first I thought someone might have hacked my server, but it turns out that someone posted the link to my site on HN!
I've been working on this website for the past couple of weeks. Please email me at andrew [AT] naturalparsing [DOT] com if you have any questions or suggestions, or want to use the API.
FYI, right now the API is not using all the capabilities of the Stanford Parser, just the word tagging part. More features will be implemented soon. Let me know if you have any specific requests.
And there are unknown words (words that are not covered by the lexicon).
But the top-poster is right: the linked website does part-of-speech tagging, not parsing.
Providing a wide-coverage parser for the web is still hard. The number of possible parses for long sentences is enormous. Even if a sentence is not ambiguous to us, grammars allow all kinds of ambiguities.
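To give a rough sense of "enormous": even just counting the possible binary bracketings of an n-word string (ignoring the grammar entirely) already gives the Catalan numbers, which blow up quickly. A quick back-of-the-envelope sketch in Python:

    from math import comb

    # Number of binary bracketings of an n-word string: the (n-1)th Catalan
    # number. This ignores the grammar completely; it only shows the growth rate.
    def catalan(n):
        return comb(2 * n, n) // (n + 1)

    for n in (5, 10, 20, 40):
        print(n, catalan(n - 1))   # 14, 4862, ~1.8e9, ~7e20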
There is a web demo for a Dutch system developed in our research group[1], but it uses heap-size and time limits to exit gracefully when parsing takes too much time or memory (and sentences of more than 20 words are ignored, since for those you really want offline parsing).
Statistical methods do quite a good job of pushing that ambiguity back under the rug (see link to Berkeley parser further up, or the work on statistical disambiguation in Alpino).
As to the time consumption and complexity, that's a known problem of unification grammars (or any grammar that does a bit more) - but see this paper by Matsuzaki et al. for efficient techniques to speed this up: http://www-tsujii.is.s.u-tokyo.ac.jp/~matuzaki/paper/ijcai20...
True, but you still have to build up a forest from which every parse can be extracted (as Alpino does). Of course, it does reduce the cost of ambiguity. E.g., I implemented packing in the Alpino chart generator, and there are very many edges with the same 'semantics' that can be packed, especially since Alpino allows for a lot of optional punctuation, and there are often roots plus subcategorization frames that give multiple inflections. With a beam search, unpacking is a lot more efficient.
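For the curious, the packing idea itself is tiny - roughly something like this (hypothetical data structures, nothing like Alpino's actual chart):

    from collections import defaultdict

    # Edges with the same span, category and semantics are packed under one
    # signature; the alternative derivations are kept as back-pointers and
    # only unpacked when results are read off the chart.
    chart = defaultdict(list)

    def add_edge(span, category, semantics, derivation):
        key = (span, category, semantics)   # packing signature
        chart[key].append(derivation)       # equivalent edges share one entry

    # Two derivations differing only in optional punctuation get packed together:
    add_edge((0, 2), "np", "def(book)", ("rule_a", "no_comma"))
    add_edge((0, 2), "np", "def(book)", ("rule_b", "with_comma"))
    print(len(chart))                       # 1 packed entry, 2 packed derivations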
There are also many other possible optimizations. E.g., with a left-corner parser you can exclude left-corner spines that are unlikely to lead to a probable parse (Van Noord, Learning Efficient Parsing, 2009, and other work). And, of course, you can reduce lexical ambiguity by restricting subcategorization frames with a part-of-speech tagger.
Still, even in this optimized, 1-best scenario, real-time parsing of long sentences is hard. So when parsing large corpora we usually apply time and space limits (which is easy to do in SICStus Prolog, with good recovery).
Nope. The parts of speech of words are ambiguous, so you need some statistical method of disambiguation. Plus, you need something to guess the possible tags of unknown words (e.g. speakers of English can guess that 'supercalifragilistiexpialigetic' is an adjective even when they see it for the first time, because they can generalize from other words ending in -etic).
Still, I wouldn't call part-of-speech tagging parsing.
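The unknown-word part is easy to sketch, by the way. A toy suffix-based guesser (no proper smoothing, just the generalization idea):

    from collections import Counter, defaultdict

    # Collect tag counts per word ending from tagged training words, then guess
    # the tag of an unseen word from its longest suffix seen in training.
    # Real taggers (TnT, the Stanford tagger, ...) smooth this far more carefully.
    SUFFIX_LEN = 4
    suffix_tags = defaultdict(Counter)

    def train(tagged_words):
        for word, tag in tagged_words:
            for i in range(1, SUFFIX_LEN + 1):
                suffix_tags[word[-i:].lower()][tag] += 1

    def guess(word):
        for i in range(SUFFIX_LEN, 0, -1):       # prefer the longest known suffix
            counts = suffix_tags.get(word[-i:].lower())
            if counts:
                return counts.most_common(1)[0][0]
        return "NN"                              # last-resort fallback

    train([("energetic", "JJ"), ("sympathetic", "JJ"), ("magnetic", "JJ")])
    print(guess("supercalifragilistiexpialigetic"))   # -> JJ, by analogy with -etic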
This is exactly the sort of thing that makes a trip to HN worthwhile! Very cool. Thanks to high-school Latin I was able to make a bit of sense of the table of parts of speech. Shame I haven't got a use for it ATM.
Darn, now I'm tempted to write a pure javascript POS tagger just to show that you don't really need anything server-side (or maybe just little bits here and there so the web page doesn't need to load a 20MB model right away - the computational effort, in any case, is not so bad that you couldn't do it in JS).
Hmm. Maybe sometime.
EDIT: The Stanford POS tagger is more complex and quite a bit slower than anything you'd do on your own. To quantify this, there are methods that are 10x as fast while sacrificing only 0.05%-0.2% accuracy. (Or the easy ones that are 100x as fast but about 1% less accurate - those would be fun to do in JS.)
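(The really fast published taggers are smarter than this, but just to show how little code the cheap end needs, here's a toy lookup tagger - most frequent tag per word from the Brown corpus, NN fallback. It gives up a lot more accuracy than the methods above, of course.)

    import nltk
    from collections import Counter, defaultdict

    # Toy lookup tagger: tag each word with its most frequent Brown-corpus tag,
    # fall back to NN for unseen words. Very fast, but clearly a baseline only.
    nltk.download("brown", quiet=True)

    counts = defaultdict(Counter)
    for word, tag in nltk.corpus.brown.tagged_words():
        counts[word.lower()][tag] += 1
    lookup = {word: c.most_common(1)[0][0] for word, c in counts.items()}

    def tag(sentence):
        return [(w, lookup.get(w.lower(), "NN")) for w in sentence.split()]

    print(tag("the quick brown fox jumps over the lazy dog"))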
I don't have a JS version, but here's a functioning HMM tagger in under 300 lines of Python:
http://gist.github.com/503784
(model loading doesn't work yet for some reason, but you see what it's doing in principle).
This uses a smoothed trigram HMM, so it should, in principle, be a bit better than NLTK's HMM tagger, but not as good as serious POS tagging packages (e.g. hunpos or the Stanford POS tagger).
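If you don't want to dig through the gist: the decoding step of any HMM tagger is just Viterbi over transition and emission scores. A stripped-down bigram sketch (not the gist's actual code; unseen events get a flat penalty here instead of real smoothing):

    def viterbi(words, tags, trans, emit):
        """Best tag sequence under a bigram HMM. trans[(prev_tag, tag)] and
        emit[(tag, word)] hold log-probabilities; missing entries get a flat
        penalty, which stands in for proper smoothing."""
        def score(table, key):
            return table.get(key, -20.0)

        chart = {"<s>": 0.0}     # best log-prob of any path ending in each tag
        backptrs = []            # per position: best previous tag for each tag
        for word in words:
            new_chart, ptrs = {}, {}
            for tag in tags:
                new_chart[tag], ptrs[tag] = max(
                    (chart[prev] + score(trans, (prev, tag)) + score(emit, (tag, word)),
                     prev)
                    for prev in chart)
            chart, backptrs = new_chart, backptrs + [ptrs]

        best = max(chart, key=chart.get)         # best final tag
        seq = [best]
        for ptrs in reversed(backptrs[1:]):      # follow the back-pointers
            best = ptrs[best]
            seq.append(best)
        return list(reversed(seq))

    # Tiny made-up model, just to show the call:
    tags = ["DT", "NN"]
    trans = {("<s>", "DT"): -0.1, ("DT", "NN"): -0.1}
    emit = {("DT", "the"): -0.1, ("NN", "dog"): -0.1}
    print(viterbi(["the", "dog"], tags, trans, emit))   # -> ['DT', 'NN']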
To me this looks like a classic case of where it makes sense to keep this kind of processing on the server and make it available as a service - what functional advantages would having a POS tagger locally give you (apart from offline access)?
Once the model is loaded, you get around the ~200ms round-trip latency it takes to reach the server.
Obviously, there's lots of ways to hide these 200ms from the user.
Sure, it is possible, but what does it add? It's not as if a Hidden Markov Model tagger takes that many server resources. Or, if you are really worried, you could build an unweighted finite-state transducer (Roche & Schabes, 1995) ;).
The tagger is licensed under the GNU General Public License (v2 or later).
The GPL, like other licenses meeting the 'open source definition', has no restrictions on use -- only on proprietary distribution under nonfree licenses.
It doesn't seem to be handling multi-word proper nouns as I thought it would:
"Pride and Prejudice is a good book." becomes "Pride/NNP and/CC Prejudice/NNP is/VBZ a/DT good/JJ book/NN ./." I would have thought "Pride and Prejudice" would be lumped together.
It's a little confusing but this looks like a front-end to Stanford's Part-of-Speech tagger. POS taggers do not group multi-word tokens. This would be the role of a chunker or a parser.
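For what it's worth, you can get that kind of grouping on top of the tagger output with a simple rule-based chunker, e.g. NLTK's RegexpParser (the chunk rule here is just a toy for this one example):

    import nltk

    # Chunk runs of proper nouns, optionally joined by a coordinating
    # conjunction, into a single NE node on top of the tagger's output.
    tagged = [("Pride", "NNP"), ("and", "CC"), ("Prejudice", "NNP"), ("is", "VBZ"),
              ("a", "DT"), ("good", "JJ"), ("book", "NN"), (".", ".")]
    chunker = nltk.RegexpParser("NE: {<NNP>+<CC>?<NNP>*}")
    print(chunker.parse(tagged))
    # 'Pride and Prejudice' ends up under one NE node in the resulting tree.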