Sure, I'm going to be writing a longer blog post on how I made it, but for now here's a short summary:
I made a script that scrapes all the links from Hacker News every 15 minutes. It then opens each link and processes the text with Python's nltk package, deciding which words are important and useful. The important words go into a suffix tree stored in a MongoDB backend, arranged so that looking up a word yields the set of documents pertaining to it. This makes the search linear in the length of the query rather than the number of documents. The rest was just some jQuery AJAX calls and parsing of the search query.
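To make the "linear in the length of the query" claim concrete, here's a minimal sketch of the idea (not the author's actual code): a character trie over word suffixes, where each node records the set of document ids reachable through it. Inserting every suffix of each keyword lets substring queries work, and a lookup only walks as many nodes as the query has characters.

```python
class SuffixTrieIndex:
    """Character trie over word suffixes; each node keeps the set of
    document ids whose keywords pass through it. A simplified,
    in-memory stand-in for the suffix tree described above."""

    def __init__(self):
        self.root = {}

    def add(self, word, doc_id):
        # Index every suffix of the word so substring queries match too.
        for i in range(len(word)):
            node = self.root
            for ch in word[i:]:
                node = node.setdefault(ch, {"docs": set()})
                node["docs"].add(doc_id)

    def search(self, query):
        # Walk the trie once: cost is O(len(query)), independent of
        # how many documents have been indexed.
        node = self.root
        for ch in query:
            if ch not in node:
                return set()
            node = node[ch]
        return node.get("docs", set())
```

In the real system each trie node would live as a MongoDB document rather than an in-memory dict, but the lookup logic is the same.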
I'll look into a new design, maybe make the oranges white and the whites orange.
Great and snappy! I would really love the longer blog post...! Two questions if you have a minute: 1) Why suffix trees and not suffix arrays? 2) How are you implementing them? Did you do the tree building yourself or is there a good library that you recommend? Thanks.
I used a suffix tree over a suffix array because I hadn't heard of suffix arrays, but after glancing at the Wikipedia page it seems they might have been a good choice too. I'll look more into it. I did all the tree building myself, and I'll explain that in my post. The post should be ready by tomorrow.
For this type of thing I've had much better results with LDA in the Python package gensim. It is less prone to mismatches based on similar keywords, since it is context-based. The problem with LDA is that for it to be most effective you need a taxonomy available for the documents, but you might be able to build a corpus or two out of sites like Stack Overflow.
BTW, the "full orange" strains my eye. Maybe it's just me, but it would be nice if you could have softer colours!