Is it fair to say that the size of the "private" web (what Greplin aims to index) is, in aggregate, larger than the public web? And are there any amazing things that become possible once you've indexed a large portion of that private web?
Good question. As I understand it, Greplin has access to all of this private information, which gives them an incredible amount of power -- in some ways more than Google, which covers only one or two aspects (albeit large ones). Just for the sake of privacy, imagine Greplin agreeing to give up private user information to the gov't, like all these other companies have. They'd have access to everything.
Scares me a bit too much to sign up for the convenience.
I'm guessing that: a "document" on Twitter is a single tweet;
a "document" on Facebook is a wall-post or equivalent;
a "document" on GMail is an e-mail;
a "document" on Google Calendar is an appointment.
Therefore, the comparison with Google's web-wide index in 2001 is a little misleading (in terms of the amount of data), given that the average web page is larger than any of these.
Of course average size of a file on Dropbox is likely to be larger than a webpage. I wonder what percentage of those 1.5 billion documents are files on Dropbox.
greplin doesn't index content within files on dropbox, just the filenames.
I am building a startup that does exactly that: it indexes your doc/pdf files (more formats coming) and lets you instantly search through them. It's called grepfiles.com, but it's at a very early stage (pre-alpha), so go easy on it -- I'm not sure how well it scales. Mail me at mail@asif.in if you have any feedback. I'd really appreciate it.
What does the HN community think of the greplin concept? They have recently added a Chrome plugin and a greplin search replacement for standard email search.
I think it's a public beta and anyone can sign up; if not, ping me and I'll send you an invite.
My main concerns with the service are:
+ Centralized risk - keys to a very valuable kingdom
+ No two-factor auth - but they tell me it's coming
+ No word on whether they encrypt in storage - although it should only be an index of the information rather than the actual info
+ Standard SaaS / cloud risks - internal abuse, legal turnover, etc.
Any others? All of these could be mitigated to a reasonable degree. What do you think? Is there a future for this type of service (or big buyout for Google / Bing) or is it just too scary?
It's an incredible amount of personal data. If all that data was collected, then abused, I'd dissociate from much of my identity. I would just feel totally alienated by post-industrial society.
I'd be okay using Greplin if I knew Google was going to acquire them. I trust Google. I figure when Google goes bad, there will be much bigger issues facing humanity and our internet pasts will all be damning anyways.
> I'd be okay using Greplin if I knew Google was going to acquire them. I trust Google. I figure when Google goes bad, there will be much bigger issues facing humanity and our internet pasts will all be damning anyways.
I think it is a sign of the times that I read this paragraph, though it was a subtle joke, reread it and decided it was serious, and then did some more thinking about whether the author is serious or not.
I would get seriously excited about this if I could install it on my own server and keep my own index. I'm a bit hesitant to give them access to all of my accounts and data in exchange for a small convenience.
It is pretty impressive, though saying that it launched in February is misleading. I signed up last year, ran into a bunch of problems with it not indexing anything, and haven't opened it since. Now it looks like everything actually has been indexed, which is cool. I'm deleting my account for now though, as it doesn't yet seem easy enough to be useful for my purposes.
I regularly hear that "if I could install it on my own server" argument, and I wonder whether you think you can handle security and administration much better than someone who's paid to do it. I, for one, can't, and wouldn't want to waste my time on it.
I agree that is a good point. Perhaps it is a technology that would be better off not existing?
Security aside, one of my fears isn't hackers so much as legal entities making use of the private information illegally -- in addition to Greplin selling "me," in a very compact and precise form, to whoever they want.
Even if you trust that your own computer or server is more secure than Greplin's servers, your communication with others will be indexed on instances belonging to the people you communicate with as well.
So the question, even for the seasoned computer security expert who wants to use a distributed Greplin variant, is: do you trust your friends and colleagues to have better security on their home and work computers than Greplin can achieve with dedicated effort?
With a distributed system it would still be a non-trivial task to protect against a dedicated worm or trojan that infests the network and traces paths to other Greplin users after stealing all the data from each instance.
Since the data is social and each document in many cases concerns more than one person, it might actually be a less complicated task to achieve sufficient security in a central location.
There are security breaches all the time on systems set up and maintained by people who are 'paid to do it'. I don't think that's the best possible signal. People who are able to install complex software (something beyond WordPress-level ease of install) are possibly more capable than many 'paid to admin servers' admins. Not all, of course, but being paid to do something doesn't indicate 100% competence.
Likewise, someone installing something on their own machine out of privacy concerns can be said to have more vested in keeping things secure than the person who's only doing it as a job, maintaining a server with thousands of bits of data on it.
(a) It's a proxy for traction. Greplin indexes data that can't be crawled; users have to authorize it to index their data. So aside from how hard of an engineering feat it is, the fact that they've indexed this much data probably means that they have a sizable number of users.
(b) While you're right that the technical challenge of indexing that many documents is easier now than in 2001 thanks to things like AWS (and numerous open source projects), to do it with a team of six is still impressive.
My gripe was with the article and not with what Greplin is doing. Details like what you have mentioned in point (a) would have made the article much more useful, rather than multiplying some random number from 1998 and expecting the readers to have a wow moment. Some idea of how they actually index the items, how they store this massive data set, and how they keep search fast is what I would have liked to read.
While I understand that real-time full-text indexing is a much more difficult problem to solve, I've got just under 1.5 billion tweets "indexed" in TweetStats. And I'm one person.
Granted, given the 30MM/day number they must be growing that index very quickly and they've likely hit that 1.5 mark pretty darn quickly.
> real-time full-text indexing is a much more difficult problem to solve
Solve?
Greplin has probably not built their own search technology. I'd guess they're simply running Lucene or Sphinx like everyone else.
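For context, the core of what engines like Lucene and Sphinx provide is an inverted index: a mapping from each term to the documents containing it. A toy sketch of the idea (illustrative only -- not Greplin's actual code, and real engines add ranking, compression, and incremental segment merging on top):

```python
from collections import defaultdict

# Inverted index: term -> set of document ids containing that term.
index = defaultdict(set)

def add_document(doc_id, text):
    """Tokenize naively on whitespace and record each term."""
    for term in text.lower().split():
        index[term].add(doc_id)

def search(term):
    """Return the ids of all documents containing the term."""
    return index.get(term.lower(), set())

add_document(1, "meeting notes for Tuesday")
add_document(2, "Tuesday lunch invite")
```

With this structure a lookup is a single dictionary access, which is why even a billion-document index is primarily a storage problem rather than a query-time one.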
Their index is still small by search standards, as you can tell from TechCrunch having to reach 10 years back to make an "impressive" analogy.
Today, 1.5 billion documents translates to a couple terabytes of data (probably high single digit). 30 million documents indexed per day translates to about 350/sec. You could store and process all that on a single, beefy box. Or you could spread it out over a couple Amazon instances.
But yes, in 2001 this would have been impressive. In 2001 you'd pay $150 for a 40 GB harddrive...
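The figures above check out on the back of an envelope (the ~5 KB average document size is an assumption -- tweets are smaller, e-mails larger):

```python
DOCS = 1.5e9                 # total documents indexed
AVG_DOC_BYTES = 5 * 1024     # assumed average document size (~5 KB)
DOCS_PER_DAY = 30e6          # stated indexing rate

# Total raw input: ~7 TB, i.e. "high single digit" terabytes.
total_tb = DOCS * AVG_DOC_BYTES / 1024**4

# Sustained indexing rate: ~347 documents per second.
docs_per_sec = DOCS_PER_DAY / 86_400
```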
While Greplin is impressive, it's not on the same scale as Google, even in Google's early days. Google built one large global index for everyone, while Greplin builds many small indices, one per user. Some calculation illustrates the point.
Google's global index: 1 billion documents. Searchable by 1 million users. Need to support 1B x 1M search capacity.
Greplin's individual indices: 1,000 documents per user in each index. With 1 million users, that's 1B documents in total, but each user searches only his own 1K index, so they only need to support 1K x 1M search capacity.
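The asymmetry in per-query work is easy to quantify (using the round numbers from the comment above; the "documents scanned per query" model is deliberately simplified):

```python
USERS = 1_000_000
DOCS_PER_USER = 1_000

# Both systems hold the same 1 billion documents overall.
GLOBAL_DOCS = USERS * DOCS_PER_USER

# Google-style: every query runs against the shared global index.
google_work_per_query = GLOBAL_DOCS

# Greplin-style: each query touches only that user's small index.
greplin_work_per_query = DOCS_PER_USER
```

Under this model each Greplin query is a factor of a million cheaper, which is the crux of why the comparison to Google's 2001 index overstates the difficulty.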