Is it fair to say that the size of the "private" web (what Greplin aims to index) is, in aggregate, larger than the public web? And are there any amazing things that become possible once you've indexed a large portion of that private web?
Good question. As I understand it, Greplin has access to all of this private information, which gives them an incredible amount of power -- in some ways more than Google, which covers only one or two aspects (albeit large ones). Just for the sake of privacy, imagine Greplin agreeing to give up private user information to the gov't, like all these other companies have. They'd have access to everything.
Scares me a bit too much to sign up for the convenience.
I'm guessing that: a "document" on Twitter is a single tweet;
a "document" on Facebook is a wall-post or equivalent;
a "document" on GMail is an e-mail;
a "document" on Google Calendar is an appointment.
Therefore, the comparison with Google's web-wide index in 2001 is a little misleading (in terms of the amount of data), given that the average web page is larger than any of these.
Of course average size of a file on Dropbox is likely to be larger than a webpage. I wonder what percentage of those 1.5 billion documents are files on Dropbox.
greplin doesn't index content within files on dropbox, just the filenames.
I am building a startup that does exactly that: it indexes your doc/pdf files (more formats coming) and lets you instantly search through them. It's called grepfiles.com, but it's at a very early stage (pre-alpha), so go easy on it -- I'm not sure how well it scales. Mail me at mail@asif.in if you have any feedback. I'd really appreciate it.
What does the HN community think of the greplin concept? They have recently added a Chrome plugin and a greplin search replacement for standard email search.
I think it's a public beta and anyone can sign up; if not, ping me and I'll send you an invite.
My main concerns with the service are:
+ Centralized risk - keys to a very valuable kingdom
+ No two-factor auth - but they tell me it's coming
+ No word on whether they encrypt in storage - although it should only be an index of the information rather than the actual info
+ Standard SaaS / cloud risks - internal abuse, legal turnover, etc.
Any others? All of these could be mitigated to a reasonable degree. What do you think? Is there a future for this type of service (or big buyout for Google / Bing) or is it just too scary?
It's an incredible amount of personal data. If all that data was collected, then abused, I'd dissociate from much of my identity. I would just feel totally alienated by post-industrial society.
I'd be okay using Greplin if I knew Google was going to acquire them. I trust Google. I figure when Google goes bad, there will be much bigger issues facing humanity and our internet pasts will all be damning anyways.
> I'd be okay using Greplin if I knew Google was going to acquire them. I trust Google. I figure when Google goes bad, there will be much bigger issues facing humanity and our internet pasts will all be damning anyways.
I think it is a sign of the times that I read this paragraph, though it was a subtle joke, reread it and decided it was serious, and then did some more thinking about whether the author is serious or not.
I would get seriously excited about this if I could install it on my own server and keep my own index. I'm a bit hesitant to give them access to all of my accounts and data in exchange for a small convenience.
It is pretty impressive, though saying that it launched in February is misleading. I signed up last year, ran into a bunch of problems with it not indexing anything, and haven't opened it since. Now it looks like everything actually has been indexed, which is cool. I'm deleting my account for now though, as it doesn't yet seem easy enough to be useful for my purposes.
I regularly hear that "if I could install it on my own server" argument, and I wonder whether you think you can handle security and administration much better than someone who's paid to do it. I, for one, can't, and wouldn't want to waste my time on it.
I agree that is a good point. Perhaps it is a technology that would be better off not existing?
Security aside, one of my fears isn't hackers so much as legal entities making use of the private information illegally -- in addition to Greplin selling "me," in a very compact and precise form, to whoever they want.
Even if you trust that your own computer or server is more secure than Greplin's servers, your communication with others will be indexed on instances belonging to the people you communicate with as well.
So the question, even for the seasoned computer security expert who wants to use a distributed Greplin variant, is: do you trust your friends and colleagues to have better security on their home and work computers than Greplin can achieve with dedicated effort?
With a distributed system it would still be a non-trivial task to protect against a dedicated worm or trojan that infests the network and traces paths to other Greplin users after stealing all the data from each instance.
Since the data is social and each document in many cases concerns more than one person, it might actually be a less complicated task to achieve sufficient security in a central location.
There are security breaches all the time on systems set up and maintained by people who are 'paid to do it'. I don't think that's the best possible signal. People who are able to install complex software (something beyond WordPress-level ease of install) are possibly more capable than many 'paid to admin servers' admins. Not all, of course, but being paid to do something doesn't indicate 100% competence.
Likewise, someone installing something on their own machine out of privacy concerns can be said to have more vested in keeping things secure than the person who's only doing it as a job, maintaining a server with thousands of bits of data on it.
(a) It's a proxy for traction. Greplin indexes data that can't be crawled; users have to authorize it to index their data. So aside from how hard of an engineering feat it is, the fact that they've indexed this much data probably means that they have a sizable number of users.
(b) While you're right that the technical challenge of indexing that many documents is easier now than in 2001 thanks to things like AWS (and numerous open source projects), to do it with a team of six is still impressive.
My gripe was with the article and not with what Greplin is doing. Details like what you have mentioned in point (a) would have made the article much more useful, rather than multiplying some random number from 1998 and expecting the readers to have a wow moment. Some idea of how they actually index the items, how they store this massive data set, and how they keep search fast is what I would have liked to read.
While I understand that real-time full-text indexing is a much more difficult problem to solve, I've got just under 1.5 billion tweets "indexed" in TweetStats. And I'm one person.
Granted, given the 30MM/day number they must be growing that index very quickly and they've likely hit that 1.5 mark pretty darn quickly.
> real-time full-text indexing is a much more difficult problem to solve
Solve?
Greplin has probably not built their own search technology. I'd guess they're simply running Lucene or Sphinx like everyone else.
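For context, the core of what engines like Lucene and Sphinx provide is an inverted index: a mapping from each term to the documents containing it. A toy sketch of the idea (illustrative only -- not Greplin's actual code, and real engines add ranking, compression, and incremental segment merging on top):

```python
from collections import defaultdict

# Inverted index: term -> set of document ids containing that term.
index = defaultdict(set)

def add_document(doc_id, text):
    """Tokenize naively on whitespace and record each term."""
    for term in text.lower().split():
        index[term].add(doc_id)

def search(term):
    """Return the ids of all documents containing the term."""
    return index.get(term.lower(), set())

add_document(1, "meeting notes for Tuesday")
add_document(2, "Tuesday lunch invite")
```

With this structure a lookup is a single dictionary access, which is why even a billion-document index is primarily a storage problem rather than a query-time one.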
Their index is still small by search standards, as you can tell from TechCrunch having to reach 10 years back to make an "impressive" analogy.
Today, 1.5 billion documents translates to a couple terabytes of data (probably high single digit). 30 million documents indexed per day translates to about 350/sec. You could store and process all that on a single, beefy box. Or you could spread it out over a couple Amazon instances.
But yes, in 2001 this would have been impressive. In 2001 you'd pay $150 for a 40 GB harddrive...
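The figures above check out on the back of an envelope (the ~5 KB average document size is an assumption -- tweets are smaller, e-mails larger):

```python
DOCS = 1.5e9                 # total documents indexed
AVG_DOC_BYTES = 5 * 1024     # assumed average document size (~5 KB)
DOCS_PER_DAY = 30e6          # stated indexing rate

# Total raw input: ~7 TB, i.e. "high single digit" terabytes.
total_tb = DOCS * AVG_DOC_BYTES / 1024**4

# Sustained indexing rate: ~347 documents per second.
docs_per_sec = DOCS_PER_DAY / 86_400
```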
While Greplin is impressive, it's not on the same scale as Google, even in Google's early days. Google built one large global index for everyone, while Greplin builds many small indices, one per user. Some calculation illustrates the point.
Google's global index: 1 billion documents. Searchable by 1 million users. Need to support 1B x 1M search capacity.
Greplin's individual indices: 1,000 documents per user in each index. With 1 million users, that's 1B documents in total, but each user searches only his own 1K index, so they only need to support 1K x 1M search capacity.
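The asymmetry in per-query work is easy to quantify (using the round numbers from the comment above; the "documents scanned per query" model is deliberately simplified):

```python
USERS = 1_000_000
DOCS_PER_USER = 1_000

# Both systems hold the same 1 billion documents overall.
GLOBAL_DOCS = USERS * DOCS_PER_USER

# Google-style: every query runs against the shared global index.
google_work_per_query = GLOBAL_DOCS

# Greplin-style: each query touches only that user's small index.
greplin_work_per_query = DOCS_PER_USER
```

Under this model each Greplin query is a factor of a million cheaper, which is the crux of why the comparison to Google's 2001 index overstates the difficulty.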