Gmail's Second Major Problem (kapilkale.com)
61 points by kapilkale on Jan 24, 2013 | 72 comments


The issue that you're complaining about was due to the incident linked at [1].

I can't go into the technical details as to why this happened, but I can roughly explain that it was due to the CAP theorem, essentially "Consistency, Availability, Partition Tolerance. Choose two." [2]

Furthermore, you have to choose partition tolerance [3]. The delivery delays that were seen yesterday happened because we choose consistency over availability in our systems.

In fact, most of the outages I see people complain about on Hacker News related to Gmail are because we won't sacrifice consistency of user accounts. It's a different problem than huge scale serving of web search indexes or facebook timelines because in those cases if you're missing a few entries most people won't notice or care. When you're searching for an email, you know what email you expect to find and you'll get angry if it isn't there.

Users won't stand for an email showing up one day, disappearing next hour, and then coming back later (which is what could happen in some designs for eventual consistency when serving from different datacenters).

Thus, Gmail availability is lower sometimes because we make sure that all of your data is there all the time. We're insane about it, and we have huge jobs that run constantly on our systems to ensure that we're even resilient to bad hardware. With those we regularly find single bit errors and bad CPUs.

So, as a Gmail engineer, I'm sorry that there were delivery delays yesterday, and all I can say is that every time these happen we tweak and redesign our systems to make these more rare and to improve Gmail's uptime. We'll never have the snappy response and perfect uptime[4] of a computer under your desk. But at the same time a hurricane could take out one of our datacenters and we won't lose your data.

-Andrew, a Gmail Engineer.

1. http://www.google.com/appsstatus#hl=en&v=issue&ts=13... 2. http://en.wikipedia.org/wiki/CAP_theorem 3. http://codahale.com/you-cant-sacrifice-partition-tolerance/ 4. For some definitions of perfect.
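
To make the consistency-over-availability trade-off concrete, here is a minimal Python sketch. It is not Gmail's design, which the comment above deliberately does not describe. It assumes a simple majority-quorum rule, and the names `Replica`, `quorum_write` and `Unavailable` are hypothetical. A write is refused, which a user would experience as a delay, whenever a majority of replicas cannot be reached.

```python
from dataclasses import dataclass, field


class Unavailable(Exception):
    """Raised when a write cannot reach a majority of replicas."""


@dataclass
class Replica:
    reachable: bool = True
    data: dict = field(default_factory=dict)


def quorum_write(replicas, key, value):
    """Apply the write only if a strict majority of replicas is reachable."""
    live = [r for r in replicas if r.reachable]
    if len(live) <= len(replicas) // 2:
        # Consistency-first choice: refuse (or delay) the write instead of
        # accepting it on a minority that the other replicas never see.
        raise Unavailable(f"only {len(live)} of {len(replicas)} replicas reachable")
    for r in live:
        r.data[key] = value
    return len(live)


replicas = [Replica(), Replica(), Replica()]
acked = quorum_write(replicas, "msg-1", "hello")  # all three replicas reachable

# Simulate a partition that cuts off two of the three replicas.
replicas[1].reachable = False
replicas[2].reachable = False
try:
    quorum_write(replicas, "msg-2", "world")
    partition_result = "accepted"
except Unavailable:
    partition_result = "delayed"
```

An availability-first system would accept `msg-2` on the one reachable replica and reconcile later, which is where the "email shows up, disappears, then comes back" behaviour could come from.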


Thanks Andrew- this is really helpful.


The author is furious about a technology that was not designed for instantaneous delivery (SMTP) and fumes about it, because it isn't instantaneous. What's worse is, he quickly goes into some kind of 'super-hero mode' and makes a pretty heavy claim that this is Gmail's second biggest problem.

I think this is the best approach - If someone hands over you a free glass of wine, which they've tried their level best to make it perfect, you just drink it instead of trying to suddenly become a food critic and blame the person who gave you the beer. If you don't like it, don't drink it. Buy your own beer from somewhere else. As simple as that!

Tell you what, you should try Yahoo!, I bet.


Well, I'd at least complain that the wine somehow turned into beer...


Isn't the 115 minute lag in the screenshot lag between one Google SMTP server and another Google SMTP server? That seems like a problem Google should be able to address.
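
For anyone who wants to check where a delay like this occurred, the per-hop timing can be read from a message's `Received` headers. The following is a rough Python sketch using only the standard library. The sample message and host names are made up, and it assumes every `Received` header ends with `; <date>`, which real-world headers do not always honour.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def hop_delays(raw_message):
    """Return the delay in seconds between consecutive Received hops."""
    msg = message_from_string(raw_message)
    stamps = []
    # Each relay prepends its own header, so the list is newest-first.
    for header in msg.get_all("Received", []):
        # The timestamp is the part after the last ';' in the header.
        _, _, date_part = header.rpartition(";")
        stamps.append(parsedate_to_datetime(date_part.strip()))
    stamps.reverse()  # oldest hop first
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


# Made-up message with two hops that are 115 minutes apart.
SAMPLE = (
    "Received: by mx2.example.net; Thu, 24 Jan 2013 12:55:00 -0800\n"
    "Received: by mx1.example.net; Thu, 24 Jan 2013 11:00:00 -0800\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

delays = hop_delays(SAMPLE)
```

A large gap between two hops that belong to the same provider points at that provider's internal queueing rather than at the sender or the network in between.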


> That seems like a problem Google should be able to address.

This is part of the problem.

SMTP and the whole E-Mail thing is inherently unreliable and slow. Speaking for myself, but suspecting this is true for most heavy Gmail users, Gmail feels so good because it seems more reliable than using any other Web mail service. It is even more reliable when both ends use Gmail -- attachments are almost a non-issue within Gmail. Before Gmail, mails could take ages to be delivered, get silently lost in spam filters/folders, or not arrive at all.

When we ask Google to be more reliable in that part, we ask them to not do E-Mail but proprietary Magic-Gmail-Messaging. This is part of the problem, not the solution; this is why Facebook is doing E-Mail now. (Ironically FB Messaging seems in some way even less reliable than the Gmail Messaging.)

BTW, what's up with this X.400 thing?


> If someone hands over you a free glass of wine, which they've tried their level best to make it perfect, you just drink it instead of trying to suddenly become a food critic

I think this is a bogus argument. If someone hands me a free glass of "wine" and it turns out to be undrinkable swill instead of wine, I'm pretty sure I would point this out.

With regards to email, people expect email to be fast, and if your email servers routinely take a long time to make a delivery, you are not giving people what they would expect from your service, free or not.

I say this as someone who was a sysadmin for a number of years and maintained email servers. If our email servers took two hours to deliver a piece of email, and it was the fault of our servers, I would have considered this to be an urgent issue to resolve.

All this being said, I'm pretty happy with Gmail myself.


Can you please justify how Gmail, an Email client used by probably millions of users suddenly turns out 'un-drinkable' to the author (and to you)?

Can you send E-mails? Yes.

Can you receive E-mails? Yes.

Can you forward E-mails? Yes.

(^Note: This is a paid feature on other popular clients)

Can you read all the E-mails you've sent and received? Yes.

Are you paying for it? No.

So, how does this become 'un-drinkable'? Your e-mail was delivered despite the delay. It would be fair to call it undrinkable if your email wasn't delivered. AFAIK No one promised you they would deliver it within 'x' minutes.

For heaven's sake, it's a bug. People make mistakes and software contains bugs; I'm sure, with all your experience put together, you know this better than I do. Why make such a huge fuss out of this? How does such a minor problem affecting only a fraction of Gmail users suddenly become Gmail's Second biggest problem?


> Can you please justify how Gmail, an Email client used by probably millions of users suddenly turns out 'un-drinkable' to the author (and to you)?

If I routinely experienced delays of two hours with Gmail, I would find it undrinkable. For instance, I arrange to meet people using Gmail. Sometimes for important events. E.g., my girlfriend tells me at what time I should come over, and she often gives me little notice. Or we'll arrange when and where to meet for a concert shortly before the event.

I've been using email for this purpose for my entire adult life, and it has almost always worked for this purpose just fine.

Since I don't typically experience the kind of delays mentioned by the OP, I am happy with Gmail. If I did experience them, I would be as unhappy with Gmail as the OP.


> How does such a minor problem affecting only a fraction of Gmail users suddenly become Gmail's Second biggest problem?

In my opinion the problem is this: the larger the mailbox or the more often you use Gmail, the higher the odds of facing the described problems. I.e. the heavy users have most problems. So there is the problem, it's just a matter of time until someone else solves it. When someone else solves it, it means those heavy users will migrate to this someone else. (Remember they were also the first who actually had any reasonable use-case in Gmail, that used to be advertised as Mailbox for messies/people with many mails.) The rest will just follow automatically, it's always like that. ;)


Also, is it your claim that if it frequently took two weeks for Gmail to deliver some emails, then that would be OK because Gmail is free? After all, even if it took two weeks, you could still send E-mails, receive E-mails, and forward E-mails.

For many purposes, two hours might as well be two weeks or two years.

Barring network outages and the like, people expect email to be delivered within two minutes, not two years, two weeks, or two hours.


No doubt SMTP is asynchronous, but a lag of 115 minutes? For a consumer that is way too high to live with, whatever the standards say.

But obviously the title of the article and probably the topic borders on a flame-bait.


I guess you weren't around for FIDOnet. It could be days before you got an email.

Plus we had to walk to school in the snow - uphill both ways.


Your problem is that you're relying on a method of communication that was never guaranteed to be instantaneous. Maybe you should investigate other forms of communication within your team besides (or in conjunction with) email.


Exactly what I was thinking. SMTP is generally store and forward. Your message gets thrown in a bin and the MTA gets to it when it can. I can't find anything in the actual SMTP RFC [1] indicating that there are limits to how long an email can stay in queue.

Google processes a lot of email. The primary constraint limiting how long their MTA backlogs can be is customer satisfaction. Most SMTP servers I've worked on are happy to retry for 72 hours or more.

1: https://tools.ietf.org/html/rfc821
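
To illustrate the store-and-forward behaviour described above, here is a small Python simulation of a retry queue. The 72-hour give-up window comes from the comment. The 15-minute base interval, the doubling backoff and the 4-hour cap are assumptions made for the sketch and are not taken from any particular MTA or from the RFC.

```python
from dataclasses import dataclass

MAX_QUEUE_SECONDS = 72 * 3600   # give-up window mentioned in the comment above
BASE_DELAY_SECONDS = 15 * 60    # hypothetical first retry interval
MAX_DELAY_SECONDS = 4 * 3600    # hypothetical cap on the retry interval


@dataclass
class QueuedMessage:
    queued_at: float
    attempts: int = 0


def next_attempt(msg, now):
    """Return the time of the next retry, or None once the message should bounce."""
    if now - msg.queued_at >= MAX_QUEUE_SECONDS:
        return None
    # Double the wait after every failed attempt, up to the cap.
    delay = min(BASE_DELAY_SECONDS * 2 ** (msg.attempts - 1), MAX_DELAY_SECONDS)
    return now + delay


def simulate(fail_for_seconds):
    """Simulate a destination that rejects mail for fail_for_seconds.

    Returns (outcome, seconds_spent_in_queue, attempts).
    """
    msg = QueuedMessage(queued_at=0.0)
    now = 0.0
    while True:
        msg.attempts += 1
        if now >= fail_for_seconds:
            return "delivered", now, msg.attempts
        retry_at = next_attempt(msg, now)
        if retry_at is None:
            return "bounced", now, msg.attempts
        now = retry_at


one_hour_outage = simulate(3600)
permanent_outage = simulate(float("inf"))
```

With these made-up intervals, a destination that is unreachable for one hour produces a 105-minute delivery delay, because the message waits for its next scheduled retry rather than for the moment the destination recovers.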


100% agreed.

I try to treat email as any other letter I receive(1). Meaning that the sender can only expect an instant reply if I know beforehand the mail is coming and urgent. The same way one would rapidly reply to an urgent letter one is expecting, but easily take a day or more to reply to, or even open, a non urgent letter.

If you need to get a hold of someone directly you should use your phone (to call them). If they don't pick up their phone, leave a voicemail or send an sms.

(1) Registration confirmations / new password emails are an obvious exception here.


...and of course even voicemail / SMS are not guaranteed to be seen anytime soon!

[If at all; I know plenty of people who simply never check their voicemail...]


Exactly my thoughts. Not to downplay the problem, because it definitely is one Google should at the very least address, but even if Gmail were up 100% of the time with no problems, there should always be communication redundancy: instant messaging, file hosting services, other email providers.


Sure, but email is still the preferred form of external communication with customers and clients.


Doesn't change the fact that this preferred form of communication by no means guarantees delivery in a certain timeframe. It's not a Google problem. The email can be "stuck" in every place between you and the recipient's mailbox.


To highlight this, I can think of an email that got delivered YEARS after I hit send. This happened once in almost 20 years of emailing, yet it still happened. This probably occurred about 6 years ago.


I pay over $800 a month for Gmail (ok, for apps for business for many users), and we STILL have this issue, and while we can indeed call Ireland to get support, the answer is always, "We are having a delivery issue. Our engineers are working on it."

So paying doesn't help.


> So paying doesn't help.

Well, you can now say "Google knows about the issue" to your clients/bosses rather than "fuck if I know".


Whaaat, SMTP != Instant Delivery!? You must be joking!

As much as i can understand the pain of an email not being delivered after 2 hours this seems to be too much drama. "happens 2 times in a month" is not a "major problem" for a free service especially since i suppose this user is part of a rather small minority. I never noticed substantial delivery lags myself over the past years.

Anyway, if you say "i'd pay 50$/month" please email me, i'll be happy to provide you with a very overpriced mail account with same-second delivery! :)


As funny as you think it might be to charge $50 a month for guaranteed quick delivery, this is exactly what he actually wants. Large random delays in delivery are costing him way more, so he's willing to pay to fix the pain. Instant delivery isn't in the RFC, but conformance with the spec is not something he cares about. He cares about fast delivery.


Just as you say: What he wants and what he has have nothing to do with each other. I want my pocket to contain 1 million dollars. It's not how it works, unfortunately :(

Look, there are many many reasons why an email may not arrive in time or at all at the destination. To say "i lose so much money but i rely on such an unreliable protocol" is just not the mistake of google.

And as i said, for $50 you can easily buy 2 virtual machines in two different locations, one domain and have virtually a no-downtime mail service of your own. Add some roundcube, clamav, spamassassin and there you go. Setup takes a bit, fine, but that's about it. I guess it wouldn't even take that long. It won't guarantee delivery either, but mails won't be stuck in some MTA ;)


That's assuming your time is worth nothing. Also, there's much more to reliability than having servers in more than one location.


> He cares about fast delivery.

Then he can find a better solution than email. Make a phone call and upload a file via sftp to a remote server. That's how I do deployments here at Insurasoft™.


Looks like Greylisting to me: https://en.wikipedia.org/wiki/Greylisting


Yep. I used to work at DreamHost and our customers would run into issues with Gmail greylisting incoming mail. In fact, they do it like crazy.

In a lot of cases, adding an SPF record for the sending domain would clear things up.
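
For readers unfamiliar with greylisting, here is a minimal Python sketch of the general technique as commonly described. It is not a description of how Gmail filters mail. The `Greylist` class, the 5-minute window and the reply strings are all illustrative assumptions.

```python
GREYLIST_DELAY_SECONDS = 5 * 60  # hypothetical minimum wait before a retry is accepted


class Greylist:
    def __init__(self, delay=GREYLIST_DELAY_SECONDS):
        self.delay = delay
        self.first_seen = {}  # (client_ip, sender, recipient) -> first attempt time

    def check(self, client_ip, sender, recipient, now):
        """Return an SMTP-style reply for this delivery attempt."""
        triplet = (client_ip, sender, recipient)
        seen = self.first_seen.setdefault(triplet, now)
        if now - seen < self.delay:
            # Temporary failure: a well-behaved MTA queues the mail and retries.
            return "451 try again later"
        return "250 OK"


greylist = Greylist()
attempt = ("192.0.2.10", "sender@example.org", "user@example.com")
first = greylist.check(*attempt, now=0)     # unknown triplet, deferred
early = greylist.check(*attempt, now=60)    # retried too soon, still deferred
late = greylist.check(*attempt, now=600)    # retried after the window, accepted
```

The delay the recipient actually sees is set by the sending server's retry schedule, not by the greylist window, so a sender that retries only every couple of hours turns a 5-minute greylist into a multi-hour lag.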


It is disturbing how many DNS/domain hosting companies still have zero support for SPF.

It is A, MX, CNAME, and MAYBE a couple of other common record types. But a TXT record? No chance...

I had to move to Route53 just for this functionality.


I had seen massive delays in delivery to Gmail when sending newsletters at my last $dayjob, and I had always assumed it was part of their spam filtering. For example, if it were me and I was getting thousands of nearly-identical emails sent to Gmail addresses, it seems like a good idea to let 1% of them through to see if end users mark them as spam or not. If they all get marked as spam, I don't need to let the rest through.


Likewise. I used a script to recurse over a long list of names, and putting in a small sleep() delay in-between emails ended up fixing the problem immediately.
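
A throttled send loop of the kind described might look roughly like this in Python. The original script is not shown, so this is only a sketch: `send_throttled` is a hypothetical helper, and the `send` and `sleep` callables are passed in so the demo below neither sends mail nor actually waits.

```python
import time


def send_throttled(recipients, send, delay_seconds=1.0, sleep=time.sleep):
    """Call send(recipient) for each recipient, pausing between calls."""
    sent = 0
    for index, recipient in enumerate(recipients):
        if index > 0:
            sleep(delay_seconds)  # pause between messages, not before the first
        send(recipient)
        sent += 1
    return sent


# Demo with stand-ins: record what would be sent and how long we would pause.
outbox, pauses = [], []
count = send_throttled(
    ["a@example.com", "b@example.com", "c@example.com"],
    send=outbox.append,
    delay_seconds=2.0,
    sleep=pauses.append,
)
```

In real use, `send` would wrap whatever actually submits the message, and `delay_seconds` would be tuned by trial, since the receiving side's rate thresholds are not stated in this thread.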


Are you hitting any of their receiving limits? http://support.google.com/a/bin/answer.py?hl=en&answer=1...

Because this is exactly what happens when you do.


Not even close.


i wish some major computer company would release an easy-to-use plug and play server product that would enable consumers to run their own mail servers on cheap hardware, just by toggling a switch

just imagine... no ads, no privacy concerns, complete control.

if they were smart they would include some sort of automatic backup program that runs in the background and saves everything every once in a while, a backup that could be restored with a click of a button... like going back in time ... like a .....


What would happen if I'm on vacation and my apartment loses power, or my ISP-provided DSL modem locks up, or my cat chews on the server's ethernet cable?

If the answer is "you might not receive some email", then that is not an acceptable alternative to hosted email.


Exactly. Or you forget to do backups again and lose 6 months of email...

For 99% of the population, centrally hosted, professionally maintained, email services are vastly superior.

Sure, there will always be those who prefer self-hosting, corporations, the sittin'-inna-tree-with-my-M16 crowd, etc., but it's not really the best option for most.


How come Zimbra seems to be so widely disparaged?

Spam control requires a little more intervention than with Gmail, but I've found it straightforward enough to set up once, then do nothing else for a couple years.


Because Zimbra only supports Linux, is designed for large enterprises, and requires special client-side programs to be installed just to do things like calendar sync.

It is fine if you control the whole ecosystem, but ironically in most scenarios where that is the case you're running Windows Active Directory anyway, so you might as well pay a few pennies for Exchange CALs.


Straightforward? I'd love to know your secret.

Compile this odd package with this patch. Now these three more. Oh they don't compile? That's right, they haven't been maintained for 3 years. Etc, etc.


I created a VPS with CentOS. I think I started with CentOS 5.6 and Zimbra 6.8 Community Version. I last updated it about two years ago, so that it is now CentOS release 5.8 and Zimbra 7.1.1_GA_3196.RHEL5_64.

It intermittently was used by different groups of about 50 people who accessed it using IMAP and the web portal without any problem.

Now all it is doing is continuing to collect email subscriptions from software vendors I was trying out at the time, though some other people may still be using it. I don't think anyone else even knew how to access the administration console, and I haven't logged into the admin console in over a year.

Occasionally I log into the shell, because of a bug I never bothered to address. It slowly collects a lot of temporary files. Speaking of which I should do that now:

`Last login: Sat Sep 1 19:27:28 2012 from __`.

Then I ran:

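    # Delete stale JNA temp files in ten batches, one per leading digit,
    # so that each glob expands to fewer files at a time.
    for i in {0..9}
    do
        # Remove matching files whose status last changed over 30 minutes ago.
        find /tmp/jna${i}*.tmp -cmin +30 -exec rm {} \;
        echo "$i out of 9 $(date)"
    done

To be honest, that is ugly, and I should upgrade. The script took 30 minutes to run, and found 20GB that had accumulated in under 5 months. However, it isn't critical for anyone, and < 3hrs/year is a nice level of admin effort.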

To answer your question: I'm sure not every module is functioning perfectly, and though it hasn't needed much maintenance over the past year, it hasn't been under much load either.


OS X server has that, though I've not used it so can't comment on how easy it is to set up. It seems like it is pretty easy though. The hardware isn't cheap though, of course.


I'm pretty happy with Kerio Connect (MS Exchange-compatible, so it does push email to iphones/etc.) -- if anyone wants this, we're technically signed up as resellers and can sell you software licensed to host on your own servers.

The web client sucks a bit compared to Google, but otherwise it's essentially Exchange, but vastly cheaper and easier to manage.


hMailServer - www.hmailserver.com

Cannot get much more "idiot proof" than that.


Even though I agree with people here saying that you can't expect 100% instantaneous delivery from SMTP, this incident was actually posted on the Google Apps status dashboard ( http://www.google.com/appsstatus#hl=en&v=issue&ts=13... ).

So it's not even like this is some common problem that happens all the time and Google is ignoring. No service is perfect and outages happen.


Ah- there IS a status page. Thank you.


My main gripe against GMail is the visual noise and clutter. There are way too many buttons and gradients and shadows now.

I remember in private beta, its cleanliness was heralded as the second coming, but now they seem to be adding features 90% of users don't need.

I've switched to Outlook.com and haven't looked back. Back to cleanliness and non-intrusive buttons and popups.


I'd say the #1 problem is spam filtering, which for me has gotten worse and worse. Recently I have had PayPal payment notifications going into spam! It's a shame because Gmail used to have incredibly good spam filtering - good enough that I didn't check it, as I knew I could trust it - however now I don't trust it at all.


I'm not sure what you're doing (it certainly depends on what your legitimate email stream looks like), but for me, gmail's spam filters are still incredibly good, probably as close to perfect as I've ever seen.

It's impossible for such a thing to be 100% perfect, of course (even your own eyes will deceive you occasionally!), but gmail's filters are good enough that spam is not an issue for me any longer.


kapilkale, for the price you say you'd pay a month, you could get Google Apps for Business and spend that much per year. And get live support and other goodies.

Have you thought about this yet?


Having used both gmail as an individual and gmail via paid Google Apps for Business, I'm inclined to believe that neither has any advantage when it comes to mail delivery speed, either in terms of declared SLA or real-world difference.

I've seen unexplained lags of the type mentioned here on both about equally. In some cases I've had someone who was on the very same Google Apps for business plan email me, and then send a forward of the original when I reported I had not seen the first mail 10 minutes later, after which I received the 2nd mail immediately and still didn't see the original mail until like 30+ minutes after that.

I never bothered to follow up with Google's support on apps for business with the lag issue because it didn't happen that often (at least not that I noticed) and also I view Internet email as an inherently laggy system (though in practice it is near instant most of the time), so maybe if you do complain they'll do something, but I don't think just switching to the Google Apps for Business plan is an immediate cure for this guy's problem.


Except for the fact that there's no improved performance as a result. You don't get preferential mail queue access, there aren't dedicated mail servers, etc.

That seems to be much more what he's asking for or expecting out of a paid service, not "here, take some money for what I don't currently pay for".


$50 -> $1000 is one order of magnitude.

Not "orders".

Orders of magnitude hyperbole needs to stop.

Why not write "some would pay $1000 a month" when quoting the Graham article? You didn't even show you have users willing to pay $5000 a month for gmail, which would be the "orders of magnitude" you wrote.
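
The arithmetic behind this objection is a base-10 logarithm of the price ratio. A quick Python check, using only the $50, $1000 and $5000 figures from the comment (the helper name `orders_of_magnitude` is made up for the sketch):

```python
import math


def orders_of_magnitude(low, high):
    """Base-10 orders of magnitude separating two positive amounts."""
    return math.log10(high / low)


to_one_thousand = orders_of_magnitude(50, 1000)   # ratio of 20
to_five_thousand = orders_of_magnitude(50, 5000)  # ratio of 100
```

A ratio of 20 works out to about 1.3 orders of magnitude, while a ratio of 100 is exactly 2, which is the smallest jump that could fairly be called "orders" in the plural.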


95% of non-spam email is delivered within 5 minutes of being sent. This number is made up for the purposes of argument, but I think it's fairly accurate. I've administered mail machines for many years.

There are all sorts of reasons why the other 5% doesn't zip along, and some of those reasons are persistent, some are fixable, and some of them are essentially never going to be tracked down. Does Gmail have an internal problem? Maybe, maybe not, but there's not enough data here to find out.

If you want instant communication, use a direct connection under your control. It's still not guaranteed but at least you'll see the progress or lack thereof.


I've had massive delivery delays every so often with gmail and they're a little irritating, sure, but quite frankly email makes no guarantees about delivery time. If you need guaranteed fast delivery maybe email isn't the answer.


Google's servers are incredibly complex. Probably too complex - the more complicated they make their infrastructure (datacenter failover, region failover, bla bla) the more unstable it seems to get.

Google uses a very bureaucratic code commit system that requires sign offs from different people. This process takes a long time, and devs can't move onto the next step until the previous step has been accepted [1]. While this system is awesome for catching the localized bugs (no buffer overflow is going to get past that kind of code review), there is a major tradeoff. A dev can only keep so much state in mind when building architecture. If he is only working on the problem once a week with large time gaps, is he not going to lose track of important pieces of the puzzle?

This is probably the age old problem - if you make something that is too clever for even the creator to fully understand, how are you possibly going to make sure it is bug free? The problem being some delay between Google servers hints at an inter-region datacenter problem. I wonder if anybody at Google even understands the entire failover and interlinked data center system completely?

[1] http://www.splinter.com.au/2012/12/26/behind-enemy-lines-goo...


That link seems to be dead.

I wish I could share with you the pictures of "the big picture" in which every piece of proprietary tech was given its own little circle on a whiteboard and then was connected to everything else which it uses or which uses it.

To say it was huge was an understatement.


Er, except Google's services aren't unstable.

That doesn't mean they're completely problem-free, but nothing is.

[and given the importance of services like gmail to vast swathes of the population, I certainly hope they require many sign-offs for code commits!]


Never had any latency issues or lag in delivery. I wonder if this is some sort of regional issue.


I experience this same issue a couple of times a month as well. I happen to have a few plugins that I use in Gmail (Xobni/Tout/Base etc) so I assumed that might be part of the problem? Are you running Gmail clean or do you have a similar situation?


Running clean.


Gmail seems pretty good to me at least. The worst trouble I've had is a few seconds of delay to receive an email without a refresh. Usually my phone and tablet get it first.


Delivery is usually instantaneous for me, but on occasion I won't get email for 6+ hours. It's a lot easier for me to notice the delay when communicating with co-workers, as the sender often wants to know why I haven't responded to the mail he sent 4 hours ago.


try using Yahoo, Gmail will suddenly feel like a supersonic jet


Anecdotally, I wonder if the reason I don't see this as much as I used to is because the quick email discussions I have happen almost entirely between Gmail users.


The solution is cheap and simple: FAX. It starts coming out on the other end before you're finished putting it in!


I've been using Zoho mail myself for a while, and it's worked out fairly well.


Something else to lament: their complete lack of support for webhooks.


Try Yahoo or Hotmail (now Outlook), then you may get the answer.


"Take my money. I’d pay $50 / month to get reliable service; others would be willing to pay orders of magnitude more."

Should be "order of magnitude more".

Getting really tired of tech writers using "orders of magnitude" hyperbole when it's not really the case.

----

Also don't like him complaining about not receiving an "urgent" email in time. Urgent communications require phone calls.


Well, there are many reasons you could need an urgent email for something that wouldn't be satisfied over the phone. Sending a contract or a statement of work, for example. Sure it could be faxed, if both parties have fax. It could be put on Dropbox, if both parties have Dropbox, etc.

If it's not just a communication but rather an exchange of data, a phone call won't suffice. The author even mentioned that he had called the person, who resent the email 4 times. Obviously a phone call isn't what needed to happen. There's just a lack of good file transfer solutions on the Internet.



