Last Wednesday, our main data center in Dallas suffered a catastrophic power failure. While the inbound ExchangeDefender service went on as expected without skipping a beat, the less redundant services didn’t fare so well – Exchange 2010 was out for about 4 hours, Exchange 2007 for about 6 and various other services between 3 and 12 hours.
At this point it’s Tuesday and I’ve been pulling double shifts since last Wednesday evening working with partners, our partners’ clients, our vendors and everyone in between because I’ve taken this issue quite personally.
I’ve spent nearly my entire adult life building a reliable email business. Call me crazy, but I expect it to be up 100% of the time. That’s what it was designed to do, that’s what it’s built for and that’s how we manage and scale it. This isn’t some sort of a thing where a startup cuts costs here and there and hopes nobody notices – this is a major product in its 7th revision and some of the newer stuff (LiveArchive, outbound routing, apps – web sharing, encryption, etc.) didn’t respond the way I had expected. So I’m fixing it.
We deal with crap every day. Power outages happen a lot more often than you think – not big catastrophic ones but isolated ones – blown power supplies, malfunctioning UPS and battery packs. Hard drives die far more often now than they did 10 years ago while the RAID cards and the amount of data they manage are exponentially higher. It’s not an easy business but it’s a fulfilling business. I would rather have this job than anything else in the world.
Here are a few takeaways.
Positive
The data center staff did an amazing job, especially in as short a time span as they had.
I have by far the best partners on earth. Honestly, the feedback from you guys during this episode is what’s been keeping us awake.
Red Bull & Monster Energy. Personally, Pirelli tires, Ducati and Aprilia.
The few issues that became apparent during this experience are going to be fixed within the next 30 days and then we get back to the domination with the features.
Personally, I learned a lot from our partners and just how well our service is received out there – it’s far more positive than even I thought but then again, people always bring me problems so I definitely had a wrong impression. Definitely makes me want to work harder.
Negative
Assholes. We all have asshole clients but you’d think people would be smarter than to try to kick someone while they are down and while they are trying to help them.
Irony. This was caused by a power failure in a piece of equipment that is supposed to switch the power from the utility to power generators.
Two Big Lessons: Shedding and Perspective
Shedding is good. This is particularly true for me as well as for many of you that have been in touch with me – in the grand scheme of things, a few hours is not a catastrophe – not to marginalize it at all but let’s face it, typical hardware outages last far longer. Compared to other big cloud services that are riddled with privacy concerns, questionable financing/management, days’ worth of outages and eventual data loss, for the most part all this did was reinforce just how important redundancy and failover and proper training are. Yet, it seems that the hardest hit folks are micro clients with a few seats here and there whose businesses apparently barely made it through the few hours without email. Here are some comments:
“Frankly I don’t want a client that is ready to jump ship on one outage, just had to share.”
“Ray of sunshine: Lost a 3 seat client that has been on my to-fire list for months.”
Perspective is good. Every single day I have conversations with partners who are scared of the Microsoft/Amazon/Google Apps business model. They don’t take it too kindly when I tell them to position the comparable products against it and if you lose to Microsoft or Amazon you probably don’t want that type of a client.
I’ll let you imagine the fireball response I get to that one.
But here is the perspective. If you Google for the kinds of outages and downtimes and other horror stories you get with Microsoft/Amazon/Google, you’d be insane to accept that kind of a compromise. But there are people that will – and you really don’t want them as your clients, trust me.
The initial reaction to any outage is – what happened? Can we switch to something more reliable? I won’t lie, I thought the same thing last Wednesday until I realized that the reason we based our core operations in Dallas is that this is by far the best data center in the world. And while the initial reaction to downtime is always going to be tough, since Wednesday the feedback has been good and with the changes we are making our partners will be more successful.
Some will leave. That’s inevitable. And I’ve even been forwarded some folks celebrating the event on the newsgroups. I understand, enjoy it.
But what really matters at the end of the day, the big picture, the perspective – is that a whole lot of stuff rides on email and that this is a great business to be in. While the demand for the cheaper more compromised cut down product will be there and will be appealing to those that don’t know the risks, more often than not, people will choose a premium solution – which is good for us and good for our partners. You have our ongoing commitment to make the most scalable and most reliable offering out there and I look forward to bringing it to you.
…
P.S. Since last Wednesday I have been working with partners and partners’ clients, and I’m pretty sure that I’m getting an ear blister from being on the phone all day and night. To all those of you who have spoken to me and those that have sent encouraging emails, I can’t tell you how much it means to me. Everyone from our biggest partners to the smallest partners to even the competitors that have gone through this – I appreciate the kind words and keep on forwarding them to my team. Absolutely everyone here cares about this stuff and what we work on every day. My message inside my company is that the bits and pieces of what we do are inconsequential to you – it’s the service that matters and whenever we make our partners win, we win. There really are times when I wish I didn’t care – wish I could shut down my laptop and let my management just deal with the problems. But my management, their staff and everyone involved have for better or worse sold themselves to you as your data center backoffice and we don’t quit.
To everyone that faced any bit of inconvenience as a result of all this – I am truly sorry. As you can tell from this blog post, I know how it feels. Stay strong, stay focused and remember that this is the difference. Most people in tough situations quit, switch, look at the greener grass on the other side and... well, eventually you come to that sad realization that the only consistent thing in all your failures is you. The alternative is to just work harder – turn those negatives into positives, learn from the mistakes and show that work ethic trumps any inconveniences and “shit happens” moments that are just a part of life.
Here is the comment from one of my partners that literally had me smiling for hours this weekend. His client complained about the outage and the ABP muscle flexed:
Client: “Dude, WTF, it’s been two hours!”
Partner: “Yeah, and remember that $6 thing you wanted me to try and beat because you thought our stuff was too expensive? Well, if you think you’re crippled now what do you think will happen when your production system collapses without a managed backup or you finally get that audit?”
The pimp turned an outage complaint around into a $16,000 recurring monthly managed services deal. My response: “Sounds like you just earned your Ferrari payment!”
ABP.