We have a great product, awesome, supportive customers, and an incredible team. As difficult as the last week has been, these things still hold. In fact, as a result of last week, our product offering has become more bullet-proof, we’ve discovered how incredibly supportive and empathetic our customers are, and our small team of developers has joined a tiny worldwide population of geeks well-versed in MySQL InnoDB data recovery.
Admittedly, over the last few months we’ve moved too quickly, accepting higher risk in exchange for greater innovation and forward strides with our product. We’ve worked hard to push the maximum number of changes and the most value to as many customers as possible. In fact, our release manifests show that over the last four months, we’ve pushed out 20 releases of our product totaling close to 400 discrete enhancements or fixes. Those changes were performed by a team of two feature developers and a single developer solely dedicated to automated system testing – and they’ve done an incredible job of it.
But increasing velocity can increase risk, and conversely, reducing risk often involves slowing down. Slowing down means spending more precious startup time, of which there isn’t an endless supply. On one end of the continuum there’s a failed mess rushed to market, and on the other a near-perfect half-baked nada. We’re there in the middle, trying to balance the risks and benefits.
One way to move quickly with a small team is to give it full access to all server resources. We didn’t do that – one person has been responsible for building out and maintaining the back-end systems, 47 servers at the moment. Complete access to the production systems is limited, servers are firewalled, and accounts are well locked-down. But like any small and agile IT team, every person in the group still has enough access to wreak havoc. That’s what happened last week.
We’ve developed a suite of automated system tests that, amongst other things, verify the integrity of our API to ensure that we don’t make any inadvertent backwards-incompatible changes, measure performance by entry-point, and detect unexpected increases in SQL query counts. These tests require removal of pre-existing data from the database, a step that is scripted as part of the test run. There are two hosts that we run these tests against, but a new employee inadvertently ran them against our production database last Monday night (ironically, apparently national Roller Coaster Day), wiping all data on both the primary server and its backup slave. In addition, through a strange sequence of events, restoring from our only good backup would have left us missing around 8 days of data. Once reality set in, I went through a series of emotions and reactions that can’t really be described and that I’d prefer never to experience again.
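For readers who run similarly destructive test suites, here is a minimal sketch of one possible guard: refuse to wipe anything unless the target host is on an explicit allowlist. This is an illustration only, not a description of the safeguards we actually put in place. The host names, the `ALLOWED_TEST_HOSTS` set, and the `assert_safe_to_truncate` function are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: the only hosts the destructive suite may wipe.
ALLOWED_TEST_HOSTS = frozenset({"test-db-1.internal", "test-db-2.internal"})


class UnsafeTargetError(RuntimeError):
    """Raised when a destructive test run targets a non-allowlisted host."""


def assert_safe_to_truncate(database_url, allowed_hosts=ALLOWED_TEST_HOSTS):
    """Return the target host if it is allowlisted, otherwise raise.

    The check fails closed: a URL with no parseable host is rejected too.
    """
    host = urlparse(database_url).hostname
    if host is None or host not in allowed_hosts:
        raise UnsafeTargetError(
            f"refusing to remove data on {host!r}: not an allowlisted test host"
        )
    return host
```

A test runner would call `assert_safe_to_truncate(url)` before its data-removal step and abort on the exception. An allowlist is used here rather than a blocklist of production hosts, so that a new or mistyped host is rejected by default.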
After realizing the severity of the situation, we began calling on external expertise for help. Many sleepless hours and days later, we successfully recovered the truncated data with the help of two external teams, one including the author of the only tool that can actually do this kind of arcane recovery.
Yes, this employee is still with us, and here’s why: when exceptions like this occur, what’s important is how we react to the crisis, how we take accountability, and how hard we drive to quickly resolve things in the best way possible for our customers. I’m incredibly impressed with how this individual reacted throughout, and my theory is that they’ll become one of our legendary stars in years to come.
I was even more impressed with how the entire tech team became hell-bent on fixing things. We couldn’t have slept less. This sentiment was later echoed by a number of the other residents at our open floor-plan office (Founder’s Co-Op headquarters). Both Keith and I heard remarks about what a great team we have, many coming from other co-founders.
With that, there are a number of folks who deserve thanks and recognition for going above and beyond to help us through this colossal recovery last week:
- The BigDoor team
- Collin Watson, diver-in extraordinaire and speaker of tongues
- Lee McFadden, API entry-point and big fan of NoSQL technologies
- Brian Oldfield, insomniac automator
- Harley Holt, unflappable fire-eater
- Keith Smith, megaphone and seer of forests
- Ring Nishioka, Roy Schmidt and the rest of the BigDoor team
- The BlueGecko team: Sarah Novotny, Mike Hamrick, Patrick Galbraith and Jonathan Nicol, thank you for your on-call DBA and operational expertise
- The Percona team: Aleksandr Kuzminsky and Baron Schwartz, thank you for your on-call data recovery expertise
- Geoffrey Nuval from DevHub/EvoMedia, thank you for being a great customer, partner and for your continued confidence and support
- Jim Banister from SpectrumDNA, thanks for being a great customer and for your confidence and words of support
- Elvis, for reasons that are self-evident.
I hope that this small bit of color helps to convey how seriously we take our customers and our uptime here at BigDoor.
There are a number of things we’re doing to prevent this from happening again. It should never have happened to begin with, and as the CTO it’s my responsibility to make sure we take the time to implement proper safeguards. For that failure, my most sincere apologies go out to our customers and theirs, as well as to our board and investors. We’ll make it up to you in the coming releases of our kick-ass platform. In fact, the same day that we called the recovery complete, the team worked late into the night in order to do a feature release (the Badge-O-matic, amongst other things) – so we’re already delivering on that promise.
–Jeff