The BigDoor Blog

BigDoor API service has been restored

We’ve just recovered from an extremely painful and embarrassing outage of our API.  Because we love to embrace transparency, I figured I’d post the email that I sent to our customers earlier today.

——————

Update: BigDoor API is up!  Please resume your usage.

Please note that we are currently missing historical transaction detail and leaderboards.  All of the data has been restored, but due to its volume it will take a while for it to be loaded back into our production system.  Transaction detail will be loaded over the next couple days and leaderboards should become available during that same period of time.

All currency balances, endusers, levels, awards, goods, and all configuration data (defined transactions, currencies, etc.) are working and fully updated as of the time of the outage.  Please let us know right away if you see anything that doesn’t seem right – but we have thoroughly tested and scrutinized the database and believe all is good.

I want to take this chance to once again offer up the sincerest of apologies for this unbelievably painful outage.  There is absolutely no excuse for this kind of disruption in service, but I think we owe each of you an explanation of the root cause.

The outage was caused by a test script that was meant to refresh a test database but was inadvertently run against our production systems.  This script truncated 79 gigabytes of data in a matter of seconds.  Truncate operations on InnoDB tables are meant to be final, so they are incredibly difficult to undo.  We have a master/slave database configuration, but unfortunately our slave database dutifully replicated the truncate and wiped all of its data as well.  Then, through an incredibly odd series of events, it turned out that our last good database snapshot was from 7/21, and through another strange series of events we were missing 16 days of binary logs, which meant we weren’t able to just reload our old snapshot and replay the bin logs.
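To make that failure mode concrete, here is a minimal sketch of the kind of guard that can stop a test-refresh script from touching production.  It is an illustration only, not the script involved in the outage: the host names, the `_test` database suffix, and every function and table name below are hypothetical assumptions.

```python
# Hypothetical allowlist: the only hosts a test-refresh script may touch.
ALLOWED_TEST_HOSTS = {"localhost", "127.0.0.1", "test-db.internal"}


class UnsafeTargetError(RuntimeError):
    """Raised when a destructive script is pointed at a non-test target."""


def assert_safe_target(host: str, database: str) -> None:
    """Refuse to continue unless both host and database look like test targets."""
    if host.lower() not in ALLOWED_TEST_HOSTS:
        raise UnsafeTargetError(f"refusing to run against host {host!r}")
    # Assumed naming convention: test databases end in "_test".
    if not database.lower().endswith("_test"):
        raise UnsafeTargetError(f"refusing to run against database {database!r}")


def build_truncate_statements(host: str, database: str, tables: list[str]) -> list[str]:
    """Return TRUNCATE statements, but only after the target passes the guard."""
    assert_safe_target(host, database)
    return [f"TRUNCATE TABLE `{database}`.`{table}`" for table in tables]
```

The check runs before any statement is built, so a misconfigured host or database name fails loudly instead of reaching the server.  An allowlist is used rather than a blocklist so that an unknown host is rejected by default.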

This left us in a situation where we could either attempt to recover a database with missing data, or take on the nearly impossible task of recovering a 79 GB database with 65 highly relational tables after a truncate.  We quickly began pursuing multiple tracks, each holding varying degrees of badness.  We engaged Percona and Blue Gecko (two of the leading experts in data recovery) to assist us, and large parts of both of their teams have been working non-stop alongside our BigDoor team to recover the lost data.

The resulting chaos is encapsulated in the series of email updates below where I continued to provide overly optimistic timeline estimates.  My apologies for sucking so bad at that – each time we made a guess at a timeframe it was done with sincerity and what we believed was a conservative approach.  Obviously I was wrong in almost every case.

We will be diving into the postmortem of this outage immediately, but for now suffice it to say that we sucked badly and we feel like fucking idiots because of it.  The good news is that we have some of the smartest and most dedicated people I’ve ever had the pleasure of knowing here at BigDoor, so we will take the appropriate steps to be sure this doesn’t happen again.  I can’t guarantee that we won’t ever screw up again, but I can promise that we will have a much more effective net in place going forward to mitigate issues such as this.  We’ll get better, I promise.

Thank you for your patience and support through this ordeal.  Please contact me if you have any questions.

Keith
