GuildPortal Dev Blog

Updates from Aaron Lewis, GuildPortal Code Monkey

Posted 1/4/2013 4:14 PM by Aaron Lewis.

From 1/1/2013 to 1/3/2013 at 10:14 AM (Mountain), GuildPortal services went offline. Here's what went down, in sequence. The cause of the problem had its start on 12/16/2012, so I'll begin there:

12/16/2012 to 12/31/2012: The server backups, which are stored on a network share, failed to execute for two weeks straight. When backups fail, our server provider's management tools are supposed to alert them so that whatever the problem is can be fixed. That didn't happen. So for two weeks, GuildPortal was flying without a backup, and nobody knew.
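
In hindsight, catching this wouldn't have taken anything fancy. A freshness check along these lines against SQL Server's backup history in msdb would have raised a flag on day one (this is a sketch of the idea, not our exact setup, and the database name is just for illustration):

    -- Rough sketch: complain if no full backup has finished in the last 24 hours.
    -- Type 'D' means a full database backup; 'GuildPortal' is an illustrative name.
    IF NOT EXISTS (
        SELECT 1
        FROM msdb.dbo.backupset
        WHERE database_name = N'GuildPortal'
          AND type = 'D'
          AND backup_finish_date > DATEADD(HOUR, -24, GETDATE())
    )
        RAISERROR (N'No full backup of GuildPortal in the last 24 hours.', 16, 1);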

1/1/2013: The drive on the database server that stores the full-text index fails, and the site goes down. When there is a drive/controller failure, our server provider's system (again) is supposed to alert them. The first tech I spoke with later acknowledged this, but said that they weren't using their old notification system anymore, since the company that acquired them was going to have them use their newer system. What happens in between? No notifications or alerts of any kind.
 
1/1/2013, 1:30 PM: Sandy finds out the site's been down all night when she does her usual check to make sure things are running. I get on the phone with our server provider while she lets everyone know on our Facebook page that we're on it. I am connected to someone in technical support who is genuinely helpful, but who has very little SQL knowledge.

I put up an "offline" page on the GuildPortal site, informing everyone that we'll be back, and providing a link to our Facebook page where we begin posting updates as we have them, and answering questions.
 
Over the next 7 hours, the tech repeatedly attempts to get ahold of a DBA to help with the problem. We both breathe a little easier when we get a response back in the form of an IM from the regular DBA, at home. A few IMs later, we have something to try, and the tech and I give it a go, brimming with optimism. I post on Facebook that everything should be up and running in 30 minutes or so.

1/1/2013, 9:30 PM: It's taking far too long. I suspect something is wrong, so I cancel the operation. It takes 4 hours to cancel, even though it had only been running for 1 (canceling a SQL operation means rolling back everything it has already done, which can take far longer than the work itself). The culprit, it turns out, was the database file itself. When the N drive failed, it left the primary database file and the associated log file in a totally unstable (and, it now appears, unrecoverable) state. The DBA we were IMing is now incommunicado.

No problem, we thought, because I'd been paying for the extended SQL support package faithfully for all these years. That meant he could hop on the phone and get the revered On-Call SQL Team to fly in on an epic mount, day or night. While we wait for them to respond to his initial pages and phone calls and IMs, we try a couple of different restore strategies. At the time we didn't know the state of the main database file, so again, we were optimistic.

I again, stupidly, post an update on Facebook saying that our latest attempt resulted in a success and that the site would be up any moment. I end up staring at a spinning ball with the words "query executing" next to it for the next several hours.

1/2/2013, 2:03 AM: The restore failed. All phone calls, pages, and IMs to the On-Call SQL Team have been totally ignored. The tech is frustrated, apologetic, and (if I read it right) a little embarrassed for his company. I post apologies and stop giving ETAs on Facebook, but continue communicating with everyone, not wanting you guys to think I'd snuggled up to a pillow and said "heck with them."

It's now 13 hours after the initial phone call, and the odds of us getting a DBA involved at all before the regular morning shift are looking slimmer with every passing minute. Dread sinks in (forgive the melodrama).

The tech and I part on the phone, since he can only really sit there and listen to me breathe while I try different ways to restore the database. Attaching, single-file attaching, standard restore, restore with full-text indexing belayed, different recovery models, etc. After we hang up, he continues to attempt to get ahold of the On-Call SQL Team until his shift ends, to no avail. My efforts, of course, were equally doomed.
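
For the technically curious, those attempts looked roughly like the following. This is a sketch, not our exact commands; the drive letters, paths, and file names are purely illustrative, and none of it got anywhere because of the state the data file was in.

    -- Standard restore from a full backup file
    RESTORE DATABASE GuildPortal
        FROM DISK = N'\\backup-share\sql\GuildPortal_full.bak'
        WITH REPLACE, RECOVERY;

    -- Re-attach the existing data and log files as they sit on disk
    CREATE DATABASE GuildPortal
        ON (FILENAME = N'D:\Data\GuildPortal.mdf'),
           (FILENAME = N'E:\Logs\GuildPortal_log.ldf')
        FOR ATTACH;

    -- Single-file attach: attach only the data file and let SQL Server rebuild the log
    CREATE DATABASE GuildPortal
        ON (FILENAME = N'D:\Data\GuildPortal.mdf')
        FOR ATTACH_REBUILD_LOG;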

1/2/2013, 5:08 AM: I've tried everything I could think of. At this point, I post updates on Facebook and wait for a herd of DBAs to show up all bright-eyed for work at our server management company's HQ (they're two hours ahead of me).

1/2/2013, 9:09 AM: I'm informed that a DBA is working to get the server fixed. I remain cheerful and optimistic, since we at least have a DBA on the case now, and (I think) I'll be able to give you guys a reasonably accurate ETA before too long. I post to Facebook, with a smiley face even.

I thank the DBA very much for taking it on and ask very politely and delicately (I'm careful around DBAs) that, if at any time he has even the smallest update I could give you guys, he just shoot a real quick e-mail my way. KK? Tks u!

I head back to Facebook, posting and chatting it up to let you all know I'm staying with it until it's fixed, and also because it helps me stay awake. We all wait for the DBA to work his magic, and send us little updates that I can relay to you.

1/2/2013, 1:50 PM: After some attempts, the DBA comes back with bad news. The database file is corrupt. But he has a plan that he has used before, and he is confident it will work. I post his e-mail, verbatim, on Facebook. Hope glimmers once more. Like anime eyes.

To pass the time while this plan is set in motion, I post some polls to Facebook. We all again wait, collectively, for the Awesomeness to happen. One of my guild leaders posts that he's a DBA too, and gives me a tip to pass along to Johnny (the DBA working our case) that might save a lot of time.

So I call up and ask to speak with our server management company's DBA, thinking he'd appreciate the tip. Well, as soon as the guy who answered the phone IMed him that I had some info that might "help" him, he wouldn't take my call. Busy guy, I guess! Must have already thought of it anyway, right? No hard feelings...

1/2/2013, 4:02 PM: After waiting for around two hours with no update, and my guild leaders becoming understandably more anxious, I lose my patience, Samuel L. Jackson style. In a totally unreasonable rage, I write the following, unthinkably terrible thing to the DBA. The following is the actual body of the e-mail I sent. Parents, you might want to send your children out of the room. Here it is:

"Hiya Johnny! How's it going? Look like it's going to work?"

1/2/2013, 5:32 PM: The DBA replies with open hostility, accusing my "team" (I have a team?!) of detaching the database in a way that corrupted it, when in fact the loss of the N drive had finished it off long before. The real villain of the story was the failure of the entire alerting process, going all the way back to the backups that had stopped working weeks earlier. None of which I had any control over.

He closes by inviting me and my "team" to do it ourselves if we think we can do better.

At that point, I swear, the world turned upside down. It wasn't a proud moment for me, but I clicked the reply button and typed up an e-mail that conveyed some of the anger that had been building up, but more of the hurt and disbelief at what was going on.

Anyway, as far as I can tell, immediately after reading that e-mail, the DBA stopped any running restore, disconnected from his session, and walked away. I wouldn't hear anything from our hardware service provider until the following morning. The worst part is that he knew full well he wasn't just punishing me, but all of you as well.

After I posted what had happened to Facebook, many, many guild leaders (you guys, yay!) basically raided the hardware service provider's Facebook page. Immediately upon seeing it (once they got in), they responded, saying they'd make fixing it a "top priority." While I wasn't contacted until a couple of hours after that, I am sure that you guys proving you were real, and not to be trifled with, had everything to do with us finally getting some real results, and soon.

To my horror, the best that could be done was a restore from the last backup that had succeeded, on 12/16/2012. So any new data from then until the site went back online would be lost. Though catastrophic and totally unacceptable in my (and, I'm sure, your) eyes, I had to give the go-ahead. There was just no other option, at least none that they could or would provide.

1/3/2013, 10:14 AM: The site comes back online, with data restored from 12/16/2012.

This will not happen again.

For my part, even if we're paying our provider extra for support packages that include monitoring, alerts, and reliable backups, I will not take it for granted that anything, whether it's something I have control over or not, is working as it should be. I will check on backups and perform many of the other IT-type tasks that we have been relying on someone else to handle.
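
Concretely, "checking on backups" will mean more than confirming a file landed on the share. SQL Server can verify that a backup file is actually restorable without restoring it, and that's the kind of check I'll be running myself from now on (the path here is just an example):

    -- Confirm the backup file is complete and readable without actually restoring it.
    -- Add WITH CHECKSUM if the backup was taken with checksums enabled.
    RESTORE VERIFYONLY
        FROM DISK = N'\\backup-share\sql\GuildPortal_full.bak';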

I'll invest (as soon as I can -- this event has cost us dearly, financially, and GuildPortal is already "in the red" because of the economy and lack of new games that really draw new players into the MMO world) in more hardware that will add more layers of fault tolerance to all tiers of the GuildPortal service.

Sandy has been diligently refunding all subscriptions for all new sites created after 12/16/2012 (since they're not there anymore). She is nearly done as I write this update. If you fall into that category of site, and you haven't seen a refund come through PayPal yet, please give her another day to finish up. You do not need to send in a support ticket. We're not waiting for you to contact us to refund you; we're just doing it.

I'm extremely sorry this happened!

This has been, without any question, the worst downtime event in GuildPortal history. It's the worst data loss event for sure (there was only one other -- the result of me writing a bad trigger that deleted a bunch of old shouts it shouldn't have).

It would be tempting to blame our hardware service provider for it all, but deflecting is their game, not mine. The short truth of it is that I should have been double-checking that they were doing their part, whether I was paying more for alerting and reliability services or not. That should have been something I considered part of my job before this all happened, and it most certainly will be, moving forward.

To all of you who lost data, I cannot begin to convey how much I feel your rage, anger, disbelief, and loss over this. You're not simply punching in letters and numbers to practice your typing; you're building community. To have over two weeks of that just taken away as if it never happened is unacceptable. I was awake from 8:30 AM on 1/1/2013 until the site went back online at 10:14 AM on 1/3/2013, because I wanted to keep providing updates, or at least keep the lines of communication open while we were down, to let you know that your guild is very important to us, whether it's a paid site or not. I didn't choose the guild web hosting vertical just because there was nothing else filling it back when we started; I love the idea of providing and enhancing tools for people to build and personalize their communities online, and I just so happen(ed) to be a gamer.

I feel terrible for disappointing you. While the financial hit to us is tremendous, GuildPortal will survive and be here for your guild for years to come. I'm currently looking for another job, as, like I said before, we're in the red, and have been for some time. Once I have one, I'll still be able to provide support and some feature enhancements; I'll just be limited to a couple hours a day or so.

Thank you!

If you are going to leave over this, or have already left, thank you for making your home with us for however long you did. Thank you to all you guild leaders and members who were understanding and supportive on our Facebook page. Thank you to those who are staying, and a promise: it won't happen again.

As always, thank you for choosing GuildPortal!