Yesterday was sucky

Let me first apologize for the downtime some Articulate Online users experienced yesterday. I realize this was a tremendous inconvenience for everyone affected, and I am really sorry this happened.

At about 5:30 AM PST yesterday I began to receive lots of emails, phone calls, and instant messages telling me that Articulate Online was down. Over the next several hours I frantically tried to figure out what caused the outage, who was affected, and how long it would take to fix. The problem caught us completely by surprise, but I want to take some time to explain what happened and how we will prevent it from happening in the future.

In case you do not know, I am the Quality Assurance Project Lead for Articulate Online, and it is my responsibility to get Articulate Online back up and running in events like this.

So what did happen yesterday?

Early yesterday morning our datacenter brought a new server online whose MAC/IP address conflicted with our primary Articulate Online IP. As a result, traffic that was being routed to us never made it past the network switch.
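
For the curious, this kind of conflict shows up at the ARP layer: two machines answer the same "who has this IP?" question with different MAC addresses, so the switch ends up sending traffic to the wrong place. Here is a minimal sketch of how you could check for a duplicate from another host on the same network; it assumes the third-party scapy package, root privileges, and a documentation-range placeholder address rather than our real IP:

    # Rough sketch: see whether more than one MAC answers ARP for a single IP.
    # Assumes the third-party "scapy" package and root privileges.
    # 192.0.2.10 is a documentation-range placeholder, not our real address.
    from scapy.all import ARP, Ether, srp

    def macs_claiming(ip, timeout=2):
        # Broadcast an ARP "who-has" request for the IP and collect every reply.
        answered, _ = srp(
            Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=ip),
            timeout=timeout,
            verbose=False,
        )
        return {reply.hwsrc for _, reply in answered}

    macs = macs_claiming("192.0.2.10")
    if len(macs) > 1:
        print("Conflict: multiple MACs claim this IP:", macs)
    else:
        print("Single (or no) responder:", macs)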

We noticed the issue very early in the morning and quickly contacted our datacenter to have it resolved. Unfortunately, it took the datacenter much longer to identify the problem than we could have imagined, and once it was identified, the fix also took much longer than anticipated.

We have used this datacenter in the past and have been satisfied with their level of service, but yesterday's problems cost them some of our confidence in their ability to handle these issues quickly and effectively. I spent several hours on the phone with our datacenter yesterday, and I am truly disappointed the fix took as long as it did.

So why did the problem take so long to fix?

When the issue was first reported we assumed it would take only about 20 minutes to address, and we became increasingly frustrated as it dragged on. Our datacenter originally diagnosed it as a problem with a network driver on one of our servers, but updating the driver made no difference. Once the driver was ruled out, it took the datacenter a considerable amount of time to identify the real issue, the MAC/IP address conflict, and to find the server that caused it. After that server was identified and taken offline, it took even more time to bring our server back online.

During the downtime we contemplated switching to our backup systems, as well as redirecting traffic at the DNS level to a status page to tell our customers what was going on. We initially decided against making any DNS changes because they can take hours to take effect, so such a change would only have prolonged the downtime. We also decided against failing over to our mirrored systems because we thought we were only 20 minutes away from a fix, and getting the mirrored systems running would have taken longer than that and required DNS/IP changes of its own.
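
To give a sense of the numbers: how long a DNS change takes to reach everyone is bounded below by the TTL on the record, because resolvers are allowed to keep serving the old answer until that many seconds have passed. Here is a quick sketch for checking that TTL, assuming the third-party dnspython package and a placeholder hostname rather than our real one:

    # Rough sketch: show how long resolvers may keep caching the current A record.
    # Assumes the third-party "dnspython" package; "example.com" is a placeholder.
    import dns.resolver

    answer = dns.resolver.resolve("example.com", "A")
    print("Current A records:", [r.address for r in answer])
    print("TTL (seconds the old answer may stay cached):", answer.rrset.ttl)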

When it became apparent that the fix was going to take longer than anticipated, we did start updating our DNS records to redirect Articulate Online traffic to a status page that would post updates. However, just as we were getting the status pages up, the problem was resolved.

In retrospect, we could have handled this better on our end by failing over to our backup system sooner. We are working on improving our internal process so that switching to our backup is easier and quicker.

What is being done to ensure this will not happen again?

Making sure that this does not happen again is a priority for us. We have had several meetings to discuss what can be done to prevent something similar from happening in the future.

Here are the steps that we are going to take over the next month to prevent this from happening again:

1. New datacenter – We are currently working on migrating our existing servers to a datacenter that provides a more reliable infrastructure and support system. Rackspace is a Tier 1 datacenter and is widely recognized as one of the best. We are committed to completing the migration within the next 30 days.

2. Improved communication – Today we created a new site (http://heartbeat.articulate.com/) that gives real-time status updates on the health of Articulate Online and our other resources. We have also included an RSS feed for the page so that you can subscribe to it and see any changes to the health of AO (see the sketch after this list).

3. Improved failover process – I just finished a couple-hour meeting with Articulate employees where we discussed ideas for improving the failover process. In the coming weeks we will make changes so that switching to our backup systems is easier and quicker.
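
As mentioned in item 2, here is a rough sketch of how you could watch the heartbeat feed from a script instead of a feed reader. It assumes the third-party feedparser package, and the feed URL below is a placeholder; use whatever feed link the heartbeat page actually publishes:

    # Rough sketch: print the latest entries from the heartbeat RSS feed.
    # Assumes the third-party "feedparser" package; FEED_URL is a placeholder,
    # so substitute the feed link published on http://heartbeat.articulate.com/.
    import feedparser

    FEED_URL = "http://heartbeat.articulate.com/"  # replace with the real feed link

    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries[:5]:
        print(entry.get("published", "no date"), "-", entry.title)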

Once again, I am really sorry about yesterday; it was awful for all parties involved. If you have any feedback, I would be more than happy to listen, so feel free to post any comments you may have.

There is also a forum thread where we describe what took place yesterday and cover it all in less technical detail:
Articulate Online back online. Downtime explained.

-Dave


2 Responses to “Yesterday was sucky”

  1. Anonymous

    Hi Dave,

    I really appreciate this kind of transparency and effort in communication. Perhaps other companies will follow your lead.  

  2. Dave

    Thanks Sam.  
