Server Outage on Feb. 10, 2015

Posted: February 11, 2015
By: Jim Donovan
Yesterday afternoon, We Think Solutions (WTS) experienced a server outage at one of our data centres from approximately 2pm to 5pm Eastern time. 
Fortunately, we recognized the problem immediately and implemented the failover strategy we had devised for this situation. Most of our affected clients' websites and email were up and running within 45 minutes.
Unfortunately, a small number of our clients' websites were not protected by our failover strategy and those websites were down for the entire 3-hour period.
We have learned from this experience and, in an effort to be transparent, here are the details of what transpired, along with some background.
Our Dedicated Servers
In order to provide our clients with the optimal level of web hosting, we lease and operate several Dedicated Servers at various data centre locations across North America.  As the name implies, these servers are 100% dedicated to We Think Solutions and not shared with others.  Data centres offer levels of redundancy, security, and reliability that are simply not achievable by running "in-house" servers.
By design, we operate only one Dedicated Server at each data centre.  This is to provide geographic redundancy in the event of a major catastrophe such as 9/11, an earthquake, or a hurricane.  One of our Dedicated Servers is utilized exclusively as a backup server for emergency situations.  The others are primary servers hosting client websites.
For the most part, our dedicated servers have proven to be very reliable.  In 10 years of business we have experienced 5 "outages", typically lasting 2 or 3 hours.
After experiencing 2 such outages in 2013, we lost confidence in one of our data centre providers, terminated our relationship with them, and replaced them with iWeb, which is where yesterday's outage occurred.  Prior to yesterday's event we had not experienced a failure of any kind for over 13 months.  With annual revenues of over $45 million, iWeb is one of Canada's leading hosting companies and is ranked #88 in Branham's Top 250 Canadian IT companies for 2014.
In any event, our past experiences have taught us that no data centre is immune from failure.  This is why we created our own failover system in 2014 to protect our clients in the event of an outage at one of our data centres.
Our Failover System
On a weekly basis we prepare a full backup of all of our primary servers and transfer those files to our dedicated backup server.  In the event of an outage at one of our primary data centres, we quickly "re-point" the address of the affected "WTS name server" to the backup server.  This change typically propagates throughout the Internet within minutes and, most importantly, your website is up and running again.  Additionally, if you are running email from our servers, your email service will be back up and running from our backup server within a similar time frame.  It is important to note that if you are using IMAP to access your email, you will not have full access to your email history during the outage, since the backup server does not contain a perpetual backup of your email.  Once the outage clears, we "re-point" the name server back to its original location, synchronize the email accounts, and everything reverts to normal.
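The "re-point" step can be pictured with a small sketch. It is an illustrative model only, not the tooling actually used: the class name, the hostname, and the IP addresses (taken from the ranges reserved for documentation) are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class NameServerRecord:
    """Hypothetical model of the address a name server hostname points at."""
    hostname: str
    primary_ip: str   # address of the primary dedicated server (assumed value)
    backup_ip: str    # address of the dedicated backup server (assumed value)
    active_ip: str = field(init=False)

    def __post_init__(self) -> None:
        # Under normal conditions the name server points at the primary server.
        self.active_ip = self.primary_ip

    def fail_over(self) -> None:
        # During an outage, "re-point" the name server to the backup server.
        self.active_ip = self.backup_ip

    def restore(self) -> None:
        # Once the outage clears, point back to the original location.
        self.active_ip = self.primary_ip


record = NameServerRecord("ns.example.test", "192.0.2.10", "198.51.100.20")
record.fail_over()
print(record.active_ip)  # 198.51.100.20
record.restore()
print(record.active_ip)  # 192.0.2.10
```

In practice the change is made at the DNS provider or registrar rather than in a local object; the sketch only shows the two states and the order of the steps.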
Yesterday's Outage
Our technical staff was alerted to the outage yesterday at 1:55 pm but it was not acknowledged by iWeb until approximately 2:05 pm.  Our senior developer, Jojo, was the first user to publicly alert iWeb via Twitter about the problem.  We decided to wait a few minutes for an update from iWeb before implementing the failover strategy (in case it was a minor issue).  At approximately 2:20 pm, iWeb posted an update stating that the ETA was unknown for this issue so we immediately implemented our failover strategy.  Note that this issue only affected our client websites that reside on the iWeb dedicated server (with name server WETHINKSERVER.COM).
By 2:40 pm, we had completed all of the steps in our failover strategy and most of the affected sites were up and running.  It should be noted that changes of this nature can take a few minutes to propagate throughout the Internet and in some rare instances can take several hours to propagate to specific users.  This is the nature of DNS, which is beyond our control.  It should also be noted that this was the first time that our failover strategy had been deployed, so we proceeded very carefully in order to get it right rather than to get it done as fast as possible.
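To illustrate why propagation takes minutes for most users and longer for a few: a resolver that cached the old address just before the change may keep serving it until the record's time-to-live (TTL) runs out. The sketch below computes that upper bound. The TTL values are assumptions chosen for illustration, since the actual TTL on our records is not stated here, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def latest_stale_answer(change_time: datetime, ttl_seconds: int) -> datetime:
    """Latest moment a TTL-respecting resolver could still serve the old address.

    Worst case: the resolver cached the old record an instant before the change,
    so it keeps that answer for one full TTL.
    """
    return change_time + timedelta(seconds=ttl_seconds)


change = datetime(2015, 2, 10, 14, 40)      # failover steps completed at about 2:40 pm
print(latest_stale_answer(change, 300))     # assumed 5-minute TTL  -> 2015-02-10 14:45:00
print(latest_stale_answer(change, 14400))   # assumed 4-hour TTL    -> 2015-02-10 18:40:00
```

Resolvers that hold records longer than the published TTL fall outside this bound, which is one way a change can take several hours to reach specific users.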
At approximately 5 pm, our iWeb server was back online and by approximately 7:30 pm we received the "all clear" signal from iWeb indicating that all systems were back to normal.  We immediately pointed the name server for "WETHINKSERVER.COM" back to the iWeb server, synchronized the email and within 60 minutes everything was back to "normal".
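For reference, the elapsed times in this timeline can be worked out with a short sketch. The clock times are the approximate ones given above; the event labels and the function name are chosen only for this example.

```python
from datetime import datetime


def at(hour: int, minute: int) -> datetime:
    # All events took place on Feb. 10, 2015 (Eastern time).
    return datetime(2015, 2, 10, hour, minute)


# Approximate times as reported in this post.
events = [
    ("staff alerted", at(13, 55)),
    ("iWeb acknowledged", at(14, 5)),
    ("failover started", at(14, 20)),
    ("failover completed", at(14, 40)),
    ("iWeb server back online", at(17, 0)),
    ("all clear from iWeb", at(19, 30)),
]


def minutes_since_first(timeline):
    """Minutes elapsed between the first event and each event in the timeline."""
    start = timeline[0][1]
    return {label: int((when - start).total_seconds() // 60) for label, when in timeline}


for label, minutes in minutes_since_first(events).items():
    print(f"{label}: +{minutes} min")
```

The failover completed 45 minutes after the first alert, and the iWeb server came back 185 minutes (just over 3 hours) after it.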
According to iWeb's chronology of the incident, the outage was caused by a failure within an Uninterruptible Power Supply (UPS).
Lessons Learned
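Although things generally went according to plan during this outage, a few of our clients experienced 3 hours of website downtime.  There were two primary reasons for this, described in items 1 and 2 below.  Items 3 and 4 cover other issues that the outage brought to light.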
1. Newer Sites:
A flaw was discovered in our failover strategy whereby some of our relatively newer sites were not properly prepared for launching on the backup server despite their inclusion in the backup files.  This was due to human error in the design of our failover strategy and we have now devised a method to ensure that this will not occur again in the future.  By 4:50 pm we had recognized this failure and taken steps to rectify the problems with these sites.
2. Sites that are not using We Think Solutions Name Servers:
A few of our clients have historically kept control of their own Name Server records rather than utilizing We Think Solutions Name Servers.  These are mostly clients who operate their own email servers.   The downside to this approach is that our failover strategy relies on the Name Servers being in our control.  We will be contacting all clients who fall into this category to discuss the best approach moving forward.
3. Menu problems:
Some of our sites experienced temporary issues with their menus due to the architecture of our website framework system.  We have taken the necessary steps to ensure that this will not occur in the future.
4. Email:
We have always offered a basic email service to our website clients.  Yesterday's outage underlines the fact that the potential exists for your WTS email service to be down for a period of time.  You should also be aware that although we make frequent backups of our servers, there is a remote possibility that portions of your email history could be lost in the event of a disk failure.  If you use POP to access your email then it is likely that our servers do not contain any of your email history.  If your company's email is mission critical to your business and the loss of several email messages would be very detrimental, we recommend using a cloud-based service such as Google Apps/Gmail or Office 365.  These providers offer perpetual backups and have robust redundancy built into their systems.  The cost of these third-party email services is very affordable at approximately $5.00/month per user.  If you have reservations about using these cloud computing services then we recommend that you contact a networking/email expert about a dedicated email server such as an Exchange Server.
Editorial Note: We have been using Gmail since 2011 and it was one of the best business decisions we have made.  We are an authorized reseller of Google Apps so please contact us if you are interested in moving to Gmail.
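The first two lessons above amount to a coverage check that can be run before an outage rather than discovered during one. The sketch below is one hypothetical way to express it, not the method we have actually adopted: the `Site` fields, the function name, and all domain and name server hostnames are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Site:
    domain: str
    in_backup_files: bool            # included in the weekly backup
    prepared_on_backup_server: bool  # configured so it can be launched on the backup server
    name_servers: Tuple[str, ...]    # name servers the domain currently uses


def failover_gaps(sites: Iterable[Site], controlled_name_servers: Iterable[str]) -> Dict[str, List[str]]:
    """Return each site the failover would not protect, with the reasons why."""
    controlled = {ns.lower() for ns in controlled_name_servers}
    gaps: Dict[str, List[str]] = {}
    for site in sites:
        reasons = []
        if not site.in_backup_files:
            reasons.append("not in backup files")
        if not site.prepared_on_backup_server:
            reasons.append("not prepared on backup server")
        # Re-pointing only helps when every name server for the domain is ours.
        if not site.name_servers or any(ns.lower() not in controlled for ns in site.name_servers):
            reasons.append("name servers outside our control")
        if reasons:
            gaps[site.domain] = reasons
    return gaps


# Hypothetical data for illustration only.
ours = ["ns1.hosting.example", "ns2.hosting.example"]
sites = [
    Site("established.example", True, True, ("ns1.hosting.example", "ns2.hosting.example")),
    Site("newer.example", True, False, ("ns1.hosting.example", "ns2.hosting.example")),
    Site("own-dns.example", True, True, ("ns1.client-dns.example",)),
]
for domain, reasons in failover_gaps(sites, ours).items():
    print(domain, "->", ", ".join(reasons))
```

With this sample data the check flags `newer.example` (item 1) and `own-dns.example` (item 2), while `established.example` passes.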
History has taught us that data centre outages are inevitable in our business.  We recognize that our clients cannot endure extended periods of downtime so we have devised a failover strategy to mitigate the effect of any such outages.
Yesterday was the first opportunity for us to put our failover strategy into action.  As is generally the case with any new technology, we discovered some minor flaws in our system but in general it worked as planned and most of our clients' websites were up and running within 45 minutes of the original outage.
We apologize to those clients who did not benefit from our failover strategy during yesterday's outage.  We have learned lessons during this event and we can assure you that the same mistakes will not be repeated when we are required to implement this strategy in the future.
We will be in contact with iWeb to determine what steps they are taking to ensure that a similar event will not occur in the future and if necessary we will consider moving to another data centre.
Please feel free to contact me directly if you have any questions about the above.
Jim Donovan
1-800-231-1020 x38
Posted under Website Maintenance