Microsoft apologizes for three-day Outlook.com outage, says caching issue was to blame
Microsoft today confirmed it has finally resolved an issue with Outlook.com that has been affecting some users for up to three days. The company apologized multiple times to those affected, and explained what happened as well as what it has done to prevent the problem from recurring.
The issue began on the morning of August 14 (EST), with many users reporting problems accessing Outlook.com, SkyDrive, and People services. While Microsoft’s own Live.com dashboard showed the file storage and contacts services came back online rather quickly, many continued to experience problems with Outlook.com for hours (I didn’t have access for about three hours), and some continued to suffer until early this morning (August 17).
At 4:34 AM EST this morning, Microsoft posted a more detailed update about the issue, saying it made two changes to make the service more resilient in the future. The first involved increasing network bandwidth in the affected part of the system, and the second involved changing the way error handling is done for devices using Exchange ActiveSync.
The company blamed this particular incident on a failure of the caching service that interfaces with devices using Exchange ActiveSync, including most smartphones. The failure caused these devices to receive an error and continuously try to connect to Microsoft’s service, which resulted in a flood of traffic that the company’s servers did not handle properly.
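Microsoft has not said what its error-handling change looks like. As a purely illustrative sketch, a common way to keep clients from hammering a failing service is exponential backoff with random jitter, so that retries spread out instead of arriving all at once. The function name `backoff_delays` and its parameters are hypothetical and not taken from Exchange ActiveSync.

```python
import random


def backoff_delays(max_attempts, base=1.0, cap=300.0, rng=None):
    """Return one wait time (in seconds) per retry attempt.

    Hypothetical helper: the upper bound doubles on every attempt, up to
    `cap`, and the actual wait is drawn uniformly below that bound so that
    many clients do not retry at the same moment.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Print the waits a single client would use across six failed attempts.
    for attempt, delay in enumerate(backoff_delays(6), start=1):
        print(f"attempt {attempt}: wait {delay:.2f}s")
```

A client that retries immediately after every error behaves like a constant delay of zero, which is the pattern that turns a single failure into a flood.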
As a result, some users could not access their accounts and Microsoft was forced to temporarily block access via Exchange ActiveSync. The company could then restore access to Outlook.com via the Web and restore the sharing features of SkyDrive, which were fully stabilized “within a few hours of the initial incident.”
Unfortunately, Microsoft still had a “significant” backlog of Exchange ActiveSync requests to work through, which it had to do slowly in order to prevent the issue from resurfacing, meaning “some customers remained impacted for a longer period of time.” The company says the backlog is now clear and the service has been restored for all.
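The slow recovery described above can be pictured as releasing a backlog in small, fixed-size batches rather than all at once. The sketch below is only an assumption about the general idea; Microsoft has not described its actual mechanism, and `drain_in_batches` is a hypothetical name.

```python
from collections import deque


def drain_in_batches(backlog, batch_size):
    """Yield the backlog in order, at most `batch_size` items per step.

    Hypothetical helper: each yielded batch stands for the requests that
    are let through in one time window while the service recovers.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    queue = deque(backlog)
    while queue:
        take = min(batch_size, len(queue))
        yield [queue.popleft() for _ in range(take)]


if __name__ == "__main__":
    # Seven pending requests released three at a time.
    for step, batch in enumerate(drain_in_batches(range(7), 3), start=1):
        print(f"window {step}: {batch}")
```

The trade-off matches what the company reported: a smaller batch lowers the risk of a second flood but leaves the requests at the back of the queue waiting longer.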
Here’s the full explanation:
We want to apologize to our customers who were affected by the outage on Outlook.com this week. We have restored access to all accounts and have made changes so that the service will be more resilient in the future. We realize that we have a responsibility to the customers who use our services to communicate and share with the people they care most about, and we apologize for letting those customers down this week.
Our first priority is to the health of the services, and we will learn from this incident and work to improve the experience of all our customers. As part of that, we would also like to provide more detail about what happened.
This incident was a result of a failure in a caching service that interfaces with devices using Exchange ActiveSync, including most smart phones. The failure caused these devices to receive an error and continuously try to connect to our service. This resulted in a flood of traffic that our services did not handle properly, with the effect that some customers were unable to access their Outlook.com email and unable to share their SkyDrive files via email.
In order to stabilize the overall email service, we temporarily blocked access via Exchange ActiveSync. This allowed us to restore access to Outlook.com via the web and restore the sharing features of SkyDrive. These parts of the service were fully stabilized within a few hours of the initial incident. A significant backlog of Exchange ActiveSync requests accumulated as we worked to stabilize access. To avoid another flood of traffic, we needed to restore access to Exchange ActiveSync slowly, which meant that some customers remained impacted for a longer period of time.
We have learned from this incident, and have made two key changes to harden our systems against future failure – one that involved increasing network bandwidth in the affected part of the system, and one that involved changing the way error handling is done for devices using Exchange ActiveSync. We will continue to monitor the system and make additional changes as needed to keep the service healthy.
We are now fully through the backlog and have restored service so all customers should have normal access from all of their devices. We want to apologize to everyone who was affected by the outage, and we appreciate the patience you have shown us as we worked through the issues.
This week’s Outlook.com problems aren’t completely resolved. A new problem report posted today at 2:36 PM EST says “A small percentage of mobile users may experience intermittent issues while syncing email.” We’ll let you know when it’s completely resolved.
Top Image Credit: Alec Perkins