On Saturday, February 17, 2018, thousands of AT&T email subscribers experienced an unexpected five-hour network outage. During the outage, users with email accounts on att.net received password error messages when they tried to retrieve or send email. AT&T never notified or updated users about the outage. Instead, they had to ask on social media or web-based discussion groups to learn that the outage was attributed to a server issue.
As an impacted subscriber I couldn’t help but apply my IT Service Management process background to consider what ITSM best practices might have prevented this lengthy outage.
Monitoring and Event Management
An automated Event Management monitoring process should have been implemented to ensure consistent failure detection and failover. An automatic server failover solution could have kept the email service available in the event of a server failure: when automatic detection finds an error on the primary server, traffic is automatically redirected to a backup server.
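To make the failover idea concrete, here is a minimal Python sketch. It assumes each server exposes some kind of health probe; the `Server` class, the `pick_active_server` function and the server names are hypothetical illustrations, not anything AT&T is known to run.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Server:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe for this server


def pick_active_server(servers: Sequence[Server]) -> Server:
    """Return the first server, in priority order, whose health probe passes."""
    for server in servers:
        try:
            if server.is_healthy():
                return server
        except Exception:
            # A probe that raises is treated the same as a failed probe.
            continue
    raise RuntimeError("no healthy server available")


# Simulate a failed primary and a working backup (assumed names).
primary = Server("mail-primary", is_healthy=lambda: False)
backup = Server("mail-backup", is_healthy=lambda: True)

active = pick_active_server([primary, backup])
print(active.name)
```

In a real deployment the probe would be a network check run on a schedule, and the routing change would be made by a load balancer or DNS layer rather than by application code.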
Event Management monitoring details, including the impacted components, should automatically trigger high-priority incident and problem records that are routed to, and generate notifications for, the teams responsible for a timely review and resolution. The status of these incidents should also be promptly communicated to support staff and end users in real time via automated knowledge management tools.
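As a rough illustration of turning a monitoring event into a classified and routed incident, the sketch below uses a simple lookup table. The component names, priority labels and team names in `ROUTING` are assumptions made up for the example, as are the `open_incident` and `status_message` helpers.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical mapping: affected component -> (priority, owning team).
ROUTING = {
    "mail-auth": ("P1", "email-platform-oncall"),
    "webmail-ui": ("P2", "web-frontend"),
}


@dataclass
class Incident:
    component: str
    summary: str
    priority: str
    assigned_team: str
    opened_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def open_incident(component: str, summary: str) -> Incident:
    """Create an incident record, classified and routed by component."""
    # Unknown components fall back to a low priority and the service desk.
    priority, team = ROUTING.get(component, ("P3", "service-desk"))
    return Incident(component, summary, priority, team)


def status_message(incident: Incident) -> str:
    """Build a short status line suitable for staff or end-user updates."""
    return (
        f"[{incident.priority}] {incident.component}: {incident.summary} "
        f"(assigned to {incident.assigned_team})"
    )


incident = open_incident("mail-auth", "password errors on login")
print(status_message(incident))
```

The same status line could be published to a status page, which is the kind of proactive communication subscribers did not receive during the outage.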
Event Management precautions should be considered when planning for service, infrastructure or application upgrades or enhancements to immediately detect any deviation from normal or expected operation. Active monitoring tools should be maintained to poll key components to determine their status and availability and be capable of generating an alert for any exceptions that require immediate tool or team action.
Proactive Capacity Management
An automated, ongoing Capacity Management monitoring system should have been implemented to monitor, measure, report and review the current performance of services and components, responding to all capacity-related threshold events and triggering immediate corrective action. Monitoring, assessing and tracking current levels of resource utilization and service performance should have informed estimates of future requirements when planning server upgrades and enhancements. Proactively modelling predicted changes in IT services, and identifying the changes that need to be made to components, ensures that the appropriate resources are available to meet service demand.
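The two ideas above, threshold events and forecasting future demand, can be sketched in a few lines of Python. The utilization figures are invented for illustration, and the straight-line least-squares trend is only one simple modelling choice among many.

```python
from typing import List, Sequence, Tuple


def threshold_breaches(
    samples: Sequence[float], threshold: float
) -> List[Tuple[int, float]]:
    """Return (index, value) for every sample at or above the threshold."""
    return [(i, v) for i, v in enumerate(samples) if v >= threshold]


def linear_forecast(samples: Sequence[float], steps_ahead: int) -> float:
    """Extrapolate a least-squares straight line fitted to the samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    slope = sum(
        (x - mean_x) * (y - mean_y) for x, y in enumerate(samples)
    ) / sum((x - mean_x) ** 2 for x in range(n))
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)


# Hypothetical weekly CPU utilization, in percent.
weekly_cpu = [50, 55, 60, 65, 70]

print(threshold_breaches(weekly_cpu, threshold=65))  # [(3, 65), (4, 70)]
print(linear_forecast(weekly_cpu, steps_ahead=4))    # 90.0
```

With these made-up numbers the trend reaches 90 percent four weeks out, which is the kind of signal that should prompt an upgrade before users feel the shortfall.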
Availability Management Considerations
During the design of any service or infrastructure upgrade or enhancement, consideration should be given to ensuring that the change meets the availability requirements of the business and its customers without compromising the performance of existing services. In the event of a service failure, the service and its supporting components should be reinstated so that normal business operations can resume as quickly as possible.
Service Validation and Testing
A Service Validation and Testing plan should be implemented to ensure that each release of a new or upgraded service, application or infrastructure component is built, tested and deployed without risk to the production environment. Service acceptance criteria should be defined for testing new or enhanced services to ensure uninterrupted service delivery. Integrating automated testing tools and monitoring systems into the service design and transition lifecycle stages would enable prompt detection and removal of functional defects before changes are released to the live environment.
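One simple way to enforce service acceptance criteria is a release gate that only approves a change when every check passes. The sketch below is a toy version: the criteria names and their pass/fail results are hypothetical stand-ins for real automated tests.

```python
from typing import Callable, Dict, List, Tuple


def evaluate_release(
    criteria: Dict[str, Callable[[], bool]]
) -> Tuple[bool, List[str]]:
    """Run every acceptance check; approve only if none of them fail."""
    failed = [name for name, check in criteria.items() if not check()]
    return (not failed, failed)


# Hypothetical acceptance criteria for an email service release.
criteria = {
    "login succeeds with valid password": lambda: True,
    "send and retrieve round trip": lambda: True,
    "failover drill completes": lambda: False,  # simulated failing check
}

approved, failed = evaluate_release(criteria)
print("approved" if approved else f"blocked by: {failed}")
```

Here the release is blocked because the failover drill did not complete, so the defect is caught before it reaches the live environment.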
Benefits
By adopting a preventative and proactive ITSM approach, AT&T would foster an agile environment that gives the business the ability to respond to changing conditions quickly and without disruption to services, thereby enabling an environment of continuous improvement in the following areas:
• Increased customer satisfaction through continuous service
• Increased service availability and decreased service interruption
• Reduced unplanned labor and costs for the business and support staff
Have you experienced a similar outage of an online service? What suggestions for improvement did you make?