Do you monitor for IT change?
Ensuring the continuous availability of critical systems is the Holy Grail for IT organizations. To move towards that goal, IT organizations have invested heavily over the past decades to build sophisticated systems that help them prevent, monitor, detect, and remediate failures in the least amount of time - maximizing uptime and minimizing downtime.
Historically, bottlenecks in infrastructure tended to be around hardware failures, capacity constraints and network bandwidth availability. To solve these problems, organizations invested in solutions to monitor these aspects of their daily operations. Today it is routine for organizations to monitor hardware, utilization (CPU, memory, network) and network traffic for anomalies or failures. When we ask organizations whether they monitor these parts of their infrastructure, we invariably see heads nodding in assent.
Organizations' capabilities in these areas have become more sophisticated, and as a result, the bottlenecks have shifted. Hardware failures and capacity overloads still occur occasionally, but they are rare and, in most cases, resolved easily and at little cost. Today, the cause of most downtime is change. Industry research shows that up to 80% of system unavailability is caused by incorrectly applied changes. The natural question that follows is: Do you monitor and control change?
We tend to see far fewer people nodding their heads in agreement this time. This is not surprising, because downtime caused by change is a more difficult beast to manage. To begin with, change happens almost continuously, and with much greater frequency than a hardware upgrade, for example. Second, change tends to be complex and interdependent, with multiple parties involved. In most organizations, the impact, dependencies and ramifications of a change are not fully known until it is actually deployed in production. Finally, the ways in which change occurs in an organization have their roots deep in the organization’s culture and behavior.
Since most unavailability is caused by change, getting control of change in your environment would be the logical next step in the evolution of systems management for maintaining high availability. Given the difficulties outlined above, how might this be done? We will look at availability in two closely related, but separate categories. First, what can be done to increase uptime? And second, what can be done to recover quickly from downtime? In ITIL (Information Technology Infrastructure Library) terms, increasing uptime is about protecting the service while making changes as part of Change Management. Decreasing downtime is about quick resolution of incidents and is part of Incident Management. Figure 1 shows these two components of increased availability.
Organizations generally increase service uptime by improving their change process, most commonly by implementing a change management system. As the graphic below indicates, the key actions required for a better change process are:
Define Change Policies
Define the rules and circumstances in which changes can be implemented. These rules can be encoded using a change management system (e.g. “high priority changes require two approvals”), but validating that the rules were followed is still an exercise in faith and hope in most cases.
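To make the idea of an encoded rule concrete, here is a minimal sketch of how the example policy "high priority changes require two approvals" might be expressed. It does not reflect any particular change management product: the names ChangeRequest, REQUIRED_APPROVALS, required_approvals and is_approved are hypothetical, and the default of one approval for other priorities is an assumption made only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical policy table: minimum number of distinct approvers per priority.
# Only the "high" entry comes from the example rule; the default is assumed.
REQUIRED_APPROVALS = {"high": 2}
DEFAULT_REQUIRED_APPROVALS = 1


@dataclass
class ChangeRequest:
    """Illustrative record of a requested change (not a real product schema)."""
    change_id: str
    priority: str
    approvers: list[str] = field(default_factory=list)


def required_approvals(priority: str) -> int:
    """Look up how many approvals the policy demands for this priority."""
    return REQUIRED_APPROVALS.get(priority, DEFAULT_REQUIRED_APPROVALS)


def is_approved(request: ChangeRequest) -> bool:
    """A request passes only if enough *distinct* people approved it."""
    distinct_approvers = set(request.approvers)
    return len(distinct_approvers) >= required_approvals(request.priority)
```

Note that a check like this only confirms that the request record satisfies the policy. It says nothing about whether the change that was actually made matched the request, which is the gap described above.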
Measure Actions and Events
One of the key tenets of ITIL is measurement. How can change activity be measured and reconciled against the documented change process? Do you know how many changes were made in the organization? Which of those changes were made within prescribed change windows? Change management systems help somewhat by allowing the automated tracking of change requests. However, reconciling these requests against change activity is still a manual process.
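The reconciliation described here can be sketched in a few lines, assuming two inputs are available: a list of approved change requests with their change windows, and a list of changes actually observed on systems. Both record types (ApprovedChange, ObservedChange), the matching by system name, and the three result labels are assumptions chosen for illustration, not a description of how any specific tool works.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ApprovedChange:
    """Hypothetical approved request with its prescribed change window."""
    change_id: str
    system: str
    window_start: datetime
    window_end: datetime


@dataclass(frozen=True)
class ObservedChange:
    """Hypothetical record of a change detected on a system."""
    system: str
    timestamp: datetime


def classify(observed: ObservedChange, approved: list[ApprovedChange]) -> str:
    """Label one observed change as in_window, out_of_window or unauthorized."""
    # Assumption: a request matches an observed change if it names the same system.
    matching = [a for a in approved if a.system == observed.system]
    if not matching:
        return "unauthorized"
    if any(a.window_start <= observed.timestamp <= a.window_end for a in matching):
        return "in_window"
    return "out_of_window"


def reconcile(observed_changes: list[ObservedChange],
              approved: list[ApprovedChange]) -> dict[str, int]:
    """Count observed changes per category to answer the questions above."""
    counts = {"in_window": 0, "out_of_window": 0, "unauthorized": 0}
    for observed in observed_changes:
        counts[classify(observed, approved)] += 1
    return counts
```

The counts answer the questions posed above: how many changes were made, and how many fell within prescribed change windows. The hard part in practice is producing the list of observed changes, which is exactly the data that most organizations do not collect automatically.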
Enforce Change Processes
Once change policies are defined and socialized within the organization, there must be a way to enforce them. Today’s best practices rely on edicts, handed down from higher in the organization, to adhere to the change process. To verify that people follow the process, elaborate reporting and documentation requirements have been put in place to minimize the risk of out-of-process changes. In other words, enforcing the change process is largely manual.
As Figure 2 shows, organizations trying to execute the actions required to achieve increased uptime are stymied by the manual effort involved, including socializing policies, relying on threats for enforcement, and manually reconciling change activity against process.