IBM HACMP supports a wide variety of configurations and gives the cluster administrator a great deal of flexibility. With this flexibility comes the responsibility to make wise choices. This paper discusses the choices that the cluster designer can make and the alternatives that provide the highest level of availability.
HACMP
HIGH AVAILABILITY CLUSTER
MULTIPROCESSING
BEST PRACTICES
July 2007
Alex Abderrazag. Version 02.00.
Table of Contents
I. Overview
II. Designing High Availability
Risk Analysis
III. Cluster Components
Nodes
Networks
Adapters
Applications
IV. Testing
V. Maintenance
Upgrading the Cluster Environment
VI. Monitoring
VII. HACMP in a Virtualized World
Maintenance of the VIOS partition - Applying Updates
VIII. Summary
IX. References
X. About the Authors
HACMP Best Practices
WHITE PAPER
Overview
The IBM High Availability Cluster Multiprocessing (HACMP™) product first shipped in 1991 and is now in its 14th release, with over 60,000 HACMP clusters in production worldwide. It is generally recognized as a robust, mature high availability product. HACMP supports a wide variety of configurations and gives the cluster administrator a great deal of flexibility. With this flexibility comes the responsibility to make wise choices: many cluster configurations are workable in the sense that the cluster will pass verification and come online, but are not optimal in terms of providing availability. This document discusses the choices that the cluster designer can make, and suggests the alternatives that provide the highest level of availability*.
Designing High Availability
"...A fundamental design goal of (successful) cluster design is the elimination of single points of failure (SPOFs)."
A High Availability Solution helps ensure that the failure of any component of the solution, be it hardware, software, or system management, does not cause the application and its data to be inaccessible to the user community. This is achieved through the elimination or masking of both planned and unplanned downtime. High availability solutions should eliminate single points of failure (SPOFs) through appropriate design, planning, selection of hardware, configuration of software, and a carefully controlled change management discipline.
While the principle of "no single point of failure" is generally accepted, it is sometimes deliberately or inadvertently violated. It is inadvertently violated when the cluster designer does not appreciate the consequences of the failure of a specific component. It is deliberately violated when the cluster designer chooses not to put redundant hardware in the cluster. The most common instance of this is when cluster nodes are chosen that do not have enough I/O slots to support redundant adapters. This choice is often made to reduce the price of a cluster, and is generally a false economy: the resulting cluster is still more expensive than a single node, but has no better availability.
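The effect of a remaining single point of failure can be illustrated with simple availability arithmetic. The sketch below is not taken from the paper: it assumes that component failures are independent, and the availability figures (`NODE`, `SHARED_PART`) are made-up numbers chosen only for illustration. It shows that a non-redundant component placed in series with redundant nodes caps the availability of the whole cluster below that component's own availability.

```python
def series(*avail):
    """Availability when ALL components must work (each one is a SPOF)."""
    result = 1.0
    for a in avail:
        result *= a
    return result


def parallel(*avail):
    """Availability when ANY one of the redundant components suffices."""
    all_fail = 1.0
    for a in avail:
        all_fail *= 1.0 - a
    return 1.0 - all_fail


# Hypothetical figures, for illustration only (not from the paper).
NODE = 0.99          # availability of a single node
SHARED_PART = 0.995  # availability of one non-redundant shared component

# Two redundant nodes, considered on their own.
two_nodes = parallel(NODE, NODE)

# The same two nodes, but depending on a single shared component.
with_spof = series(two_nodes, SHARED_PART)

# The same cluster with the shared component duplicated as well.
fully_redundant = series(two_nodes, parallel(SHARED_PART, SHARED_PART))

print(f"redundant nodes only:        {two_nodes:.6f}")
print(f"redundant nodes plus a SPOF: {with_spof:.6f}")
print(f"two of everything:           {fully_redundant:.6f}")
```

Under these assumptions the cluster with the single shared component can never be more available than that component, however many nodes are added. Real failures are rarely fully independent, so the numbers should be read as a rough comparison rather than a prediction.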
A cluster should be carefully planned so that every cluster element has a backup (some would say two of everything!). Best practice is to use either the paper or the online planning worksheets for this planning, and to save them as part of the ongoing documentation of the system. Fig 1.0 provides a list of typical SPOFs within a cluster.
"...cluster design decisions should be based on whether they contribute to availability (that is, eliminate a SPOF) or detract from availability (gratuitously complex)."
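The "two of everything" rule lends itself to a mechanical check once the planning worksheet has been filled in. The following is a minimal sketch, assuming the worksheet has been transcribed by hand into a plain dictionary of component counts. The function name `find_spofs`, the component names, and the counts are all hypothetical; they are not taken from Fig 1.0 or from any HACMP tool.

```python
def find_spofs(inventory, minimum=2):
    """Return the names of components with fewer than `minimum` instances."""
    return sorted(name for name, count in inventory.items() if count < minimum)


# Hypothetical inventory transcribed from a planning worksheet.
inventory = {
    "nodes": 2,
    "power sources": 2,
    "networks": 2,
    "network adapters per node": 1,
    "disk adapters per node": 2,
    "sites": 1,
}

spofs = find_spofs(inventory)
for name in spofs:
    print(f"possible SPOF: {name}")
```

A check like this only counts components. Whether a flagged item must be fixed or can be tolerated remains a design decision.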
* This document applies to HACMP running under AIX®, although general best practice concepts are also applicable to HACMP running under Linux®.
Fig 1.0 Eliminating SPOFs
Risk Analysis
Sometimes, however, it is simply not feasible to eliminate every SPOF within a cluster. Examples may include the network ¹ and the site ². Risk analysis techniques should be used to determine which SPOFs simply must be dealt with and which can be tolerated. One should:
Study the current environment. An example would be that the server room is on a properly sized UPS but there is no disk mirroring today.
Perform requirements analysis. How much availability is required and what is the acceptable...