Best Practices in High Availability Cluster MultiProcessing

By : IBM
Published : Jul 16, 2007
Length : 24
Type : White Paper
 
Overview :

IBM High Availability Cluster Multiprocessing supports a wide variety of configurations, and provides the cluster administrator with a great deal of flexibility. With this flexibility comes the responsibility to make wise choices: there are many cluster configurations that are workable in the sense that the cluster will pass verification and come online, but which are not optimal in terms of providing availability.

Read about the choices that the cluster designer can make, and about the alternatives that make for the highest level of availability, in this IBM white paper.

 
A High Availability Solution helps ensure that the failure of any component of the solution, be it hardware, software, or system management, does not cause the application and its data to be inaccessible to the user community. This is achieved through the elimination or masking of both planned and unplanned downtime. High availability solutions should eliminate single points of failure (SPOF) through appropriate design, planning, selection of hardware, configuration of software, and carefully controlled change management discipline.
While the principle of "no single point of failure" is generally accepted, it is sometimes deliberately or inadvertently violated. It is inadvertently violated when the cluster designer does not appreciate the consequences of the failure of a specific component. It is deliberately violated when the cluster designer chooses not to put redundant hardware in the cluster. The most common instance of this is when cluster nodes are chosen that do not have enough I/O slots to support redundant adapters. This choice is often made to reduce the price of a cluster, and is generally a false economy: the resulting cluster is still more expensive than a single node, but has no better availability.
A cluster should be carefully planned so that every cluster element has a backup (some would say two of everything!). Best practice is to use either the paper or the online planning worksheets for this planning, and to save them as part of the ongoing documentation of the system. Fig 1.0 provides a list of typical SPOFs within a cluster.
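The "two of everything" rule can be checked mechanically once the planning worksheets are filled in. The following is a minimal sketch, assuming a hypothetical inventory that maps each cluster element to the number of independent instances available. The element names, the counts, and the function `find_spofs` are illustrative assumptions, not part of HACMP or its worksheets.

```python
def find_spofs(inventory, minimum=2):
    """Return the cluster elements that have fewer than `minimum` instances."""
    return sorted(name for name, count in inventory.items() if count < minimum)


# Hypothetical inventory: element -> number of independent instances.
inventory = {
    "node": 2,
    "network adapter per node": 1,
    "network": 2,
    "disk adapter per node": 2,
    "power source": 1,
}

# Every element listed here has no backup and needs attention or a risk decision.
spofs = find_spofs(inventory)
print(spofs)
```

Elements reported by such a check are candidates either for added redundancy or for an explicit decision to tolerate the risk.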

Risk Analysis
In reality, however, it is sometimes not feasible to eliminate every SPOF within a cluster. Examples may include the network and the site. Risk analysis techniques should be used to determine which SPOFs simply must be dealt with and which can be tolerated. One should:
Study the current environment. An example would be that the server room is on a properly sized UPS but there is no disk mirroring today.
Perform requirements analysis. How much availability is required, and what is the acceptable likelihood of a long outage?
Hypothesize all possible vulnerabilities. What could go wrong?
Identify and quantify risks. Estimate the cost of a failure versus the probability that it occurs.
Evaluate countermeasures. What does it take to reduce the risk or consequence to an acceptable level?
Finally, make decisions, create a budget and design the cluster.
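The "identify and quantify" and "evaluate countermeasures" steps above can be sketched as a simple expected-loss comparison. This is only one possible way to do the arithmetic, not a method prescribed by the paper: it assumes each risk has an estimated yearly probability and outage cost, and that the countermeasure removes the risk entirely, which is a simplification. The class `Risk`, the function `worth_mitigating`, and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    annual_probability: float  # estimated chance of the failure in one year
    outage_cost: float         # estimated cost of one such outage
    mitigation_cost: float     # estimated yearly cost of the countermeasure

    @property
    def expected_annual_loss(self) -> float:
        # Cost of a failure weighted by the probability that it occurs.
        return self.annual_probability * self.outage_cost


def worth_mitigating(risks):
    """Risks whose expected annual loss exceeds the countermeasure cost, largest loss first."""
    selected = [r for r in risks if r.expected_annual_loss > r.mitigation_cost]
    return sorted(selected, key=lambda r: r.expected_annual_loss, reverse=True)


# Hypothetical figures for illustration only.
risks = [
    Risk("unmirrored disk", 0.10, 200_000, 5_000),
    Risk("single network", 0.05, 100_000, 8_000),
    Risk("single power feed", 0.20, 50_000, 3_000),
    Risk("site loss", 0.001, 2_000_000, 150_000),
]

for risk in worth_mitigating(risks):
    print(risk.name, risk.expected_annual_loss)
```

Risks that fall out of the list are not necessarily ignored; they are the ones to be consciously tolerated and documented.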

Cluster Components
Here are the recommended practices for important cluster components.
Nodes
HACMP supports clusters of up to 32 nodes, with any combination of active and standby nodes. While it is possible to have all nodes in the cluster running applications (a configuration referred to as "mutual takeover"), the most reliable and available clusters have at least one standby node - one node that is normally not running any applications, but is available to take them over in the event of a failure on an active node.
Additionally, it is important to pay attention to environmental considerations. Nodes should not have a common power supply - which may happen if they are placed in a single rack. Similarly, building a cluster of nodes that are actually logical partitions (LPARs) within a single footprint is useful as a test cluster, but should not be considered for availability of production applications.
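A shared power supply or a shared footprint is easy to overlook, so it can help to record where each node lives and look for overlaps. The sketch below assumes a hypothetical `placement` mapping from node name to its environmental dependencies; the keys ("power", "frame") and identifiers are illustrative assumptions.

```python
from collections import defaultdict


def shared_dependencies(placement):
    """Return each (kind, identifier) dependency used by more than one node."""
    users = defaultdict(list)
    for node, deps in placement.items():
        for kind, ident in deps.items():
            users[(kind, ident)].append(node)
    # Keep only the dependencies that two or more nodes have in common.
    return {dep: nodes for dep, nodes in users.items() if len(nodes) > 1}


# Hypothetical placement: both nodes draw power from the same unit.
placement = {
    "nodeA": {"power": "pdu-1", "frame": "frame-1"},
    "nodeB": {"power": "pdu-1", "frame": "frame-2"},
}

print(shared_dependencies(placement))
```

Any entry in the result is an environmental single point of failure for the nodes listed against it.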
Nodes should be chosen that have sufficient I/O slots to install redundant network and disk adapters, that is, twice as many slots as would be required for single node operation. This naturally suggests that processors with small numbers of slots should be avoided. Use of nodes without redundant adapters should not be considered best practice; blades are a prominent example of such nodes. And, just as every cluster resource should have a backup, the root volume group in each node should be mirrored, or be on a RAID device.
Nodes should also be chosen so that when the production applications are run at peak load, there are still sufficient CPU cycles and I/O bandwidth to allow HACMP to operate. The production application should be carefully benchmarked (preferable) or modeled (if benchmarking is not feasible) and nodes chosen so that they will not exceed 85% busy, even under the heaviest expected load. 
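The 85% guideline can be turned into a quick sizing check once peak figures are available from benchmarking or modeling. The sketch below assumes peak utilization is expressed as whole percentages of one node's capacity and that loads simply add; treating a takeover as part of the "heaviest expected load" is an assumption of this example. The function name and numbers are hypothetical.

```python
BUSY_LIMIT_PERCENT = 85  # threshold stated in the text


def node_is_adequate(app_peaks, limit=BUSY_LIMIT_PERCENT):
    """True if the summed peak utilization (in percent) does not exceed the limit."""
    return sum(app_peaks) <= limit


# Hypothetical peaks: the node's own application plus one taken over after a failure.
own_app = 50
taken_over_app = 30

print(node_is_adequate([own_app]))                  # normal operation
print(node_is_adequate([own_app, taken_over_app]))  # after takeover
print(node_is_adequate([own_app, 40]))              # a heavier takeover load
```

Summing peaks is deliberately conservative, since the applications may not peak at the same time; a benchmark of the combined workload gives a better answer where it is feasible.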