Any organization with more than one location must often make important tradeoffs about network communication among its locations. In an ideal world where network bandwidth is infinite, latency is zero, communication is free, and links are never down, such tradeoffs would be irrelevant: it would be sensible to centralize all infrastructure, and everyone would have cheap, reliable, and easy access to those centralized resources. In the real world, however, there are tradeoffs to be made among cost, what can be purchased in different locations, and physical limits such as the speed of light.

People designing networks and distributed applications are familiar with these tradeoffs and often balance them skillfully in working through a problem. While many IT experts have great intuition about the design tradeoffs involved in distributed access to business data, it was only recently that the research community codified an important tradeoff in a simple and useful theorem.

This theorem relates the tradeoffs among consistency, availability, and partition-tolerance (CAP) for systems that provide distributed access to data. In this terminology, consistent means that any part of the overall system, if and when it responds to a request for data, provides precisely the correct data. Available means that all parts of the overall system are “always up,” and any component will promptly provide an answer to any request. Finally, partition-tolerant means that the system continues to function in the face of network disruptions (or partitions). The “CAP theorem” says that while a system for distributed data access can possess any two of the three properties of consistency, availability, and partition-tolerance, it is impossible to achieve all three simultaneously.
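The choice the theorem forces can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions: one primary copy of the data, one remote replica, and a single flag standing in for the network link. The names Primary, Replica, Unavailable, link_up, and prefer_consistency are hypothetical and do not come from any real product or library.

```python
class Unavailable(Exception):
    """Raised when a replica refuses to answer rather than risk stale data."""


class Primary:
    """The authoritative copy of a single data item."""

    def __init__(self, value):
        self.value = value


class Replica:
    """A remote copy that must pick a side when the link to the primary is down."""

    def __init__(self, primary, prefer_consistency):
        self.primary = primary
        self.prefer_consistency = prefer_consistency
        self.cached = primary.value  # last value seen while the link was up
        self.link_up = True          # set to False to simulate a partition

    def read(self):
        if self.link_up:
            # No partition: refresh from the primary, so the answer is both
            # correct and prompt.
            self.cached = self.primary.value
            return self.cached
        if self.prefer_consistency:
            # Partitioned, consistency first: give up availability.
            raise Unavailable("cannot reach the primary; refusing to answer")
        # Partitioned, availability first: answer, but possibly with stale data.
        return self.cached


primary = Primary("v1")
consistent_replica = Replica(primary, prefer_consistency=True)
available_replica = Replica(primary, prefer_consistency=False)

# Simulate a partition, then change the data at the primary.
consistent_replica.link_up = False
available_replica.link_up = False
primary.value = "v2"
```

With the link down and the primary updated to "v2", available_replica.read() still answers, but with the stale "v1", while consistent_replica.read() raises Unavailable. Neither replica can return "v2" until the link is restored.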
Said another way, when the network goes down (so the system has no choice but to tolerate a partition), you must give up either consistency or availability. Either you can’t get to your data (“hey, the network is down”), or you run the risk of causing data inconsistencies (“hey, someone changed the file I had open and I lost my work!”). Work by researchers at U.C. Berkeley and MIT has turned this conjecture into a proven theorem. In a nutshell, there’s no free lunch with distributed data access.

Try as they may…

Certain vendors of file caching systems would like you to believe there is no CAP theorem. Since a lack of availability is easy to see, these vendors have typically chosen some kind of availability gain at the expense of consistency or partition-tolerance. The designers then hope that problems are rare enough to be ignored or explained away. Armed with knowledge of the published impossibility results in the scientific literature, it is possible to play detective on these systems and find where the hidden problems are.

File Cache A

File cache system A claims to support disconnected operation. However, system A attempts to do so for only about 1 minute. After that 1-minute window, the remote sites are unable to use the files, which means that the system is no longer available. Even during that 1-minute window, system A supports only limited availability (reads on files that are already open). System A can also be operated in a mode where consistency is guaranteed, but in that mode it does not support disconnected operation and offers little or no performance gain.

File Cache T

File cache system T also claims to support disconnected operation. In the T system, file system locks are held by the server-side unit so as to preserve consistency on behalf of a remote client-side unit.
Unfortunately, a partition that separates the clients using those files from the server for any significant length of time makes it impossible to release the locks at the server. The files are effectively held hostage by the remote client-side unit and are therefore not available to any other user of the system. Meanwhile, writes are buffered at the client-side unit in the hope of executing them later, when the partition is repaired. If the client-side unit fails during a partition, writes to files can be permanently lost even though the user believes the data has been successfully saved.
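Both failure modes described for system T can be reproduced in a toy model. The sketch below is an assumption-laden illustration, not the actual design of any product: Server, ClientUnit, link_up, pending, and crash are hypothetical names, locks are a simple path-to-owner table, and buffered writes live only in the client-side unit's memory.

```python
class Server:
    """Server-side unit: holds the files and the locks taken for remote units."""

    def __init__(self):
        self.files = {}
        self.locks = {}  # path -> name of the owner holding the lock

    def lock(self, path, owner):
        # Grant the lock only if it is free or already held by this owner.
        if self.locks.get(path, owner) != owner:
            return False
        self.locks[path] = owner
        return True

    def unlock(self, path, owner):
        if self.locks.get(path) == owner:
            del self.locks[path]


class ClientUnit:
    """Client-side unit at a remote site; buffers writes while partitioned."""

    def __init__(self, name, server):
        self.name = name
        self.server = server
        self.link_up = True  # set to False to simulate a partition
        self.pending = []    # buffered writes, assumed to be held in memory only

    def save(self, path, data):
        if self.link_up:
            self.server.files[path] = data
        else:
            # Partitioned: queue the write and hope to replay it later.
            self.pending.append((path, data))
        return "saved"  # the user sees success in both cases

    def crash(self):
        # The unit fails during the partition: buffered writes are gone.
        self.pending.clear()


server = Server()
server.files["report.doc"] = "draft 1"
branch = ClientUnit("branch", server)
server.lock("report.doc", branch.name)  # lock held on behalf of the branch

branch.link_up = False                  # the partition begins
status = branch.save("report.doc", "draft 2")
branch.crash()                          # the unit fails before the link returns
```

After this sequence, status is "saved" but the server still holds "draft 1", and a call such as server.lock("report.doc", "hq") returns False because the branch's lock was never released. The user's work is lost and the file stays unavailable to everyone else.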