Despite many remarkable advances in other areas of business automation, the automated processing and matching of personal names in databases have languished for decades without significant theoretical or practical progress. The purpose of this paper is to highlight the issues, requirements, and technologies available for advanced automated name recognition.
The problem to be solved is a familiar one for many people: a name is entered in one database with the surname "Rodgers," and in a different database as "Rogers." A person's name is recorded as "Dayton," but should actually be spelled "Deighton." The problem is greatly compounded with names originating outside North America. For example, the same Chinese person may have one set of information recorded under the surname "Xue," and another under the surname "Hsueh."
The earliest attempt at coping with name variation was the Russell Soundex matching algorithm, developed around 1910 as an aid in the manual analysis of U.S. Census records. The original Soundex method of generating "keys" was later implemented as a software-based algorithm, and is today the most widely used alternative to exact matching when names are involved in automated search and retrieval systems. Over the years, there have been many attempts to improve on Soundex, but they are all still key-based systems and, therefore, suffer from the same fundamental deficiencies that plague Soundex.
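To make the key-based idea concrete, here is a minimal sketch of the commonly published American Soundex rules (first letter kept, remaining consonants mapped to digits, adjacent duplicate codes collapsed, key padded to four characters). The function name `soundex` and the lookup table `CODES` are illustrative choices, not part of any product described in this paper.

```python
# Consonant groups and their Soundex digits; vowels, Y, H, and W carry no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex key for a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (key + "000")[:4]


for pair in (("Robert", "Rupert"), ("Rodgers", "Rogers"),
             ("Dayton", "Deighton"), ("Xue", "Hsueh")):
    print(pair, [soundex(n) for n in pair])
```

Under these rules "Robert" and "Rupert" share the key R163, but each of the name pairs cited above receives two different keys (R326 and R262, D350 and D235, X000 and H200), so a lookup by key alone would not link them.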
While it is certainly compact and efficient, the key-based approach falls well short of solving many of the problems associated with searching for names. Two extensive studies examined the results of the basic Soundex algorithm, using statistical measures to gauge accuracy.
- Study #1 Results: Only 33% of the matches that would be returned by Soundex would be correct. Even more significant was the finding that fully 25% of correct matches would fail to be discovered by Soundex. (Alan Stanier, September 1990, Computers in Genealogy, Vol. 3, No. 7)
- Study #2 Results: Only 36.37% of the matches returned by Soundex were correct, while more than 60% of correct names were never returned by Soundex. (A.J. Lait and B. Randell, 1996)
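The two figures reported by each study correspond to what information retrieval calls precision (the share of returned matches that are correct) and recall (the share of correct matches that are returned). The sketch below shows how the two measures are computed. The record identifiers are invented toy data, not data from either study, and the function name `precision_recall` is an illustrative choice.

```python
def precision_recall(returned: set, relevant: set) -> tuple:
    """Compute (precision, recall) for one search result.

    returned: record ids the matcher returned
    relevant: record ids that are truly the same name
    """
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: three records returned, only one of them correct,
# and one correct record never returned.
returned = {"r1", "r2", "r3"}
relevant = {"r1", "r4"}
precision, recall = precision_recall(returned, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In these terms, Study #1 reports a precision of about 33% and a recall of about 75%, since 25% of the correct matches were missed.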
Obviously, for mission-critical federal applications such as terrorist watch-lists, INS tracking, visa applications, and fraud detection, failing to identify 25% to 60% of target names within a database is unacceptable. The Federal Government recognized this deficiency and worked with IBM Global Name Recognition over the past two decades to develop advanced technology for improving performance across multiple cultures. This approach hinges on the latest advances in computational linguistics: the application of statistics, mathematics, linguistics research, and computational expertise to the problem of name matching. This approach is now also available for commercial organizations.
IBM Global Name Recognition technology is the only patented name-searching software since Soundex.
Elements of the IBM High-Precision Name Matching System
In order to meet the challenges posed by large, multi-cultural databases in which both predictable and random name-spelling variations are present in a significant number of records, an IBM Global Name Recognition solution provides:
1. Culture-specific matching criteria. Naming systems differ significantly from one culture to the next in the relative order in which parts of a name appear, in the consistency with which they are written in romanized form, in the way they are abbreviated, and in which parts are considered mandatory for identification. To identify all potential matches accurately, IBM technologies must first determine a name's culture of origin. Such knowledge allows the correct set of matching techniques to be applied to the name. IBM accomplishes cultural identification automatically, adding speed and consistency that humans cannot be expected to provide.
2. Automatic application of linguistic rules for the culture/language context. A full name must be parsed, and possible word order variations and shortened forms must be identified. Spelling variants for each part of the name are calculated. There are many possible approaches to this step: rule-based, algorithmic, statistical/probabilistic, or combinations of these. Furthermore, variants may be based on either phonetic (pronunciation) or alphabetic similarity. IBM has accumulated over 750,000,000 names from every country in the world. These names are used to provide the automated statistical and linguistic methods required for accurate name matching.
3. Noise tolerance (e.g., typographical errors). Once culture-specific knowledge has been used to isolate and align those portions of the name to be compared, the character-level comparisons take into consideration the possibility of random keying errors, which correspond to no orthographic or phonological principle.
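One generic way to make a character-level comparison tolerant of keying errors is an edit distance that counts insertions, deletions, substitutions, and swaps of adjacent letters. The sketch below uses the optimal string alignment variant of Damerau-Levenshtein distance. It is a textbook technique shown for illustration only, not a description of how the IBM software works, and the names `osa_distance` and `similarity` are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance allowing insert, delete, substitute, and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Adjacent transposition, a common keying error ("Simth").
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to a 0..1 score, ignoring letter case."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - osa_distance(a, b) / longest


print(osa_distance("smith", "simth"))      # one adjacent swap
print(similarity("Rogers", "Rodgers"))     # one inserted letter out of seven
```

A matcher built this way would accept a pair when the score exceeds a chosen threshold. The threshold is a tuning decision that trades missed matches against false matches.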