Fuzzy matching city names

Recently, my team was tasked with building a system that would allow users to look up weather forecasts for arbitrary cities in the United States. The system had to allow for misspellings to give the user a forgiving but predictable interface. If the misspelling clearly matched a single city, it would use that city; otherwise it would present the user a list of options. This article describes the techniques used to perform this task.

When building an interface that can handle misspellings, a brute force approach might be to try to anticipate the types of mistakes that users might make. This quickly becomes untenable. Imagine doing this for all 29,000+ cities in the United States. You might focus on the most common cities, but that doesn’t really accomplish what you want.

PHP provides a number of powerful functions that will help you with this:

  • soundex() – generates a 4-character pronunciation-based key; in theory, two strings that are pronounced the same will have the same key
  • metaphone() – uses rules of English pronunciation to generate a variable-length key; in theory, two strings that are pronounced the same will have the same key
  • levenshtein() – calculates the “distance” between two strings (defined as the number of insertions, deletions, and substitutions required to transform one string into another)
  • similar_text() – returns the number of matching characters in two strings, and optionally, a percentage similarity between them.

We can use a combination of these functions to build a fuzzy matching algorithm that is fairly robust and straightforward to implement.

A basic algorithm

Let’s work under the assumption that our users will make two types of misspellings:

  • “best guess” spellings based on pronunciation when the real spelling is unknown (in this case if the guess is good, the pronunciation-based keys should be the same)
  • single character omissions and substitutions due to inaccurate keyboarding (in many of these cases, the pronunciation may not be altered enough to change the key)

Let’s further assume that our users will input city names in one of two forms:

  • Form 1: City, State Abbreviation
  • Form 2: City (with no state)

Our algorithm might look like this:

(all string comparisons should be done case-insensitively, of course)

The caller of this routine must be prepared to accept one or more cities with similarity scores; it can then filter the result set using a pre-defined threshold of similarity to weed out overly obscure matches.

Optimizing for performance

What is the primary weakness of this approach? If not implemented carefully, it can be very computationally expensive. You do not want to have to perform a linear scan of the city names and compute their metaphone keys at lookup time. Instead, you should precompute and store the metaphone keys in the database.

By precomputing and storing the keys, you are able to transform this fuzzy search problem into one of exact string matching, something that database systems excel at doing. You are essentially matching three strings: state abbreviation, city name, and metaphone of city name. By applying an index on all three columns, you can execute extremely fast fuzzy searches.

Extending the algorithm

You could extend this algorithm to make use of both soundex() and metaphone(), which might be helpful, as each has different strengths and weaknesses. Neither is perfect at extracting the pronunciation of every English word (and of course, many U.S. city names are not English in their derivation anyway). So instead of matching on just one column, you could match on either the soundex key or the metaphone key. This will broaden the result set, but that’s OK, as your string similarity filter will tighten things up anyway.

In practice, we also found that our database’s encoding of city names was a bit restrictive. For instance, New York City was stored in our database as “New York”. A user entering “New York City” or “NYC” would not match the soundex or metaphone keys for “New York” at all. So we do a little pre-conditioning of the user input, looking for colloquial versions of city names (e.g. “Vegas” instead of “Las Vegas”). After doing a little search-and-replace, we send the input through the matching algorithm as described above.

Leave a comment

Your email address will not be published. Required fields are marked *