An implementation of the Soundex Algorithm in Python

February 14th, 2010 by Kevin | Posted under programming.

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. It can be used to identify names that sound alike but are spelled differently. Microsoft SQL Server and Oracle both provide a built-in SOUNDEX function for finding similar-sounding names.

NAME        SOUNDEX SIGNATURE
Smith       S530
Smythe      S530
Schultz     S243
Shultz      S432

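Here, “Smith” and “Smythe” are spelled differently but pronounced the same, so both receive the signature “S530” and a search for either name finds the other. “Schultz” and “Shultz”, by contrast, receive different signatures under the steps below, because the C in “Schultz” is encoded as 2 while the H in “Shultz” is skipped.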
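As a rough illustration of how such a search might work, the sketch below indexes names by signature and looks up everything that shares a code. The signatures are copied by hand from the table above, and the names `signatures`, `index` and `sounds_like` are made up for this example.

```python
from collections import defaultdict

# Signatures copied from the table above (not computed here).
signatures = {"Smith": "S530", "Smythe": "S530", "Schultz": "S243", "Shultz": "S432"}

# Index the names by signature so that a search compares codes, not spellings.
index = defaultdict(list)
for name, code in signatures.items():
    index[code].append(name)


def sounds_like(name):
    """Return the stored names whose signature equals that of name."""
    return index[signatures[name]]


print(sounds_like("Smith"))  # ['Smith', 'Smythe']
```

A real lookup would compute the signature of the search term instead of reading it from a table, but the comparison on codes stays the same.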

The steps to implement the algorithm are as follows:

  • Ignore all characters in the string being encoded except for the English letters, A to Z.

  • The first letter of the Soundex code is the first letter of the string being encoded.

  • After the first letter in the string, do not encode vowels or the letters H, W and Y. These letters may affect the code by being present but are not encoded directly.

  • Assign a numeric digit between one and six to all letters except the first using the following mappings:

    • 1: B, F, P, V

    • 2: C, G, J, K, Q, S, X, Z

    • 3: D, T

    • 4: L

    • 5: M, N

    • 6: R

  • Where any adjacent digits are the same, remove all but one of those digits unless a vowel, H, W or Y was found between them in the original text.

  • Force the code to be four characters in length by padding with zero characters or by truncation.
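To make the steps above concrete, here is a small sketch that records the intermediate result of each one for a single name. The helper `trace_soundex` and the `CODES` table are hypothetical names used only for this illustration, and the sketch assumes the name contains at least one letter.

```python
# Letter-to-digit table built from the mappings listed in the steps.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def trace_soundex(name):
    """Return (label, value) pairs showing each stage of the encoding."""
    # Keep only the English letters A to Z.
    letters = "".join(ch for ch in name.upper() if "A" <= ch <= "Z")
    first, rest = letters[:1], letters[1:]
    # Map each remaining letter to a digit, or "." for vowels, H, W and Y.
    mapped = "".join(CODES.get(ch, ".") for ch in rest)
    # Collapse runs of identical symbols; only adjacent duplicates are removed.
    collapsed = ""
    for symbol in mapped:
        if not collapsed or symbol != collapsed[-1]:
            collapsed += symbol
    # Drop the "." markers, then truncate or zero-pad to four characters.
    digits = collapsed.replace(".", "")
    code = (first + digits)[:4].ljust(4, "0")
    return [("letters", letters), ("mapped", mapped),
            ("collapsed", collapsed), ("code", code)]


for label, value in trace_soundex("Smythe"):
    print(label, value)
```

For “Smythe” the stages are SMYTHE, then 5.3.. for the letters after the S, then 5.3. once adjacent duplicates are collapsed, and finally S530.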

Here is an implementation in Python:

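def get_soundex(name):
	"""Return the four-character Soundex code for name ("" if it has no letters)."""
	# Keep only the English letters A to Z, in upper case.
	name = "".join(ch for ch in name.upper() if "A" <= ch <= "Z")
	if not name:
		return ""

	# Letter groups and their digits. "." is a placeholder for letters that
	# are not encoded but still separate otherwise adjacent equal digits.
	groups = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6", "AEIOUHWY": "."}
	codes = {letter: digit for letters, digit in groups.items() for letter in letters}

	# The first letter is kept as it is and is not given a digit.
	soundex = name[0]

	for char in name[1:]:
		code = codes[char]
		# Skip a code that repeats the one immediately before it.
		if code != soundex[-1]:
			soundex += code

	# Remove the placeholders, then truncate or zero-pad to four characters.
	soundex = soundex.replace(".", "")
	return soundex[:4].ljust(4, "0")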

if __name__ == '__main__':
	names = ["Smith", "Smythe", "Robert", "Rupert", "Schultz", "Shultz"]

	print("NAME\t\tSOUNDEX")
	for name in names:
		print("%s\t\t%s" % (name, get_soundex(name)))
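The “.” entry for “AEIOUHWY” in the code above is what lets equal digits survive when a vowel, H, W or Y sits between them. The sketch below is a stripped-down variant written only to show that effect. The function `encode` and its `use_placeholder` flag are invented for this comparison and are not part of the implementation above.

```python
DIGITS = {letter: digit
          for letters, digit in {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3",
                                 "L": "4", "MN": "5", "R": "6"}.items()
          for letter in letters}


def encode(name, use_placeholder):
    """Encode name; optionally record vowels, H, W and Y as "." placeholders."""
    name = "".join(ch for ch in name.upper() if "A" <= ch <= "Z")
    result = name[:1]
    for ch in name[1:]:
        # Without the placeholder, non-encoded letters leave no trace at all.
        code = DIGITS.get(ch, "." if use_placeholder else "")
        # Append only if the symbol differs from the one just before it.
        if code and code != result[-1]:
            result += code
    return result.replace(".", "")[:4].ljust(4, "0")


print(encode("Moses", use_placeholder=True))   # M220
print(encode("Moses", use_placeholder=False))  # M200
```

Without the placeholder, the two S sounds in “Moses” look adjacent once the E is skipped, so the second one is wrongly dropped.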

You can also find an implementation in C#.

Comments

4 Responses to “An implementation of the Soundex Algorithm in Python”
  1. mattia says:

    If you just replace “.” with “” at the end, why not simply leave the key “AEIOUHWY” out of the dictionary?

    Kevin Reply:

    “.” is inserted only as a placeholder, so that duplicate codes are removed only when they are adjacent and not separated by one of “AEIOUHWY”. Once that is done, the “.” placeholder is removed by replacing it with “”.

    As the algorithm above states: where any adjacent digits are the same, remove all but one of those digits unless a vowel, H, W or Y was found between them in the original text.

    mattia Reply:

    Ok, got it. Thanks.
