An implementation of the Soundex Algorithm in Python

February 14th, 2010 by Kevin | 4 Comments | Filed in programming

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The Soundex heuristic can be used for identifying names that sound alike but are spelled differently. The Soundex algorithm is a standard feature of MS SQL and Oracle database management systems to search for similar sounding names.

NAME SOUNDEX SIGNATURE
Smith S530
Smythe S530
Schultz S243
Shultz S432

Here if we are searching for the name “Smith” or “Smythe”, while the words are different, the pronunciation is the same. The Soundex signature for both is the same “S530”.

Following are the steps to implement the algorithm:

  • Ignore all characters in the string being encoded except for the English letters, A to Z.

  • The first letter of the Soundex code is the first letter of the string being encoded.

  • After the first letter in the string, do not encode vowels or the letters H, W and Y.

  • Assign a numeric digit between one and six to all letters except the first using the following mappings:

    • 1: B, F, P or V

    • 2: C, G, J, K, Q, S, X, Z

    • 3: D, T

    • 4: L

    • 5: M, N

    • 6: R

  • Where any adjacent digits are the same, remove all but one of those digits unless a vowel, H, W or Y was found between them in the original text.

  • Force the code to be four characters in length by padding with zero characters or by truncation.

Here is an implementation in Python:

def get_soundex(name):
	"""Get the soundex code for the string"""
	name = name.upper()

	soundex = ""
	soundex += name[0]

	dictionary = {"BFPV": "1", "CGJKQSXZ":"2", "DT":"3", "L":"4", "MN":"5", "R":"6", "AEIOUHWY":"."}

	for char in name[1:]:
		for key in dictionary.keys():
			if char in key:
				code = dictionary[key]
				if code != soundex[-1]:
					soundex += code

	soundex = soundex.replace(".", "")
	soundex = soundex[:4].ljust(4, "0")				

	return soundex

if __name__ == '__main__':
	list = ["Smith", "Smythe", "Robert", "Rupert", "Schultz", "Shultz"]

	print("NAME\t\tSOUNDEX")
	for name in list:
		print("%s\t\t%s" % (name, get_soundex(name)))

You can also find an implementation in C#.

  • Ignore all characters in the string being encoded except for the English letters, A to Z.

  • The first letter of the Soundex code is the first letter of the string being encoded.

  • After the first letter in the string, do not encode vowels or the letters H, W and Y. These letters may affect the code by being present but are not encoded directly.

  • Assign a numeric digit between one and six to all letters except the first using the following mappings:

    • 1: B, F, P or V

    • 2: C, G, J, K, Q, S, X, Z

    • 3: D, T

    • 4: L

    • 5: M, N

    • 6: R

  • Where any adjacent digits are the same, remove all but one of those digits unless a vowel, H, W or Y was found between them in the original text. The C# code described below uses a temporary placeholder for these non-encodable letters to avoid incorrect removal of adjacent, matching digits.

  • Force the code to be four characters in length by padding with zero characters or by truncation.

 

Related Posts:

Tags: ,

Can you rotate a vector in memory using minimal space and time?

February 10th, 2010 by Kevin | 1 Comment | Filed in programming

Continuing from my study of Programming Pearls, this is a problem which occurs when you want to swap memory with time and space constraints. An example of this is in case of text drag and drop, where the lines of text have to be swapped quite efficiently.

The problem is specified as:

Rotate a one-dimensional vector of n elements left by i positions. For instance, with n=8 and i=3, the vector abcdefgh is rotated to defghabc. Simple code uses an n-element intermediate vector to do the job in n steps. Can you rotate the vector in time proportional to n using only a few dozen bytes of storage?

Let us analyze the possible ways to solve this problem.

Solution 1:

  1. Copy the first i elements of the vector x to a temporary vector.
  2. Move the remaining elements of the vector (i to n) left by i positions.
  3. Copy the i elements from the temporary vector back to the end of the original vector x.
  4. The vector x is now rotated left by i positions.
#define N 10

int copyrotate(char* vector, int i)
{
   char temp[N] = "";

   memcpy(temp, vector, i);
   memcpy(vector, &vector[i], N - i);
   memcpy(&vector[N-i], temp, i);
   return 0;
}

This method requires a temporary vector which should have a size equal to the original. Hence it is space expensive. One way to consume less space is to rotate the vector by one element at a time and call this i times. However this will make it time expensive (n x i times).

Solution 2:

This method is based on the concept that rotation of elements of a vector is nothing but the reversal of the parts of the vector. Suppose we need to rotate vector x left by i positions. This vector can be represented by ab where a is the range (0, i-1) and b is the range (i, n-1). The vector x can be rotated left by i positions as follows:

  1. Reverse the part a of the vector containing the range (0, i-1).
  2. Reverse the part b of the vector containing the range (i, n-1).
  3. Now reverse the entire vector from (0, n-1).
  4. The vector x is now rotated left by i positions.
#define N 10

int reverse(char* vector, int lower, int upper)
{
   char temp;
   while(lower < upper)
   {
      temp = vector[lower];
      vector[lower] = vector[upper];
      vector[upper] = temp;
      lower++;
      upper--;
   }
   return 0;
}

int reverserotate(char* vector, int i)
{
   reverse(vector, 0, i-1);
   reverse(vector, i, N-1);
   reverse(vector, 0, N-1);
   return 0;
}

This solution has the advantage that it rotates the vector proportional to n. So it has reduced time and space requirements.

You can find the implementations for both the solutions here.

It is mentioned that Brian Kernighan and P. J. Plauger used this code in their 1981 Software Tools in Pascal to move lines within a text editor which was much simpler and bug free than one based on linked lists.

Do you agree that the second solution uses minimal time and space for this problem?

 

Related Posts:

Tags: ,

Can you sort 10 million numbers using 1MB of memory?

February 6th, 2010 by Kevin | 32 Comments | Filed in programming

For those of you who have not read the book, Programming Pearls (which you should), this could be new.

The problem is specified as:

Input: A file containing at most n positive integers, each less than n, where n=107. It is a fatal error if any integer occurs twice in the input. No other data is associated with the integer.

Output: A sorted list in increasing order of the input integers.

Constraints: At most  1 MB of storage is available in main memory; ample disk storage is available. The run time can be at most several minutes; a run time of ten seconds need not be decreased.

So how to solve this problem?

Solution 1:

The first and only solution I could think of was use some form of merge sort because merge sort uses a divide and merge technique which is naturally suited to use secondary memory to place the results of the partial sorting that it does. However implementing this is not a simple thing to do.

Solution 2:

Since we can use about 1MB of memory, we can process 250,000 integers at a time. So we divide the input numbers into range (0-24999, 250000-499999, …). The process to follow:

    1. Read the input file and get the numbers from a range into main memory.
    2. Sort the numbers in the range.
    3. Write the sorted numbers into the output file.
    4. Repeat this process till we have covered all the numbers in the file.
    5. The output file will contain the sorted numbers.

      The drawback of this method is that it requires you to read the input file about 40 times. So this method will require processing time in minutes due to I/O.

      Here is a sample implementation of this solution. Use records.c to generate the records.dat file which is used as the input file for rangesort.c.

      Solution 3:

      Since we know that each number is unique and within a range of 10 million numbers, we can use a bitmap vector to represent the numbers using one bit for each.

      Eg: If we have numbers 1, 5, 3, 10, … they will be represented by a bitmap vector 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, …

      1. Initialize the bitmap vector with zeroes.
      2. Read the input file and fill the bitmap vector, setting the bit to 1 for the corresponding number which we read.
      3. Read the bitmap vector and for each bit set to 1, write the corresponding number into the output file.
      4. The output file will contain the sorted numbers.

      Get a sample implementation of this solution here. Use records.c to generate the records.dat file which is used as the input file for bitmapsort.c.

      Looking forward to reading and trying out more of such interesting problems from Programming Pearls.

      This problem was due to a stringent constraint on main memory usage. Would you solve this problem any other way? Do you feel such constraints are actually faced by today’s computing devices?

       

      Related Posts:

      Tags: ,