RealWorldDevelopers
5/25/2016 - 1:23 AM

SoundEx Algorithm

SoundEx Algorithm

Module   modSoundex  

      ''  Soundex is an algorithm that converts words to an encoded string based on the way the word is pronounced.  
      ''  This allows you to compare words based on pronunciation instead of binary matches.  
      ''  
      ''  For example, Zach and Zack are pronounced exactly the same way.  
      ''  However, the string "Zach" does not equal the string "Zack"  
      ''  But with Soundex have the same encoding:  
      ''          Zach encodes as "Z200"  
      ''          Zack also encodes as "Z200"  
      ''  
      ''  This works by breaking the string down into individual characters and assigning each character a value.  
      ''  The rules for the algorithm are shown below:  
      ''  
      ''  
      ''  1. The first letter of the encoding will always be the first letter of the string  
      ''  2. Ignore the letters a, e, h, I, o, u, w, and y except in rule #1  
      ''  3. Use the table below to assign the remaining letters their encoding:  
      ''  
      ''       a. "1" should be assigned to b, f, p, and v  
      ''       b. "2" should be assigned to c, g, j, k, q, s, x, and z  
      ''       c. "3" should be assigned to d and t  
      ''       d. "4" should be assigned to l  
      ''       e. "5" should be assigned to m and n  
      ''       f. "6" should be assigned to r  
      ''  
      ''  4. Any letters with the same encoding that appear subsequently to each other should be ignored.  
      ''       For instance the sequence "BB" would produce the encoding "1".  
      ''       The sequence "BAB" would also produce the encoding "1" since A is an ignored character.  
      ''  5. The total length of the encoding must always be four.  
      ''       If you have more than four characters in your encoding, trim the extra characters.  
      ''       If you have less than four characters pad the encoding with zeroes to bring the length to four.  
      ''  
      ''  Examples using this algorithm:  
      ''       Donald - D543  
      ''       Zach - Z200  
      ''       Campbel - C514  
      ''       Cammmppppbbbeeelll - C514  
      ''       David - D130  
      ''  
      ''  Another thing to note is that Soundex is NOT case sensitive.  
      ''  So "ZACH" and "zAcH" both produce the same encoding.  

      ''' <summary>  
      ''' Soundex is an algorithm that converts words to an encoded string based on the way the word is pronounced.  
      ''' </summary>  
      ''' <param name="sData">Data to Encode as String</param>  
      ''' <returns>Encoded Value as String</returns>  
      ''' <remarks>Algorithm Matches SQL</remarks>  
      Public    Function   SoundEx(   ByVal   sData   As    String   )   As    String  

         Dim   sb   As   Text.StringBuilder   =    New   StringBuilder()  
         Dim   result   As    String    =    String   .Empty  

         If   sData IsNot   Nothing    AndAlso   sData.Length   >   0   Then  

            Dim   prevCode   As    String    =    ""  
            Dim   curCode   As    String    =    ""  
            Dim   curLetter   As    String    =    ""  

         sb.Append(sData.Substring(0, 1))  

            For   i   As    Integer    =   1   To   sData.Length  

            curLetter   =   sData.Substring(i, 1).ToLower()  
            curCode   =    ""  

               If    "bfpv"   .IndexOf(curLetter)   >    -   1   Then    '1 should be assigned to b, f, p, and v  
               curCode   =    "1"  
               ElseIf    "cgjkqsxz"   .IndexOf(curLetter)   >    -   1   Then    '2 should be assigned to c, g, j, k, q, s, x, and z  
               curCode   =    "2"  
               ElseIf    "dt"   .IndexOf(curLetter)   >    -   1   Then    '3 should be assigned to d and t  
               curCode   =    "3"  
               ElseIf   curLetter   =    "l"    Then    '4 should be assigned to l  
               curCode   =    "4"  
               ElseIf    "mn"   .IndexOf(curLetter)   >    -   1   Then    '5 should be assigned to m and n  
               curCode   =    "5"  
               ElseIf   curLetter   =    "r"    Then    '6 should be assigned to r  
               curCode   =    "6"  
               Else  
                  'ignore all other chars  
               End If  

               'no repeats  
               If   curCode   <>   prevCode   Then   sb.Append(curCode)  

               'quit at 4 chars  
               If   sb.Length   =   4   Then    Exit    For  

               'track last value  
               If   curCode   <>    ""    Then   prevCode   =   curCode  

            Next  


         End If  

         'tally result  
      result   =   sb.ToString.Length  

         'if not 4 long then pad right with zero  
         Return   result.PadRight(4,   "0"   ).ToUpper  

   End   Function  

      ''    Calculating the difference between two Soundex codes  
      ''  
      ''    There are times when similar words do not have the same Soundex encoding.  
      ''    Take, for example, the words "Lake" and "Bake".  
      ''    These words sound very similar but the encoding is not the exactly the same.  
      ''    Because of this, Microsoft SQL Server includes a function called "Difference" that will give you a ranking on how similar two words are.  
      ''    The rank is from four to one, with four being a perfect match and one representing no match at all.  
      ''  
      ''    Unlike the Soundex algorithm, the Difference function does not use a published formula to determine the ranking.  
      ''    These are rules I have developed, and they seem to closely match the rules used by the Difference function in SQL:  
      ''  
      ''       1. Run both words through Soundex  
      ''       2. If the Soundex encodings are the same the rank will be 4. If the Soundex encodings are not the same, continue to step 3.  
      ''       3. If the last three characters of the first encoding are found in the second encoding the rank will be 3. Skip to step 7.  
      ''       4. If the last two characters of the first encoding are found in the second encoding the rank will be 2. Skip to step 7.  
      ''       5. If the middle two characters of the first encoding are found in the second encoding, the rank will be 2. Skip to step 7.  
      ''       6. If the second, third, or fourth characters of the first encoding are found within the second encoding add 1 to the rank for each character that matches.  
      ''       7. If the first character of the first encoding matches the first character of the second encoding, add 1 to the current rank.  
      ''  
      ''    These rules are approximate, but they produce very similar results to the SQL Difference function.  
      ''    Several example results are shown below, all off which return the same result in SQL:  
      ''       Zach & Zac - 4  
      ''       Lake & Bake - 3  
      ''       Brad & Lad - 2  
      ''       Horrible & Great - 1  

      ''' <summary>  
      ''' Calculating the difference between two Soundex codes  
      ''' </summary>  
      ''' <param name="sData1">Control Data as String</param>  
      ''' <param name="sData2">Data to Compare as String</param>  
      ''' <returns>Rank as Integer</returns>  
      ''' <remarks>The rank is from four to one, with four being a perfect match and one representing no match at all.</remarks>  
      Public    Function   Difference(   ByVal   sData1   As    String   ,   ByVal   sData2   As    String   )   As    Integer  

         Dim   result   As    Integer    =   0  

         'Run both words through Soundex  
         Dim   soundex1   As    String    =   SoundEx(sData1)  
         Dim   soundex2   As    String    =   SoundEx(sData2)  

         'If the Soundex encodings are the same the rank will be 4  
         If   soundex1.CompareTo(soundex2)   =   0   Then  
         result   =   4  
         Else  
            Dim   sub1   As    String    =   soundex1.Substring(1, 3)  
            Dim   sub2   As    String    =   soundex1.Substring(2, 2)  
            Dim   sub3   As    String    =   soundex1.Substring(1, 2)  
            Dim   sub4   As    String    =   soundex1.Substring(1, 1)  
            Dim   sub5   As    String    =   soundex1.Substring(2, 1)  
            Dim   sub6   As    String    =   soundex1.Substring(3, 1)  

            If   soundex2.IndexOf(sub1)   >    -   1   Then    'If the last three characters of the first encoding are found in the second encoding the rank will be 3  
            result   =   3  
            ElseIf   soundex2.IndexOf(sub2)   >    -   1   Then    'If the last two characters of the first encoding are found in the second encoding the rank will be 2  
            result   =   2  
            ElseIf   soundex2.IndexOf(sub3)   >    -   1   Then    'If the middle two characters of the first encoding are found in the second encoding, the rank will be 2  
            result   =   2  
            Else  
               'If the second, third, or fourth characters of the first encoding  
               'are found within the second encoding  
               'add 1 to the rank for each character that matches  
               If   soundex2.IndexOf(sub4)   >    -   1   Then   result   +=   1  
               If   soundex2.IndexOf(sub5)   >    -   1   Then   result   +=   1  
               If   soundex2.IndexOf(sub6)   >    -   1   Then   result   +=   1  
            End If  

            'If the first character of the first encoding matches the first character of the second encoding, add 1 to the current rank  
            If   soundex1.Substring(0, 1)   =   soundex2.Substring(0, 1)   Then   result   +=   1  

         End If  

         'lowest rank is 1  
         If   result   =   0   Then   result   +=   1  
       
         Return   result  

      End Function  


 End   Module