SoundEx Algorithm
Module modSoundex
'' Soundex is an algorithm that converts words to an encoded string based on the way the word is pronounced.
'' This allows you to compare words based on pronunciation instead of binary matches.
''
'' For example, Zach and Zack are pronounced exactly the same way.
'' However, the string "Zach" does not equal the string "Zack"
'' But with Soundex have the same encoding:
'' Zach encodes as "Z200"
'' Zack also encodes as "Z200"
''
'' This works by breaking the string down into individual characters and assigning each character a value.
'' The rules for the algorithm are shown below:
''
''
'' 1. The first letter of the encoding will always be the first letter of the string
'' 2. Ignore the letters a, e, h, I, o, u, w, and y except in rule #1
'' 3. Use the table below to assign the remaining letters their encoding:
''
'' a. "1" should be assigned to b, f, p, and v
'' b. "2" should be assigned to c, g, j, k, q, s, x, and z
'' c. "3" should be assigned to d and t
'' d. "4" should be assigned to l
'' e. "5" should be assigned to m and n
'' f. "6" should be assigned to r
''
'' 4. Any letters with the same encoding that appear subsequently to each other should be ignored.
'' For instance the sequence "BB" would produce the encoding "1".
'' The sequence "BAB" would also produce the encoding "1" since A is an ignored character.
'' 5. The total length of the encoding must always be four.
'' If you have more than four characters in your encoding, trim the extra characters.
'' If you have less than four characters pad the encoding with zeroes to bring the length to four.
''
'' Examples using this algorithm:
'' Donald - D543
'' Zach - Z200
'' Campbel - C514
'' Cammmppppbbbeeelll - C514
'' David - D130
''
'' Another thing to note is that Soundex is NOT case sensitive.
'' So "ZACH" and "zAcH" both produce the same encoding.
''' <summary>
''' Soundex is an algorithm that converts words to an encoded string based on the way the word is pronounced.
''' </summary>
''' <param name="sData">Data to Encode as String</param>
''' <returns>Encoded Value as String</returns>
''' <remarks>Algorithm Matches SQL</remarks>
Public Function SoundEx( ByVal sData As String ) As String
Dim sb As Text.StringBuilder = New StringBuilder()
Dim result As String = String .Empty
If sData IsNot Nothing AndAlso sData.Length > 0 Then
Dim prevCode As String = ""
Dim curCode As String = ""
Dim curLetter As String = ""
sb.Append(sData.Substring(0, 1))
For i As Integer = 1 To sData.Length
curLetter = sData.Substring(i, 1).ToLower()
curCode = ""
If "bfpv" .IndexOf(curLetter) > - 1 Then '1 should be assigned to b, f, p, and v
curCode = "1"
ElseIf "cgjkqsxz" .IndexOf(curLetter) > - 1 Then '2 should be assigned to c, g, j, k, q, s, x, and z
curCode = "2"
ElseIf "dt" .IndexOf(curLetter) > - 1 Then '3 should be assigned to d and t
curCode = "3"
ElseIf curLetter = "l" Then '4 should be assigned to l
curCode = "4"
ElseIf "mn" .IndexOf(curLetter) > - 1 Then '5 should be assigned to m and n
curCode = "5"
ElseIf curLetter = "r" Then '6 should be assigned to r
curCode = "6"
Else
'ignore all other chars
End If
'no repeats
If curCode <> prevCode Then sb.Append(curCode)
'quit at 4 chars
If sb.Length = 4 Then Exit For
'track last value
If curCode <> "" Then prevCode = curCode
Next
End If
'tally result
result = sb.ToString.Length
'if not 4 long then pad right with zero
Return result.PadRight(4, "0" ).ToUpper
End Function
'' Calculating the difference between two Soundex codes
''
'' There are times when similar words do not have the same Soundex encoding.
'' Take, for example, the words "Lake" and "Bake".
'' These words sound very similar but the encoding is not the exactly the same.
'' Because of this, Microsoft SQL Server includes a function called "Difference" that will give you a ranking on how similar two words are.
'' The rank is from four to one, with four being a perfect match and one representing no match at all.
''
'' Unlike the Soundex algorithm, the Difference function does not use a published formula to determine the ranking.
'' These are rules I have developed, and they seem to closely match the rules used by the Difference function in SQL:
''
'' 1. Run both words through Soundex
'' 2. If the Soundex encodings are the same the rank will be 4. If the Soundex encodings are not the same, continue to step 3.
'' 3. If the last three characters of the first encoding are found in the second encoding the rank will be 3. Skip to step 7.
'' 4. If the last two characters of the first encoding are found in the second encoding the rank will be 2. Skip to step 7.
'' 5. If the middle two characters of the first encoding are found in the second encoding, the rank will be 2. Skip to step 7.
'' 6. If the second, third, or fourth characters of the first encoding are found within the second encoding add 1 to the rank for each character that matches.
'' 7. If the first character of the first encoding matches the first character of the second encoding, add 1 to the current rank.
''
'' These rules are approximate, but they produce very similar results to the SQL Difference function.
'' Several example results are shown below, all off which return the same result in SQL:
'' Zach & Zac - 4
'' Lake & Bake - 3
'' Brad & Lad - 2
'' Horrible & Great - 1
''' <summary>
''' Calculating the difference between two Soundex codes
''' </summary>
''' <param name="sData1">Control Data as String</param>
''' <param name="sData2">Data to Compare as String</param>
''' <returns>Rank as Integer</returns>
''' <remarks>The rank is from four to one, with four being a perfect match and one representing no match at all.</remarks>
Public Function Difference( ByVal sData1 As String , ByVal sData2 As String ) As Integer
Dim result As Integer = 0
'Run both words through Soundex
Dim soundex1 As String = SoundEx(sData1)
Dim soundex2 As String = SoundEx(sData2)
'If the Soundex encodings are the same the rank will be 4
If soundex1.CompareTo(soundex2) = 0 Then
result = 4
Else
Dim sub1 As String = soundex1.Substring(1, 3)
Dim sub2 As String = soundex1.Substring(2, 2)
Dim sub3 As String = soundex1.Substring(1, 2)
Dim sub4 As String = soundex1.Substring(1, 1)
Dim sub5 As String = soundex1.Substring(2, 1)
Dim sub6 As String = soundex1.Substring(3, 1)
If soundex2.IndexOf(sub1) > - 1 Then 'If the last three characters of the first encoding are found in the second encoding the rank will be 3
result = 3
ElseIf soundex2.IndexOf(sub2) > - 1 Then 'If the last two characters of the first encoding are found in the second encoding the rank will be 2
result = 2
ElseIf soundex2.IndexOf(sub3) > - 1 Then 'If the middle two characters of the first encoding are found in the second encoding, the rank will be 2
result = 2
Else
'If the second, third, or fourth characters of the first encoding
'are found within the second encoding
'add 1 to the rank for each character that matches
If soundex2.IndexOf(sub4) > - 1 Then result += 1
If soundex2.IndexOf(sub5) > - 1 Then result += 1
If soundex2.IndexOf(sub6) > - 1 Then result += 1
End If
'If the first character of the first encoding matches the first character of the second encoding, add 1 to the current rank
If soundex1.Substring(0, 1) = soundex2.Substring(0, 1) Then result += 1
End If
'lowest rank is 1
If result = 0 Then result += 1
Return result
End Function
End Module