Python Fuzzy Matching Of Names With Only First Initials
Solution 1:
Although @Igle's commment does solve this specific problem, I want to stress that this is a narrow solution that won't necessarily work for everything. Fuzzywuzzy has multiple scorers that use the Levenshtein distance algorithm combined with different logic to compare strings. The default scorer, fuzz.WRatio, compares the matching score of the straight Levenshtein distance algorithm (fuzz.ratio) with other variants, and returns the best match from all of the scorers. There's more to it than just that, including additional logic around weighting the score from different methods, if you're interested I suggest looking at the source code for fuzz.WRatio.
To see what's happening in your case, you can compare the scores for all the choices across scorers by slightly adapting the last lines of your code:
For token_set_ratio:
for i in names:
s = process.extract(i, choices,scorer=fuzz.token_set_ratio)
print(s, i)
[('Gonzalez Ryan', 89), ('Gonzalez Eddy', 89), ('Kody Clemens', 27), ('Kacy Clemens', 27)] Gonzalez, E.
[('Gonzalez Ryan', 89), ('Gonzalez Eddy', 89), ('Kody Clemens', 27), ('Kacy Clemens', 27)] Gonzalez, R.
[('Kody Clemens', 91), ('Kacy Clemens', 82), ('Gonzalez Ryan', 26), ('Gonzalez Eddy', 26)] Clemens, Ko.
[('Kacy Clemens', 91), ('Kody Clemens', 82), ('Gonzalez Ryan', 35), ('Gonzalez Eddy', 26)] Clemens, Ka.
For token_sort_ratio:
for i in names:
s = process.extract(i, choices,scorer=fuzz.token_sort_ratio)
print(s, i)
[('Gonzalez Eddy', 87), ('Gonzalez Ryan', 70), ('Kody Clemens', 27), ('Kacy Clemens', 27)] Gonzalez, E.
[('Gonzalez Ryan', 87), ('Gonzalez Eddy', 70), ('Kody Clemens', 27), ('Kacy Clemens', 27)] Gonzalez, R.
[('Kody Clemens', 91), ('Kacy Clemens', 82), ('Gonzalez Ryan', 26), ('Gonzalez Eddy', 26)] Clemens, Ko.
[('Kacy Clemens', 91), ('Kody Clemens', 82), ('Gonzalez Ryan', 35), ('Gonzalez Eddy', 26)] Clemens, Ka.
Although token_sort_ratio shows a clear winning match, token_set_ratio returns higher scores which is how fuzz.WRatio picks what result it returns. Another major issue is that when you have such similar queries and choices, the order in which they are compared starts to matter. For example, when I run the exact same code as above, but reverse the order of the choices list we get 'Gonzalez Eddy' for both:
for i in names:
s = process.extract(i, choices[::-1],scorer=fuzz.token_set_ratio)
print(s, i)
[('Gonzalez Eddy', 89), ('Gonzalez Ryan', 89), ('Kacy Clemens', 27), ('Kody Clemens', 27)] Gonzalez, E.
[('Gonzalez Eddy', 89), ('Gonzalez Ryan', 89), ('Kacy Clemens', 27), ('Kody Clemens', 27)] Gonzalez, R.
[('Kody Clemens', 91), ('Kacy Clemens', 82), ('Gonzalez Eddy', 26), ('Gonzalez Ryan', 26)] Clemens, Ko.
[('Kacy Clemens', 91), ('Kody Clemens', 82), ('Gonzalez Ryan', 35), ('Gonzalez Eddy', 26)] Clemens, Ka.
I'm guessing that the correct match actually has a higher score, but 'Eddy' and 'Ryan' are close enough to both round to the same final score.
Ways I've dealt with similar issues in the past:
- Use extract instead of extractOne (like I did in the examples above)
- Process the same queries/choices with multiple scorers (ratio, token_set_ratio, token_sort_ratio) and use a weighted average of these scores to pick your best match.
- Adjust the fuzzywuzzy source code to incorporate custom weighting or remove the rounding.
Post a Comment for "Python Fuzzy Matching Of Names With Only First Initials"