Ignore case with difflib.get_close_matches()

Tags: python, difflib

How can I tell difflib.get_close_matches() to ignore case? I have a dictionary of names in a defined format that includes capitalisation. The test string, however, might be fully capitalised or not capitalised at all, and these cases should be treated as equivalent. The results must keep their proper capitalisation, so I can't simply match against a modified (for example, lower-cased) dictionary.

import difflib

names = ['Acacia koa A.Gray var. latifolia (Benth.) H.St.John',
    'Acacia koa A.Gray var. waianaeensis H.St.John',
    'Acacia koaia Hillebr.',
    'Acacia kochii W.Fitzg. ex Ewart & Jean White',
    'Acacia kochii W.Fitzg.']
s = 'Acacia kochi W.Fitzg.'

# base case: proper capitalisation
print(difflib.get_close_matches(s,names,1,0.9))

# this should be equivalent from the perspective of my program
print(difflib.get_close_matches(s.upper(),names,1,0.9))

# capitalising only the first letter won't work either, because the names contain capitals mid-string
print(difflib.get_close_matches(s.upper().capitalize(),names,1,0.9))

Output:

['Acacia kochii W.Fitzg.']
[]
[]

Working code:

Based on Hugh Bothwell's answer, I have modified the code as follows to get a working solution (which should also work when more than one result is returned):

import difflib

names = ['Acacia koa A.Gray var. latifolia (Benth.) H.St.John',
    'Acacia koa A.Gray var. waianaeensis H.St.John',
    'Acacia koaia Hillebr.',
    'Acacia kochii W.Fitzg. ex Ewart & Jean White',
    'Acacia kochii W.Fitzg.']
test = {n.lower(): n for n in names}   # lower-case key -> original name
s1 = 'Acacia kochi W.Fitzg.'   # base case
s2 = 'ACACIA KOCHI W.FITZG.'   # test case

results = [test[r] for r in difflib.get_close_matches(s1.lower(),test,1,0.9)]
results += [test[r] for r in difflib.get_close_matches(s2.lower(),test,1,0.9)]
print(results)

Output:

['Acacia kochii W.Fitzg.', 'Acacia kochii W.Fitzg.']
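
The same lookup can be wrapped in a small helper so that the lower-casing and the mapping back to the original names happen in one place. This is only a sketch of the approach above: the function name get_close_matches_icase is hypothetical (it is not part of difflib), and it assumes no two names differ only by case, since such names would collapse to a single key.

```python
import difflib


def get_close_matches_icase(word, possibilities, n=3, cutoff=0.6):
    """Case-insensitive get_close_matches that returns the original strings.

    Hypothetical helper, not part of difflib. If two possibilities differ
    only by case, the later one overwrites the earlier one in the lookup.
    """
    # Map each lower-cased name back to its properly capitalised original.
    lookup = {p.lower(): p for p in possibilities}
    # Match in lower case; iterating over the dict yields its keys.
    matches = difflib.get_close_matches(word.lower(), lookup, n, cutoff)
    # Translate the lower-case matches back, keeping difflib's ranking.
    return [lookup[m] for m in matches]


names = ['Acacia koa A.Gray var. latifolia (Benth.) H.St.John',
         'Acacia koa A.Gray var. waianaeensis H.St.John',
         'Acacia koaia Hillebr.',
         'Acacia kochii W.Fitzg. ex Ewart & Jean White',
         'Acacia kochii W.Fitzg.']

# Differently capitalised inputs give the same, properly capitalised result.
print(get_close_matches_icase('Acacia kochi W.Fitzg.', names, 1, 0.9))
print(get_close_matches_icase('ACACIA KOCHI W.FITZG.', names, 1, 0.9))
```

Because the results are looked up one by one, asking for more matches (a larger n or a lower cutoff) returns every match in its original capitalisation, best match first.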
asked Jul 08 '12 16:07 by rudivonstaden


1 Answer

I don't see any quick way to make difflib do case-insensitive comparison.

The quick-and-dirty solution seems to be:

  • Make a function that converts a string to some canonical form (for example: upper case, single spaced, no punctuation).

  • Use that function to build a dict of {canonical string: original string} and a list of [canonical string].

  • Run difflib.get_close_matches() on the canonical form of the test string against the canonical-string list, then pass the results through the dict to get the original strings back.
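
The steps above could be sketched roughly as follows. The helper names canonical and close_matches_canonical are made up for illustration, and the particular canonical form (upper case, punctuation replaced by spaces, whitespace collapsed) is just one possible choice; adjust it to whatever differences should be ignored.

```python
import difflib
import re


def canonical(text):
    """Hypothetical canonical form: upper case, no punctuation, single spaced."""
    # Replace anything that is not a letter, digit, underscore or whitespace
    # with a space, so 'W.Fitzg.' becomes 'W FITZG ' rather than 'WFITZG'.
    text = re.sub(r"[^\w\s]", " ", text.upper())
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


def close_matches_canonical(word, possibilities, n=3, cutoff=0.6):
    """Match on canonical forms, but return the original strings."""
    # {canonical string: original string}; if two originals share a
    # canonical form, the later one wins.
    lookup = {canonical(p): p for p in possibilities}
    # Compare the canonical test string against the canonical keys.
    hits = difflib.get_close_matches(canonical(word), list(lookup), n, cutoff)
    # Plug the results back through the dict to recover the originals.
    return [lookup[h] for h in hits]


names = ['Acacia koa A.Gray var. latifolia (Benth.) H.St.John',
         'Acacia koa A.Gray var. waianaeensis H.St.John',
         'Acacia koaia Hillebr.',
         'Acacia kochii W.Fitzg. ex Ewart & Jean White',
         'Acacia kochii W.Fitzg.']

print(close_matches_canonical('ACACIA KOCHI W.FITZG.', names, 1, 0.9))
```

Note that changing the canonical form also changes the similarity scores, because difflib compares the canonical strings and not the originals, so a cutoff tuned for the raw names may need revisiting.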

answered Sep 30 '22 09:09 by Hugh Bothwell