Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python fuzzy string matching as correlation style table/matrix

I have a file with x number of string names and their associated IDs. Essentially two columns of data.

What I would like, is a correlation style table with the format x by x (having the data in question both as the x-axis and y axis), but instead of correlation, I would like the fuzzywuzzy library's function fuzz.ratio(x,y) as the output using the string names as input. Essentially running every entry against every entry.

This is sort of what I had in mind. Just to show my intent:

import pandas as pd
from fuzzywuzzy import fuzz

df = pd.read_csv('random_data_file.csv')

df = df[['ID','String']]
df['String_Dup'] = df['String'] #creating duplicate of data in question
df = df.set_index('ID')

df = df.groupby('ID')[['String','String_Dup']].apply(fuzz.ratio())

But clearly this approach is not working for me at the moment. Any help appreciated. It doesn't have to be pandas, it is just an environment I am relatively more familiar with.

I hope my issue is clearly worded, and really, any input is appreciated,

like image 291
WayOutofDepth Avatar asked Nov 12 '18 11:11

WayOutofDepth


People also ask

What is FuzzyWuzzy python?

FuzzyWuzzy is a library of Python which is used for string matching. Fuzzy string matching is the process of finding strings that match a given pattern. Basically it uses Levenshtein Distance to calculate the differences between sequences.

How do you match a string in python?

Exact match (equality comparison): == , != As with numbers, the == operator determines if two strings are equal. If they are equal, True is returned; if they are not, False is returned. It is case-sensitive, and the same applies to comparisons by other operators and methods.

What is Fuzzystrmatch?

The fuzzystrmatch module provides two functions for working with Soundex codes: soundex(text) returns text difference(text, text) returns int. The soundex function converts a string to its Soundex code.


1 Answers

Use pandas' crosstab function, followed by a column-wise apply to compute the fuzz. This is considerably more elegant than my first answer.

import pandas as pd
from fuzzywuzzy import fuzz

# Create sample data frame.
df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
                  columns=['id', 'strings'])
# Create the cartesian product between the strings column with itself.
ct = pd.crosstab(df['strings'], df['strings'])
# Note: for pandas versions <0.22, the two series must have different names.
# In case you observe a "Level XX not found" error, the following may help:
# ct = pd.crosstab(df['strings'].rename(), df['strings'].rename())

# Apply the fuzz (column-wise). Argument col has type pd.Series.
ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])

# This results in the following:
#       strings      abc  abracadabra  brabra  cadra
#       strings
#       abc          100           43      44     25
#       abracadabra   43          100      71     62
#       brabra        44           71     100     55
#       cadra         25           62      55    100

For simplicity, I omitted the groupby operation as suggested in your question. In case need want to apply the fuzzy string matching on groups, simply create a separate function:

def cross_fuzz(df):
    ct = pd.crosstab(df['strings'], df['strings'])
    ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
    return ct

df.groupby('id').apply(cross_fuzz)
like image 159
normanius Avatar answered Nov 10 '22 00:11

normanius