Python fuzzy string matching as correlation style table/matrix

I have a file with x number of string names and their associated IDs. Essentially two columns of data.

What I would like is a correlation-style table in an x-by-x format (with the data in question on both the x-axis and the y-axis), but instead of a correlation coefficient, each cell should hold the output of the fuzzywuzzy library's function fuzz.ratio(x, y), using the string names as input. Essentially, I want to run every entry against every entry.

This is sort of what I had in mind. Just to show my intent:

import pandas as pd
from fuzzywuzzy import fuzz

df = pd.read_csv('random_data_file.csv')

df = df[['ID','String']]
df['String_Dup'] = df['String'] #creating duplicate of data in question
df = df.set_index('ID')

df = df.groupby('ID')[['String','String_Dup']].apply(fuzz.ratio())

But clearly this approach is not working for me at the moment. Any help is appreciated. It doesn't have to be pandas; it is just the environment I am more familiar with.

I hope my issue is clearly worded, and really, any input is appreciated.

291

asked Nov 12 '18 11:11

WayOutofDepth

1 Answer

Use pandas' crosstab function, followed by a column-wise apply to compute the fuzz ratio. This is considerably more elegant than my first answer.

import pandas as pd
from fuzzywuzzy import fuzz

# Create sample data frame.
df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
                  columns=['id', 'strings'])
# Create the cartesian product of the strings column with itself.
ct = pd.crosstab(df['strings'], df['strings'])
# Note: for pandas versions <0.22, the two series must have different names.
# In case you observe a "Level XX not found" error, the following may help:
# ct = pd.crosstab(df['strings'].rename(), df['strings'].rename())

# Apply the fuzz (column-wise). Argument col has type pd.Series.
ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])

# This results in the following:
#       strings      abc  abracadabra  brabra  cadra
#       strings
#       abc          100           43      44     25
#       abracadabra   43          100      71     62
#       brabra        44           71     100     55
#       cadra         25           62      55    100

For simplicity, I omitted the groupby operation suggested in your question. In case you want to apply the fuzzy string matching on groups, simply create a separate function:

def cross_fuzz(df):
    ct = pd.crosstab(df['strings'], df['strings'])
    ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
    return ct

df.groupby('id').apply(cross_fuzz)
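
Since the question notes that the solution does not have to be pandas, here is a minimal sketch of the same every-entry-against-every-entry table using only the standard library. It is not the crosstab approach above. The names ratio and pairwise_scores are hypothetical helpers, and ratio is a stand-in scorer built on difflib.SequenceMatcher rather than fuzzywuzzy itself, so its scores may differ from fuzz.ratio. Each pair is scored once and the result is mirrored, which treats the scorer as symmetric (an assumption; difflib does not guarantee it).

```python
from difflib import SequenceMatcher
from itertools import combinations


def ratio(a, b):
    # Stand-in scorer (an assumption, not fuzzywuzzy itself): difflib similarity
    # scaled to an integer between 0 and 100. Pass fuzz.ratio as the scorer
    # instead if fuzzywuzzy is installed.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def pairwise_scores(strings, scorer=ratio):
    # Hypothetical helper: returns a nested dict {column: {row: score}}
    # covering every pair of unique input strings.
    names = sorted(set(strings))
    # Diagonal: each string scored against itself.
    table = {name: {name: scorer(name, name)} for name in names}
    # Off-diagonal: score each unordered pair once and mirror the result,
    # which assumes the scorer can be treated as symmetric.
    for a, b in combinations(names, 2):
        score = scorer(a, b)
        table[a][b] = score
        table[b][a] = score
    return table


scores = pairwise_scores(['abracadabra', 'abc', 'cadra', 'brabra'])
print(scores['abc']['abracadabra'])
```

If a labeled table is wanted, the nested dict can be passed to pd.DataFrame(scores), which uses the outer keys as columns and the inner keys as the index. Scoring each pair only once roughly halves the number of scorer calls compared with filling every cell separately.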

159

answered Nov 10 '22 00:11

normanius


Tags: python, pandas, matrix, fuzzy
