
Compare two dataframe columns for matching percentage

I want to compare a one-column data frame against another data frame with multiple columns and return the header of the column with the highest match percentage.

I have not been able to find a matching function in pandas. First data frame, first column:

cars
----   
swift   
maruti   
wagonor  
hyundai  
jeep

First data frame, second column:

bikes
-----
RE
Ninja
Bajaj
pulsar

One-column data frame:

words
---------
swift 
RE 
maruti
waganor
hyundai
jeep
bajaj

Desired output:

100% match header - cars
surya narayan asked Jun 17 '19 07:06


2 Answers

Try the isin method of a pandas DataFrame. Assuming df is your first data frame and words is a list:

In[1]: (df.isin(words).sum()/df.shape[0])*100
Out[1]:
cars     100.0
bikes     20.0
dtype: float64

You may need to lowercase strings in your df and in the words list to avoid any casing issue.
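
As a minimal sketch of the lowercasing step combined with picking the best header, the snippet below rebuilds the question's data by hand. The names words_lower, df_lower, match_pct and best_column are illustrative, and dividing by the number of non-missing entries per column (instead of df.shape[0]) is an assumption made here so that the shorter bikes column is not penalised for its padding.

```python
import pandas as pd

# Sample data from the question; the shorter column is padded with None.
df = pd.DataFrame({
    "cars": ["swift", "maruti", "wagonor", "hyundai", "jeep"],
    "bikes": ["RE", "Ninja", "Bajaj", "pulsar", None],
})
words = ["swift", "RE", "maruti", "waganor", "hyundai", "jeep", "bajaj"]

# Lowercase both sides so that "Bajaj" and "bajaj" compare equal.
words_lower = [w.lower() for w in words]
df_lower = df.apply(lambda col: col.str.lower())

# Percentage of non-missing entries in each column found in the word list
# (assumption: missing values are excluded from the denominator).
match_pct = df_lower.isin(words_lower).sum() / df.notna().sum() * 100

# Header of the column with the highest match percentage.
best_column = match_pct.idxmax()
print(match_pct)
print(best_column)
```

With the data exactly as typed in the question, "wagonor" and "waganor" are spelled differently, so cars reaches about 80% rather than 100%, but it is still the best-matching header.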

Lawis answered Oct 09 '22 08:10


You can first get the columns into lists:

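# dropna() removes the NaN padding that appears when columns differ in length
dfCarsList = df['cars'].dropna().tolist()
dfBikesList = df['bikes'].dropna().tolist()  # the column is named 'bikes' (lowercase) in the question
# words_df is an assumed name for the separate one-column data frame
dfWordsList = words_df['words'].tolist()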

And then iterate over the lists for comparison:

numberCars = sum(any(m in L for m in dfCarsList) for L in dfWordsList)
numberBikes = sum(any(m in L for m in dfBikesList) for L in dfWordsList)

You can then use whichever count is higher to pick the column for your output.
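
The two sums above can be generalised to any number of columns. The following is only a sketch in plain Python: the columns dictionary, counts and best are hypothetical names, and the data is retyped from the question. Note that `m in w` on strings is a case-sensitive substring test, so a word counts as a hit when it contains one of the column's entries.

```python
# Column values and words retyped from the question (names are illustrative).
columns = {
    "cars": ["swift", "maruti", "wagonor", "hyundai", "jeep"],
    "bikes": ["RE", "Ninja", "Bajaj", "pulsar"],
}
words = ["swift", "RE", "maruti", "waganor", "hyundai", "jeep", "bajaj"]

# For each column, count the words that contain at least one of its entries.
counts = {
    name: sum(any(m in w for m in values) for w in words)
    for name, values in columns.items()
}

# The column with the highest count is the best match.
best = max(counts, key=counts.get)
print(counts, best)
```

Because the comparison is case-sensitive, "Bajaj" does not match "bajaj" here; lowercasing both sides first would change the bikes count.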

PV8 answered Oct 09 '22 07:10