Case insensitive pandas dataframe.merge

Tags: python, pandas, csv

I am struggling with the easiest way to do a case insensitive merge in pandas. Is there a way to do it right on the merge? Do I need to use (?i) or a regex with ignorecase? In my code snippet below I am joining some Countries where it may be "United States" in one file and "UNITED STATES" in another and I just want to take the case out of the equation. Thank you!

import pandas as pd
import csv
import sys

env_path = sys.argv[1]
map_path = sys.argv[2]


df_address = pd.read_csv(env_path + "\\address.csv")
df_CountryMapping = pd.read_csv(map_path + "\\CountryMapping.csv")

df_merged = df_address.merge(df_CountryMapping, left_on="Country", right_on="NAME", how="left")

....
EMC asked Apr 21 '15 02:04



4 Answers

Lowercase the values in the two columns used for the merge, then merge on the lowercased columns:

df_address['country_lower'] = df_address['Country'].str.lower()
df_CountryMapping['name_lower'] = df_CountryMapping['NAME'].str.lower()
df_merged = df_address.merge(df_CountryMapping, left_on="country_lower", right_on="name_lower", how="left")
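
For anyone who wants to try this without the CSV files, here is a minimal sketch with made-up sample data. The CODE column and the sample rows are assumptions for illustration only. It also drops the helper columns after the merge so the result keeps only the original, case-preserved values.

```python
import pandas as pd

# Made-up sample data standing in for address.csv and CountryMapping.csv.
# The CODE column is hypothetical.
df_address = pd.DataFrame({"Country": ["United States", "canada", "Mexico"]})
df_CountryMapping = pd.DataFrame(
    {"NAME": ["UNITED STATES", "CANADA"], "CODE": ["US", "CA"]}
)

# Build lowercase helper keys; the original columns keep their casing.
df_address["country_lower"] = df_address["Country"].str.lower()
df_CountryMapping["name_lower"] = df_CountryMapping["NAME"].str.lower()

df_merged = df_address.merge(
    df_CountryMapping, left_on="country_lower", right_on="name_lower", how="left"
)

# Drop the helper keys once the merge is done.
df_merged = df_merged.drop(columns=["country_lower", "name_lower"])
print(df_merged)
```

Rows with no counterpart in the mapping (Mexico here) stay in the result with missing values, because the merge is a left join.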
Shashank Agarwal answered Oct 15 '22 05:10


df_merged = pd.merge(df_address, df_CountryMapping, left_on=df_address["Country"].str.lower(), right_on=df_CountryMapping["NAME"].str.lower(), how="left")
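
A small sketch of the one-liner above on made-up data (the CODE column is hypothetical). One thing to watch for: because the keys are passed as Series rather than column names, the pandas versions I have tried add a generated key column named key_0 to the result. The sketch drops it only if it is present, so it should not fail if your version behaves differently.

```python
import pandas as pd

# Made-up sample data; the CODE column is hypothetical.
df_address = pd.DataFrame({"Country": ["United States", "canada"]})
df_CountryMapping = pd.DataFrame(
    {"NAME": ["UNITED STATES", "CANADA"], "CODE": ["US", "CA"]}
)

# Merge on lowercased copies of the key columns without storing them.
df_merged = pd.merge(
    df_address,
    df_CountryMapping,
    left_on=df_address["Country"].str.lower(),
    right_on=df_CountryMapping["NAME"].str.lower(),
    how="left",
)

# Remove the generated key column if pandas added one.
df_merged = df_merged.drop(columns=["key_0"], errors="ignore")
print(df_merged)
```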
dattatreya moganti answered Oct 15 '22 05:10


I suggest lowercasing the column names after reading the files:

df_address.columns=[c.lower() for c in df_address.columns]
df_CountryMapping.columns=[c.lower() for c in df_CountryMapping.columns]

Then lowercase the values in the key columns:

df_address['country']=df_address['country'].str.lower()
df_CountryMapping['name']=df_CountryMapping['name'].str.lower()

And only then do the merge:

df_merged = df_address.merge(df_CountryMapping, left_on="country", right_on="name", how="left")
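
A runnable sketch of these three steps on made-up data (the City and CODE columns are hypothetical). Note that this variant overwrites the key values, so the original casing is lost. The sketch also passes indicator=True, which is not part of the steps above, to list the left-hand rows that found no match.

```python
import pandas as pd

# Made-up sample data; the City and CODE columns are hypothetical.
df_address = pd.DataFrame(
    {"Country": ["United States", "Mexico"], "City": ["Austin", "Puebla"]}
)
df_CountryMapping = pd.DataFrame({"NAME": ["UNITED STATES"], "CODE": ["US"]})

# Step 1: lowercase the column labels.
df_address.columns = [c.lower() for c in df_address.columns]
df_CountryMapping.columns = [c.lower() for c in df_CountryMapping.columns]

# Step 2: lowercase the key values (this overwrites the original casing).
df_address["country"] = df_address["country"].str.lower()
df_CountryMapping["name"] = df_CountryMapping["name"].str.lower()

# Step 3: merge. indicator=True adds a _merge column that marks each row
# as 'both' or 'left_only'.
df_merged = df_address.merge(
    df_CountryMapping, left_on="country", right_on="name", how="left", indicator=True
)

# Countries from the address data that had no match in the mapping.
unmatched = df_merged.loc[df_merged["_merge"] == "left_only", "country"].tolist()
print(unmatched)
```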
Uri Goren answered Oct 15 '22 05:10


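One solution would be to convert the column names of both data frames to lowercase. Note that this normalizes only the column labels; the values in the key columns still have to be lowercased (for example with .str.lower()) for the merge itself to ignore case. So something like this: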

df_address = pd.read_csv(env_path + "\\address.csv")
df_CountryMapping = pd.read_csv(map_path + "\\CountryMapping.csv")

df_address.rename(columns=lambda x: x.lower(), inplace=True)
df_CountryMapping.rename(columns=lambda x: x.lower(), inplace=True)

df_merged = df_address.merge(df_CountryMapping, left_on="country", right_on="name", how="left")
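
To compare normalizing labels with normalizing values, here is a small sketch with made-up data (the CODE column is hypothetical). It runs the merge once after the rename alone and once after the key values are lowercased too.

```python
import pandas as pd

# Made-up sample data; the CODE column is hypothetical.
df_address = pd.DataFrame({"Country": ["United States"]})
df_CountryMapping = pd.DataFrame({"NAME": ["UNITED STATES"], "CODE": ["US"]})

# Lowercase the column labels only.
df_address.rename(columns=lambda x: x.lower(), inplace=True)
df_CountryMapping.rename(columns=lambda x: x.lower(), inplace=True)

# The labels are lowercase now, but the values still differ in case,
# so the left join finds no match.
labels_only = df_address.merge(
    df_CountryMapping, left_on="country", right_on="name", how="left"
)

# Lowercasing the key values as well lets the rows match.
df_address["country"] = df_address["country"].str.lower()
df_CountryMapping["name"] = df_CountryMapping["name"].str.lower()
with_values = df_address.merge(
    df_CountryMapping, left_on="country", right_on="name", how="left"
)

print(labels_only)
print(with_values)
```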
mway answered Oct 15 '22 04:10