I am struggling to find the easiest way to do a case-insensitive merge in pandas. Is there a way to do it directly in the merge? Do I need to use (?i) or a regex with ignorecase? In the code snippet below I am joining on country names, which may appear as "United States" in one file and "UNITED STATES" in another, and I just want to take case out of the equation. Thank you!
import pandas as pd
import csv
import sys
env_path = sys.argv[1]
map_path = sys.argv[2]
df_address = pd.read_csv(env_path + "\\address.csv")
df_CountryMapping = pd.read_csv(map_path + "\\CountryMapping.csv")
df_merged = df_address.merge(df_CountryMapping, left_on="Country", right_on="NAME", how="left")
....
pandas.DataFrame.merge (similar to a SQL join) is case-sensitive, as are most Python string comparisons.
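To make the case sensitivity concrete, here is a minimal sketch with small in-memory frames standing in for the two CSV files. The sample rows and the `CODE` column are assumptions made up for illustration; they are not from the original files.

```python
import pandas as pd

# Hypothetical sample data standing in for address.csv and CountryMapping.csv
df_address = pd.DataFrame({"Country": ["United States", "Canada"]})
df_CountryMapping = pd.DataFrame(
    {"NAME": ["UNITED STATES", "Canada"], "CODE": ["US", "CA"]}
)

# Plain merge: keys are compared exactly, so "United States" != "UNITED STATES"
df_merged = df_address.merge(
    df_CountryMapping, left_on="Country", right_on="NAME", how="left"
)

# With how="left" the unmatched row is kept, but its mapping columns are NaN
unmatched = df_merged["CODE"].isna().tolist()
print(df_merged)
```

Only the "Canada" row, whose spelling matches exactly in both frames, picks up a code.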
Use merge() for combining data on common columns or indices, .join() for combining data on a key column or an index, and concat() for combining DataFrames across rows or columns.
A merge is also just as efficient as a join as long as: Merging is done on indexes if possible. The “on” parameter is avoided, and instead, both columns to merge on are explicitly stated using the keywords left_on, left_index, right_on, and right_index (when applicable).
It generates two frames with a million rows each, in random order. Then it generates two more that have been sorted on the first column. Then it merges the first two, and last, merges the second two.
Lowercase the values in the two columns that will be used for the merge, and then merge on the lowercased columns:
df_address['country_lower'] = df_address['Country'].str.lower()
df_CountryMapping['name_lower'] = df_CountryMapping['NAME'].str.lower()
df_merged = df_address.merge(df_CountryMapping, left_on="country_lower", right_on="name_lower", how="left")
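As a runnable sketch of the helper-column approach, the example below uses hypothetical in-memory data (the rows and the `CODE` column are assumptions, not from the original files) and drops the helper columns after the merge so that only the original columns remain.

```python
import pandas as pd

# Hypothetical sample data; the real frames would come from pd.read_csv
df_address = pd.DataFrame({"Country": ["United States", "canada", "France"]})
df_CountryMapping = pd.DataFrame(
    {"NAME": ["UNITED STATES", "Canada"], "CODE": ["US", "CA"]}
)

# Build lowercase helper keys, leaving the original values untouched
df_address["country_lower"] = df_address["Country"].str.lower()
df_CountryMapping["name_lower"] = df_CountryMapping["NAME"].str.lower()

df_merged = df_address.merge(
    df_CountryMapping, left_on="country_lower", right_on="name_lower", how="left"
)

# The helper keys are no longer needed once the rows have been matched
df_merged = df_merged.drop(columns=["country_lower", "name_lower"])
codes = df_merged["CODE"].tolist()
print(df_merged)
```

"France" has no entry in the sample mapping, so it stays in the result with NaN in the mapping columns, as expected for a left merge.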
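Alternatively, pass the lowercased Series directly as the merge keys, without adding helper columns:
df_merged = pd.merge(df_address, df_CountryMapping, left_on=df_address["Country"].str.lower(), right_on=df_CountryMapping["NAME"].str.lower(), how="left")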
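The one-line merge above passes Series rather than column names as keys. A small sketch of what that looks like on hypothetical data follows (the rows and the `CODE` column are assumptions). In the pandas versions I am aware of, merging on array-like keys adds an extra key column named `key_0` to the result; the drop below uses `errors="ignore"` in case your version behaves differently.

```python
import pandas as pd

# Hypothetical sample data; the real frames would come from pd.read_csv
df_address = pd.DataFrame({"Country": ["United States", "canada"]})
df_CountryMapping = pd.DataFrame(
    {"NAME": ["UNITED STATES", "Canada"], "CODE": ["US", "CA"]}
)

# Lowercased Series are used as keys; neither input frame is modified
df_merged = pd.merge(
    df_address,
    df_CountryMapping,
    left_on=df_address["Country"].str.lower(),
    right_on=df_CountryMapping["NAME"].str.lower(),
    how="left",
)

# Remove the generated key column if it is present
df_merged = df_merged.drop(columns=["key_0"], errors="ignore")
codes = df_merged["CODE"].tolist()
print(df_merged)
```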
I suggest lowercasing the column names after reading the files:
df_address.columns=[c.lower() for c in df_address.columns]
df_CountryMapping.columns=[c.lower() for c in df_CountryMapping.columns]
Then lowercase the values in the key columns:
df_address['country']=df_address['country'].str.lower()
df_CountryMapping['name']=df_CountryMapping['name'].str.lower()
Only then do the merge:
df_merged = df_address.merge(df_CountryMapping, left_on="country", right_on="name", how="left")
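Note that the steps above overwrite the key values with their lowercase form. If the original casing needs to survive, one possible sketch is to wrap the merge in a small helper that works on temporary keys. The function name `merge_case_insensitive`, the `_key` column, and the sample data are all hypothetical, and `str.casefold()` is swapped in for `str.lower()` as a slightly more aggressive normalization; the sketch assumes neither frame already has a `_key` column.

```python
import pandas as pd


def merge_case_insensitive(left, right, left_on, right_on, how="left"):
    """Hypothetical helper: merge on case-folded copies of the key columns."""
    # assign() returns new frames, so the callers' DataFrames are not modified
    left_tmp = left.assign(_key=left[left_on].str.casefold())
    right_tmp = right.assign(_key=right[right_on].str.casefold())
    # Merge on the temporary key, then discard it
    return left_tmp.merge(right_tmp, on="_key", how=how).drop(columns="_key")


# Hypothetical sample data; the real frames would come from pd.read_csv
df_address = pd.DataFrame({"Country": ["United States", "canada"]})
df_CountryMapping = pd.DataFrame(
    {"NAME": ["UNITED STATES", "Canada"], "CODE": ["US", "CA"]}
)

df_merged = merge_case_insensitive(df_address, df_CountryMapping, "Country", "NAME")
codes = df_merged["CODE"].tolist()
print(df_merged)
```

Both input frames keep their original columns and values; only the returned frame combines them.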
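One solution would be to convert the column names of both data frames to all lowercase. Note that this normalizes only the column labels, not the values stored in them, so the key values still need to be lowercased before the merge will match them. So something like this: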
df_address = pd.read_csv(env_path + "\\address.csv")
df_CountryMapping = pd.read_csv(map_path + "\\CountryMapping.csv")
df_address.rename(columns=lambda x: x.lower(), inplace=True)
df_CountryMapping.rename(columns=lambda x: x.lower(), inplace=True)
df_merged = df_address.merge(df_CountryMapping, left_on="country", right_on="name", how="left")