
Merging csv files with different headers with Pandas in Python

I'm trying to map a dataset onto a blank CSV file that has different headers. In other words, I want to copy data from one CSV file into a new CSV that has a different number of columns, with different names. This question differs from similar ones because the column names aren't the same and there are no overlapping columns either. I also can't just overwrite the headers of the data file, since it contains other columns with irrelevant data. I'm fairly sure I'm overcomplicating this.

I've seen this example code, but how do I adapt it, given that it relies on a common header to join the data?

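import pandas as pd

a = pd.read_csv("a.csv")
b = pd.read_csv("b.csv")
# a.csv = ID TITLE
# b.csv = ID NAME
# Drop any column of b that contains missing values
b = b.dropna(axis=1)
# Join on the header both files share (ID)
merged = a.merge(b, on='ID')
merged.to_csv("output.csv", index=False)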

Sample Data

a.csv (blank format file, the format must match this file):

Headers: TOWN NAME LOCATION HEIGHT STAR

b.csv:

Headers: COUNTRY  WEIGHT  NAME  AGE  MEASUREMENT
Data:    UK,      150lbs, John, 6,   6ft

Expected output file:

Headers: TOWN     NAME   LOCATION  HEIGHT  STAR
Data:    (Blank), John,  UK,       6ft,    (Blank)
MF DOOM asked Mar 12 '20 08:03




1 Answer

From your example, it looks like you need to rename some columns in addition to doing the merge. The renaming is easiest to do before the merge itself.

import pandas as pd

# Read the csv files
dfA = pd.read_csv("a.csv")
dfB = pd.read_csv("b.csv")

# Rename the columns of b.csv that should match the ones in a.csv
dfB = dfB.rename(columns={'MEASUREMENT': 'HEIGHT', 'COUNTRY': 'LOCATION'})

# Merge on all common columns
df = pd.merge(dfA, dfB, on=list(set(dfA.columns) & set(dfB.columns)), how='outer')

# Only keep the columns that exist in a.csv, in the same order
df = df[dfA.columns]

# Save to a new csv
df.to_csv("output.csv", index=False)

This should give you what you are after.
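
If a.csv really is only a header row with no data, the merge may not be needed at all. A minimal sketch of that variant, assuming the sample data from the question (the in-memory `a_csv` and `b_csv` strings are stand-ins for the real files, and the names `template`, `data` and `result` are only illustrative): rename the columns as above, then use `reindex` to keep just the columns of the template, in its order. Columns that have no source data come out empty.

```python
import io

import pandas as pd

# In-memory stand-ins for a.csv and b.csv, built from the question's sample data
a_csv = io.StringIO("TOWN,NAME,LOCATION,HEIGHT,STAR\n")
b_csv = io.StringIO("COUNTRY,WEIGHT,NAME,AGE,MEASUREMENT\nUK,150lbs,John,6,6ft\n")

template = pd.read_csv(a_csv)  # header row only, zero data rows
data = pd.read_csv(b_csv)

# Same renaming step as in the merge-based version
data = data.rename(columns={"MEASUREMENT": "HEIGHT", "COUNTRY": "LOCATION"})

# Keep only the template's columns, in the template's order.
# Columns missing from the data (TOWN, STAR) are filled with NaN,
# and the extra columns (WEIGHT, AGE) are dropped.
result = data.reindex(columns=template.columns)

# NaN values are written as empty fields
csv_text = result.to_csv(index=False)
print(csv_text)
```

With real files, `read_csv` would take the file paths and `result.to_csv("output.csv", index=False)` would write the output. If a.csv might also contain rows that need to be kept, the merge-based version is the safer choice.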

Shaido answered Nov 14 '22 23:11