I am concatenating all files in a directory into one, but some files have a different number of entries. How do I place a NaN where a file has no value for a given key?
For example:
file1.cs
NUM, NAME, ORG, DATA
1,AAA,10,123.4
1,AAB,20,176.5
1,AAC,30,133.5
file2.cs
NUM, NAME, ORG, DATA
1,AAA,10,111.4
1,AAC,30,122.5
2,BBA,12,156.7
Desired Output
NUM, NAME, ORG, File1, File2 ....
1, AAA, 10, 123.4, 111.4
1, AAB, 20, 176.5, NaN
1, AAC, 30, 133.5, 122.5
2, BBA, 12, NaN, 156.7
.....
This is what I've tried:
import pandas as pd
import glob
writer = pd.ExcelWriter('analysis.xlsx', engine='xlsxwriter')
data = []
df1 = pd.read_csv("file1.cs", sep = ',', header = 'infer')
for infile in glob.glob("*.cs"):
    df = pd.read_csv(infile, sep = ',', header = 'infer')
    name = infile[13:-7]
    df['filename'] = name
    data.append(df)
result = pd.merge(df1, data.to_frame(), on= 'NAME')
result.to_excel(writer, sheet_name=sheetname)
writer.save()
I also tried pd.concat(data, axis=1, ignore_index=False)
but this doesn't add the NaN because it just concatenates the files based on the column name.
Use merge with the parameter how set to 'outer':
df1.merge(df2, on=['NUM','NAME','ORG'], how='outer')
Output:
   NUM NAME  ORG  DATA_x  DATA_y
0    1  AAA   10   123.4   111.4
1    1  AAB   20   176.5     NaN
2    1  AAC   30   133.5   122.5
3    2  BBA   12     NaN   156.7
To get your exact output use:
df1.rename(columns={'DATA':'FILE'})\
   .merge(df2.rename(columns={'DATA':'FILE'}),
          on=['NUM','NAME','ORG'],
          how='outer',
          suffixes=('1','2'))
Output:
   NUM NAME  ORG  FILE1  FILE2
0    1  AAA   10  123.4  111.4
1    1  AAB   20  176.5    NaN
2    1  AAC   30  133.5  122.5
3    2  BBA   12    NaN  156.7
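Since the question is about every file in a directory rather than just two, the same outer merge can be folded over a list of frames. The following is only a sketch under some assumptions: the helper name combine and the labels File1/File2 are made up for illustration, each file is assumed to have at most one row per (NUM, NAME, ORG) key, and skipinitialspace=True is used because the sample headers have a space after each comma (without it the columns would be read as ' NAME', ' ORG', and so on). The sample data is read from in-memory strings so the snippet runs on its own.

```python
import io
from functools import reduce

import pandas as pd

KEYS = ["NUM", "NAME", "ORG"]


def combine(frames):
    """Outer-merge several frames on KEYS (hypothetical helper).

    frames maps a label (e.g. a file name) to a DataFrame that has the
    KEYS columns plus a DATA column. Expects at least one frame.
    """
    # Give each DATA column a unique name so the merges never collide.
    renamed = [df.rename(columns={"DATA": label}) for label, df in frames.items()]
    # Fold the frames together; how="outer" keeps keys missing from either
    # side and fills the gaps with NaN.
    merged = reduce(lambda left, right: left.merge(right, on=KEYS, how="outer"), renamed)
    return merged.sort_values(KEYS).reset_index(drop=True)


# Sample data from the question, held in memory instead of on disk.
file1 = io.StringIO("NUM, NAME, ORG, DATA\n1,AAA,10,123.4\n1,AAB,20,176.5\n1,AAC,30,133.5\n")
file2 = io.StringIO("NUM, NAME, ORG, DATA\n1,AAA,10,111.4\n1,AAC,30,122.5\n2,BBA,12,156.7\n")

frames = {
    "File1": pd.read_csv(file1, skipinitialspace=True),
    "File2": pd.read_csv(file2, skipinitialspace=True),
}
result = combine(frames)
print(result)
```

With real files, the frames dictionary could be built from glob.glob("*.cs") instead, using whatever part of each file name you want as the column label. If a file can contain the same key more than once, the merge will produce one row per matching combination, so duplicates would need to be dropped or aggregated first.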