I'm very new to Python. Any support is much appreciated.
I have two CSV files that I'm trying to merge on a Student_ID column to create a new CSV file.
csv1: every entry has a unique Student_ID
Student_ID Age Course startYear
119 24 Bsc 2014
csv2: has multiple records per Student_ID, because it has a new entry for every subject the student is taking
Student_ID sub_name marks Sub_year_level
119 Botany1 60 2
119 Anatomy 70 2
119 cell bio 75 3
129 Physics1 78 2
129 Math1 60 1
I want to merge the two CSV files so that I keep all records and columns from csv1 and add new columns computed from csv2: the average mark (which has to be calculated) for each Sub_year_level per student. The final CSV file should therefore have a unique Student_ID in every record.
What I want my new output csv file to look like:
Student_ID Age Course startYear level1_avg_mark level2_avg_mark level3_avg_mark
119 24 Bsc 2014 60 65 70
You can use pivot_table with join. Note that the fill_value parameter replaces NaN with 0 (remove it if you don't need that), and the default aggregation function is mean:
df2 = df2.pivot_table(index='Student_ID', \
columns='Sub_year_level', \
values='marks', \
fill_value=0) \
.rename(columns='level{}_avg_mark'.format)
print (df2)
Sub_year_level level1_avg_mark level2_avg_mark level3_avg_mark
Student_ID
119 0 65 75
129 60 78 0
df = df1.join(df2, on='Student_ID')
print (df)
Student_ID Age Course startYear level1_avg_mark level2_avg_mark \
0 119 24 Bsc 2014 0 65
level3_avg_mark
0 75
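For completeness, here is a minimal end-to-end sketch of the same pivot_table + join idea, including reading the CSVs and writing the result. It assumes comma-separated files with the headers shown in the question; the in-memory buffers stand in for the real files, and the names avg, merged and out are only illustrative. fill_value is left out here, so levels a student has no marks for stay empty instead of becoming 0.

```python
import io
import pandas as pd

# In-memory stand-ins for the two CSV files (assumed comma-separated).
# With real files, pass the file paths to pd.read_csv instead.
csv1 = io.StringIO(
    "Student_ID,Age,Course,startYear\n"
    "119,24,Bsc,2014\n"
)
csv2 = io.StringIO(
    "Student_ID,sub_name,marks,Sub_year_level\n"
    "119,Botany1,60,2\n"
    "119,Anatomy,70,2\n"
    "119,cell bio,75,3\n"
    "129,Physics1,78,2\n"
    "129,Math1,60,1\n"
)

df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)

# One row per student, one column per year level, mean mark in each cell.
avg = (
    df2.pivot_table(index="Student_ID", columns="Sub_year_level", values="marks")
       .rename(columns="level{}_avg_mark".format)
)

# join is a left join by default, so every row of df1 is kept exactly once.
merged = df1.join(avg, on="Student_ID")

# Written to a buffer here; pass a file path to to_csv to create a file on disk.
out = io.StringIO()
merged.to_csv(out, index=False)
print(out.getvalue())
```

Student 129 does not appear in the result because it is missing from csv1, and the left join only keeps students listed there.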
EDIT:
To leave zero marks out of the average, you need a custom function:
print (df2)
Student_ID sub_name marks Sub_year_level
0 119 Botany1 0 2
1 119 Botany1 0 2
2 119 Anatomy 72 2
3 119 cell bio 75 3
4 129 Physics1 78 2
5 129 Math1 60 1
f = lambda x: x[x != 0].mean()
df2 = (df2.pivot_table(index='Student_ID', columns='Sub_year_level', values='marks', aggfunc=f)
          .rename(columns='level{}_avg_mark'.format)
          .reset_index())
print (df2)
Sub_year_level Student_ID level1_avg_mark level2_avg_mark level3_avg_mark
0 119 NaN 72.0 75.0
1 129 60.0 78.0 NaN
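As a possible alternative to the custom function, the zeros can be turned into NaN before pivoting, since mean skips NaN by default. This is only a sketch on the sample data above, assuming a mark of 0 means "no mark recorded"; the names cleaned and avg are illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({
    "Student_ID": [119, 119, 119, 119, 129, 129],
    "sub_name": ["Botany1", "Botany1", "Anatomy", "cell bio", "Physics1", "Math1"],
    "marks": [0, 0, 72, 75, 78, 60],
    "Sub_year_level": [2, 2, 2, 3, 2, 1],
})

# where() keeps marks that satisfy the condition and replaces the rest (the zeros) with NaN.
cleaned = df2.assign(marks=df2["marks"].where(df2["marks"] != 0))

# The default aggregation (mean) ignores NaN, so the zeros no longer pull the average down.
avg = (
    cleaned.pivot_table(index="Student_ID", columns="Sub_year_level", values="marks")
           .rename(columns="level{}_avg_mark".format)
           .reset_index()
)
print(avg)
```

This keeps the pivot_table call free of a lambda, at the cost of one extra preprocessing step.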
You can use groupby to calculate the average marks per level, then unstack to get all levels in one row, and rename the columns.
Once that is done, the groupby + unstack has conveniently left Student_ID in the index, which allows for an easy join. All that is left is to do the join and specify the on parameter.
d1.join(
d2.groupby(
['Student_ID', 'Sub_year_level']
).marks.mean().unstack().rename(columns='level{}_avg_mark'.format),
on='Student_ID'
)
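To make the intermediate steps visible, the same groupby + unstack + join chain can be split up as below. This is a sketch on the question's sample data; d1 and d2 are built inline rather than read from files, and per_level, wide and result are illustrative names.

```python
import pandas as pd

d1 = pd.DataFrame(
    {"Student_ID": [119], "Age": [24], "Course": ["Bsc"], "startYear": [2014]}
)
d2 = pd.DataFrame({
    "Student_ID": [119, 119, 119, 129, 129],
    "sub_name": ["Botany1", "Anatomy", "cell bio", "Physics1", "Math1"],
    "marks": [60, 70, 75, 78, 60],
    "Sub_year_level": [2, 2, 3, 2, 1],
})

# Step 1: mean mark per (student, level); the result is a Series with a two-level index.
per_level = d2.groupby(["Student_ID", "Sub_year_level"])["marks"].mean()

# Step 2: move the level part of the index into columns, then rename 1, 2, 3 to level1_avg_mark, ...
wide = per_level.unstack().rename(columns="level{}_avg_mark".format)

# Step 3: Student_ID is still the index of wide, so it can be joined on d1's Student_ID column.
result = d1.join(wide, on="Student_ID")
print(result)
```

Levels with no marks for a student come out as NaN after unstack; calling fillna(0) on wide would mimic the fill_value=0 behaviour of the pivot_table answer if that is wanted.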