Merging two CSV data sets in Python on a common ID column when one CSV has multiple records per ID

I'm very new to Python. Any support is much appreciated.

I have two CSV files that I'm trying to merge on a Student_ID column to create a new CSV file.

csv1: every entry has a unique Student_ID

Student_ID    Age        Course       startYear
119           24         Bsc          2014

csv2: has multiple records per Student_ID, with a new entry for every subject the student is taking

Student_ID            sub_name       marks      Sub_year_level
119                   Botany1        60         2
119                   Anatomy        70         2
119                   cell bio       75         3
129                   Physics1       78         2
129                   Math1          60         1 

I want to merge the two CSV files so that I keep all records and columns from csv1 and add new columns computed from csv2: the average mark (which has to be calculated) for each Sub_year_level per student. The final CSV file should therefore have one record per unique Student_ID.

What I want my new output csv file to look like:

Student_ID  Age  Course  startYear  level1_avg_mark  level2_avg_mark  level3_avg_mark
119         24   Bsc     2014       60               65               70
BA stu asked Mar 16 '17 06:03


2 Answers

You can use pivot_table with join.

Note: the fill_value parameter replaces NaN with 0; remove it if you don't need that. The default aggregation function is mean.

df2 = (df2.pivot_table(index='Student_ID',
                       columns='Sub_year_level',
                       values='marks',
                       fill_value=0)
          .rename(columns='level{}_avg_mark'.format))
print (df2)
Sub_year_level  level1_avg_mark  level2_avg_mark  level3_avg_mark
Student_ID                                                       
119                           0               65               75
129                          60               78                0

df = df1.join(df2, on='Student_ID')
print (df)
   Student_ID  Age Course  startYear  level1_avg_mark  level2_avg_mark  \
0         119   24    Bsc       2014                0               65   

   level3_avg_mark  
0               75  
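
For completeness, here is a minimal end-to-end sketch of the same pivot_table + join approach, including reading the input and producing CSV output, which the snippet above leaves out. The in-memory strings stand in for the two files from the question, and the names `avg`, `merged` and `csv_text` are only illustrative. fill_value is omitted here, so a level with no marks comes out as NaN (an empty cell in the CSV) rather than 0.

```python
import io
import pandas as pd

# Stand-ins for the two CSV files from the question; with real files,
# pass their paths to pd.read_csv instead of these in-memory buffers.
csv1 = io.StringIO(
    "Student_ID,Age,Course,startYear\n"
    "119,24,Bsc,2014\n"
)
csv2 = io.StringIO(
    "Student_ID,sub_name,marks,Sub_year_level\n"
    "119,Botany1,60,2\n"
    "119,Anatomy,70,2\n"
    "119,cell bio,75,3\n"
    "129,Physics1,78,2\n"
    "129,Math1,60,1\n"
)
df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)

# Mean mark per student and year level, one column per level
# (mean is pivot_table's default aggregation).
avg = df2.pivot_table(index="Student_ID", columns="Sub_year_level", values="marks")
avg = avg.rename(columns="level{}_avg_mark".format)

# join is a left join by default, so every row of df1 is kept.
merged = df1.join(avg, on="Student_ID")

# With no path given, to_csv returns the CSV text; pass a file name
# (for example a hypothetical "merged.csv") to write to disk instead.
csv_text = merged.to_csv(index=False)
print(csv_text)
```

Because the join is driven by df1, student 129 (present only in csv2) does not appear in the result.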

EDIT:

If marks of 0 should be left out of the average, you need a custom aggregation function:

print (df2)
   Student_ID  sub_name  marks  Sub_year_level
0         119   Botany1      0               2
1         119   Botany1      0               2
2         119   Anatomy     72               2
3         119  cell bio     75               3
4         129  Physics1     78               2
5         129     Math1     60               1


# mean of the non-zero marks only
f = lambda x: x[x != 0].mean()
df2 = (df2.pivot_table(index='Student_ID', columns='Sub_year_level',
                       values='marks', aggfunc=f)
          .rename(columns='level{}_avg_mark'.format)
          .reset_index())
print (df2)
Sub_year_level  Student_ID  level1_avg_mark  level2_avg_mark  level3_avg_mark
0                      119              NaN             72.0             75.0
1                      129             60.0             78.0              NaN
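
An alternative sketch of the same idea, assuming a mark of 0 means "no mark recorded": turn the zeros into NaN first and let the default mean skip them, so no custom function is needed. The names `nonzero` and `avg` are only illustrative, and the data is the sample shown in this edit.

```python
import pandas as pd

df2 = pd.DataFrame({
    "Student_ID": [119, 119, 119, 119, 129, 129],
    "sub_name": ["Botany1", "Botany1", "Anatomy", "cell bio", "Physics1", "Math1"],
    "marks": [0, 0, 72, 75, 78, 60],
    "Sub_year_level": [2, 2, 2, 3, 2, 1],
})

# Keep marks that are non-zero; where() puts NaN everywhere else.
# Assumption: a mark of 0 means "missing", not a genuine score of zero.
nonzero = df2.assign(marks=df2["marks"].where(df2["marks"] != 0))

# The default aggregation (mean) ignores NaN values.
avg = (
    nonzero.pivot_table(index="Student_ID", columns="Sub_year_level", values="marks")
    .rename(columns="level{}_avg_mark".format)
    .reset_index()
)
print(avg)
```

If a 0 can be a real mark in your data, neither this nor the lambda above is appropriate, since both discard it.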
jezrael answered Sep 17 '22 23:09


You can use groupby to calculate the average mark per level, then unstack to get all levels in one row, and finally rename the columns.

Once that is done, groupby + unstack has conveniently left Student_ID in the index, which allows for an easy join. All that is left is to do the join and specify the on parameter.

d1.join(
    d2.groupby(
        ['Student_ID', 'Sub_year_level']
    ).marks.mean().unstack().rename(columns='level{}_avg_mark'.format),
    on='Student_ID'
)
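
As a rough, self-contained sketch of the expression above, the following builds d1 and d2 from the question's sample data (constructed inline here instead of read from the CSV files) and runs the groupby + unstack + join steps one at a time. The intermediate names `per_level`, `wide` and `result` are only illustrative.

```python
import pandas as pd

d1 = pd.DataFrame(
    {"Student_ID": [119], "Age": [24], "Course": ["Bsc"], "startYear": [2014]}
)
d2 = pd.DataFrame({
    "Student_ID": [119, 119, 119, 129, 129],
    "sub_name": ["Botany1", "Anatomy", "cell bio", "Physics1", "Math1"],
    "marks": [60, 70, 75, 78, 60],
    "Sub_year_level": [2, 2, 3, 2, 1],
})

# Step 1: mean mark for each (student, year level) pair.
per_level = d2.groupby(["Student_ID", "Sub_year_level"])["marks"].mean()

# Step 2: move the year level from the index into the columns,
# leaving one row per student, then give the columns readable names.
wide = per_level.unstack().rename(columns="level{}_avg_mark".format)

# Step 3: Student_ID is now the index of `wide`, so d1 can join on it.
result = d1.join(wide, on="Student_ID")
print(result)
```

Levels for which a student has no marks come out as NaN, and students that appear only in d2 (129 here) are dropped because join keeps the rows of d1.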

piRSquared answered Sep 17 '22 23:09