
Float values when pivoting dataframe of integers (pandas - python)

I have a huge pandas dataframe, shaped like this example:

 new_id hour names values
    0   0   mark    5
    0   0   matt    4
    0   0   alex    3
    1   0   roger   2
    1   0   arthur  7
    1   1   alf     8
    2   1   ale     6
    3   1   peter   5
    3   2   tom     2
    4   2   andrew  7

I need to reshape it, so I use pivot_table():

dummy = dummy.pivot_table(index=['hour', 'new_id'], columns='names', values='values').fillna(0)

so it becomes

        names   ale alex alf andrew arthur mark matt peter roger tom
hour    new_id                                      
0           0   0.0 3.0 0.0 0.0 0.0 5.0 4.0 0.0 0.0 0.0
            1   0.0 0.0 0.0 0.0 7.0 0.0 0.0 0.0 2.0 0.0
1           1   0.0 0.0 8.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
            2   6.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
            3   0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 0.0 0.0
2           3   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0
            4   0.0 0.0 0.0 7.0 0.0 0.0 0.0 0.0 0.0 0.0
....

Note that this small example does not reproduce my problem. In the real dataset, pivoting produces some float values that should not be there: the results are aggregations (sums) of the values in the initial dataset, which are all integers. Not only are they floats, they are also quite far from the exact results.

Why do I get these float values? Is there a better way to get what I want? I would rather not write my own function to sum all the values before pivoting the dataframe, since that should be exactly what pivot_table() does.

sato asked Jan 28 '23 02:01


1 Answer

The problem is NaNs: combinations that are missing after the pivot become NaN, which converts all values to floats. If the input data are integers, a possible solution is to add the parameter fill_value=0:

dummy = dummy.pivot_table(index=['hour', 'new_id'], columns='names', values='values', fill_value=0)
print(dummy)
names        ale  alex  alf  andrew  arthur  mark  matt  peter  roger  tom
hour new_id                                                               
0    0         0     3    0       0       0     5     4      0      0    0
     1         0     0    0       0       7     0     0      0      2    0
1    1         0     0    8       0       0     0     0      0      0    0
     2         6     0    0       0       0     0     0      0      0    0
     3         0     0    0       0       0     0     0      5      0    0
2    3         0     0    0       0       0     0     0      0      0    2
     4         0     0    0       7       0     0     0      0      0    0
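To see the NaN effect in isolation, here is a minimal sketch. The three-row frame is a hypothetical miniature of the question's data, and it uses Series.unstack rather than pivot_table so that no aggregation is involved; it only illustrates the dtype change, and is not necessarily what pivot_table does internally.

```python
import pandas as pd

# Hypothetical miniature of the question's data: integer values only.
df = pd.DataFrame({
    "hour":   [0, 0, 1],
    "new_id": [0, 0, 1],
    "names":  ["mark", "matt", "alf"],
    "values": [5, 4, 8],
})

s = df.set_index(["hour", "new_id", "names"])["values"]

# Missing (row, column) combinations become NaN, which forces float64.
wide_nan = s.unstack("names")

# Supplying a fill value during the reshape avoids NaN, so the integer dtype survives.
wide_filled = s.unstack("names", fill_value=0)

print(wide_nan.dtypes)
print(wide_filled.dtypes)
```

Every column of wide_nan contains at least one NaN here, so all of them are upcast to float, while wide_filled keeps integer columns.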

The default aggregation function in pivot_table is mean, so at least one float value is to be expected in the output, and that converts all values to floats.

So if you change the aggregation function to sum, everything works nicely:
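The difference only shows up when the same (hour, new_id, names) combination occurs more than once, which is probably why the small example in the question cannot reproduce it. A minimal sketch with hypothetical data in which "mark" appears twice for the same index:

```python
import pandas as pd

# Hypothetical data: (hour=0, new_id=0, names="mark") occurs twice.
df = pd.DataFrame({
    "hour":   [0, 0, 0],
    "new_id": [0, 0, 0],
    "names":  ["mark", "mark", "matt"],
    "values": [5, 2, 4],
})

# Default aggregation: duplicates are averaged.
by_mean = df.pivot_table(index=["hour", "new_id"], columns="names", values="values")

# Explicit aggregation: duplicates are added up.
by_sum = df.pivot_table(index=["hour", "new_id"], columns="names",
                        values="values", aggfunc="sum")

print(by_mean.loc[(0, 0), "mark"])  # mean of 5 and 2
print(by_sum.loc[(0, 0), "mark"])   # sum of 5 and 2
```

With the default, the cell for "mark" holds 3.5 instead of the expected total of 7, which matches the "floats far from the exact results" described in the question.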

dummy = dummy.pivot_table(index=['hour', 'new_id'],
                          columns='names',
                          values='values',
                          fill_value=0,
                          aggfunc='sum')
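For completeness, a self-contained sketch of the same call on the sample rows from the question, with a sanity check that the grand total is preserved. The final astype("int64") is my own addition, not part of the call above: whether pivot_table already returns integers here may depend on the pandas version, and the cast is safe because fill_value=0 leaves no NaN behind.

```python
import pandas as pd

# The sample rows from the question; all values are integers.
df = pd.DataFrame({
    "new_id": [0, 0, 0, 1, 1, 1, 2, 3, 3, 4],
    "hour":   [0, 0, 0, 0, 0, 1, 1, 1, 2, 2],
    "names":  ["mark", "matt", "alex", "roger", "arthur",
               "alf", "ale", "peter", "tom", "andrew"],
    "values": [5, 4, 3, 2, 7, 8, 6, 5, 2, 7],
})

table = df.pivot_table(index=["hour", "new_id"],
                       columns="names",
                       values="values",
                       fill_value=0,
                       aggfunc="sum")

# Assumption: an explicit cast as a safeguard in case the result is still float.
table = table.astype("int64")

# Pivoting with sum must not change the grand total of the values.
assert table.to_numpy().sum() == df["values"].sum()

print(table)
```

Running the same total check on the real dataset is a quick way to confirm that the pivot now sums rather than averages.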
jezrael answered Jan 31 '23 12:01