Hi, I am running into an issue where my events data source looks like this:
   event_id             device_id            timestamp  longitude  latitude
0         1     29182687948017175  2016-05-01 00:55:25     121.38     31.24
1         2  -6401643145415154744  2016-05-01 00:54:12     103.65     30.97
2         3  -4833982096941402721  2016-05-01 00:08:05     106.60     29.70
I am trying to group the events by device_id and then get the sum/mean/std of a variable over every event with that device_id:
events['latitude_mean'] = events.groupby(['device_id'])['latitude'].aggregate(np.sum)
But my output is always:
   event_id             device_id            timestamp  longitude  latitude
0         1     29182687948017175  2016-05-01 00:55:25     121.38     31.24
1         2  -6401643145415154744  2016-05-01 00:54:12     103.65     30.97
2         3  -4833982096941402721  2016-05-01 00:08:05     106.60     29.70
3         4  -6815121365017318426  2016-05-01 00:06:40     104.27     23.28
4         5  -5373797595892518570  2016-05-01 00:07:18     115.88     28.66

   latitude_mean
0            NaN
1            NaN
2            NaN
3            NaN
4            NaN
What am I doing wrong that makes the result NaN for every row?
You can use the pandas.core.groupby.GroupBy.transform(aggfunc) method, which applies aggfunc to all rows in each group:
In [32]: events['latitude_mean'] = events.groupby(['device_id'])['latitude'].transform('sum')

In [33]: events
Out[33]:
   event_id             device_id            timestamp  longitude  latitude  latitude_mean
0         1     29182687948017175  2016-05-01 00:55:25     121.38     31.24          62.55
1         2     29182687948017175  2016-05-30 12:12:12     777.77     31.31          62.55
2         3  -6401643145415154744  2016-05-01 00:54:12     103.65     30.97          64.30
3         4  -6401643145415154744  2016-01-01 11:11:11     111.11     33.33          64.30
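Since the question asks for sum/mean/std together, the same transform pattern extends directly to those aggregations. A minimal sketch, using a hypothetical frame mirroring the data above:

```python
import pandas as pd

# Hypothetical toy frame with the same device_ids and latitudes as above
events = pd.DataFrame({
    'event_id': [1, 2, 3, 4],
    'device_id': [29182687948017175, 29182687948017175,
                  -6401643145415154744, -6401643145415154744],
    'latitude': [31.24, 31.31, 30.97, 33.33],
})

g = events.groupby('device_id')['latitude']

# transform returns a Series aligned to the ORIGINAL index,
# so each assignment lands on the right rows
events['latitude_sum'] = g.transform('sum')
events['latitude_mean'] = g.transform('mean')
events['latitude_std'] = g.transform('std')

print(events)
```

Each transform call broadcasts the per-group result back over every member row, which is exactly what makes the column assignment work.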
Here you may find some usage examples.
Explanation: when you group your DF, the result is usually a Series with fewer rows and a different index, so pandas doesn't know how to align it when assigning it to a new column, and you end up with NaNs:
In [31]: events.groupby(['device_id'])['latitude'].agg(np.sum)
Out[31]:
device_id
-6401643145415154744    64.30
 29182687948017175      62.55
Name: latitude, dtype: float64
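If you do want the collapsed per-group result from agg, you can still broadcast it back onto the original rows by mapping device_id through the aggregated Series' index. A sketch with the same hypothetical values:

```python
import pandas as pd

# Hypothetical frame matching the answer's data
events = pd.DataFrame({
    'device_id': [29182687948017175, 29182687948017175,
                  -6401643145415154744, -6401643145415154744],
    'latitude': [31.24, 31.31, 30.97, 33.33],
})

# agg collapses to one row per device_id, indexed by device_id ...
sums = events.groupby('device_id')['latitude'].agg('sum')

# ... so mapping device_id onto that index aligns it back row by row
events['latitude_sum'] = events['device_id'].map(sums)
```

This is equivalent to transform('sum') here, but keeps the small per-group Series around if you also need it on its own.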
so when you try to assign it to a new column, pandas does something like this:
In [36]: events['nans'] = pd.Series([1,2], index=['a','b'])

In [38]: events[['event_id','nans']]
Out[38]:
   event_id  nans
0         1   NaN
1         2   NaN
2         3   NaN
3         4   NaN
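To see that the NaNs really come from index alignment, compare assigning a Series with a foreign index against one whose index matches. A small sketch (the column names 'nans' and 'ok' are made up for illustration):

```python
import pandas as pd

events = pd.DataFrame({'event_id': [1, 2, 3, 4]})

# Foreign index ('a', 'b') matches none of events' row labels,
# so every cell in the new column becomes NaN
events['nans'] = pd.Series([1, 2], index=['a', 'b'])

# A Series whose index matches aligns row by row as expected
events['ok'] = pd.Series([10, 20, 30, 40], index=events.index)

print(events)
```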
Data:

In [30]: events
Out[30]:
   event_id             device_id            timestamp  longitude  latitude
0         1     29182687948017175  2016-05-01 00:55:25     121.38     31.24
1         2     29182687948017175  2016-05-30 12:12:12     777.77     31.31
2         3  -6401643145415154744  2016-05-01 00:54:12     103.65     30.97
3         4  -6401643145415154744  2016-01-01 11:11:11     111.11     33.33