I am using Pandas and trying to test something to fully understand some functionalities.
I am grouping and aggregating my data after I load everything from a csv using the following code:
s = df.groupby(['ID','Site']).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
print(s)
and it works with the following file:

but it does not work with this file:

For the second file, I am getting the data only for the 56311 ID. The reason is that some columns have empty values. But that should not matter. I have not found anything relevant about that. I have only found how to exclude the null columns.
Except for this issue, what are the main things that I should take into account before grouping? Is there any chance that rows will be excluded because for example of a format (date or number)?
You can group DataFrame rows into a list by using pandas. DataFrame. groupby() function on the column of interest, select the column you want as a list from group and then use Series. apply(list) to get the list for every group.
What is the GroupBy function? Pandas' GroupBy is a powerful and versatile function in Python. It allows you to split your data into separate groups to perform computations for better analysis.
You can replace blank/empty values with DataFrame. replace() methods. The replace() method replaces the specified value with another specified value on a specified column or on all columns of a DataFrame; replaces every case of the specified value. Yields below output.
In Pandas versions > 1.1.0, you can pass dropna=False to keep NaN values (see pandas.DataFrame.groupby).
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.__version__
Out[3]: '1.1.2'
In [4]: df = pd.DataFrame([[1, 2], [3, 4], [np.nan, 6]], columns=["A", "B"])
In [5]: df
Out[5]:
A B
0 1.0 2
1 3.0 4
2 NaN 6
In [6]: df.groupby("A").mean()
Out[6]:
B
A
1.0 2
3.0 4
In [7]: df.groupby("A", dropna=False).mean()
Out[7]:
B
A
1.0 2
3.0 4
NaN 6
There is problem if NaNs in columns in by parameter, then groups are removed.
So need replace NaN to some value not in Site column and after groupby replace back to NaNs:
Thanks Zero for simplifying solution with fillna in groupby:
df1= (df.groupby([df['ID'],df['Site'].fillna('tmp')])
.agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
.reset_index()
.replace({'Site':{'tmp': np.nan}}))
If need NaNs in MultiIndex:
s = (df.groupby([df['ID'],df['Site'].fillna('tmp')])
.agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
.rename(index={'tmp':np.nan}))
Sample:
df = pd.DataFrame({'A':list('abcdef'),
'Site':[np.nan,'a',np.nan,'b','b','a'],
'Start Date':pd.date_range('2017-01-01', periods=6),
'End Date':pd.date_range('2017-11-11', periods=6),
'Value':[7,3,6,9,2,1],
'ID':list('aaabbb')})
print (df)
A End Date ID Site Start Date Value
0 a 2017-11-11 a NaN 2017-01-01 7
1 b 2017-11-12 a a 2017-01-02 3
2 c 2017-11-13 a NaN 2017-01-03 6
3 d 2017-11-14 b b 2017-01-04 9
4 e 2017-11-15 b b 2017-01-05 2
5 f 2017-11-16 b a 2017-01-06 1
df1= (df.groupby([df['ID'],df['Site'].fillna('tmp')])
.agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
.reset_index()
.replace({'Site':{'tmp': np.nan}}))
print (df1)
ID Site End Date Start Date Value
0 a a 2017-11-12 2017-01-02 3
1 a NaN 2017-11-13 2017-01-01 13
2 b a 2017-11-16 2017-01-06 1
3 b b 2017-11-15 2017-01-04 11
s = (df.groupby([df['ID'],df['Site'].fillna('tmp')])
.agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
.rename(index={'tmp':np.nan}))
print (s)
End Date Start Date Value
ID Site
a a 2017-11-12 2017-01-02 3
NaN 2017-11-13 2017-01-01 13
b a 2017-11-16 2017-01-06 1
b 2017-11-15 2017-01-04 11
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With