I am using Pandas and trying to test something to fully understand some functionalities.
I am grouping and aggregating my data after I load everything from a csv using the following code:
s = df.groupby(['ID','Site']).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
print(s)
and it works with the following file:
but it does not work with this file:
For the second file, I am getting the data only for the 56311 ID. The reason is that some columns have empty values. But that should not matter. I have not found anything relevant about that. I have only found how to exclude the null columns.
Apart from this issue, what are the main things I should take into account before grouping? Is there any chance that rows will be excluded because of, for example, a format (date or number)?
You can group DataFrame rows into a list by calling pandas.DataFrame.groupby() on the column of interest, selecting the column you want as a list from each group, and then using Series.apply(list) to get the list for every group.
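As a rough illustration of the groupby-then-apply(list) pattern just described, here is a minimal sketch. The frame and its column names (ID, Value) are made up for the example and are not taken from the question's files.

```python
import pandas as pd

# Hypothetical data, only for illustration.
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2],
    "Value": [10, 20, 30, 40, 50],
})

# Group by ID, pick the Value column, and collect each group's values in a list.
lists = df.groupby("ID")["Value"].apply(list)
print(lists)
```

The result is a Series indexed by ID, with one Python list per group.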
What is the GroupBy function? Pandas' GroupBy is a powerful and versatile function in Python. It allows you to split your data into separate groups to perform computations for better analysis.
You can replace blank/empty values with the DataFrame.replace() method. It replaces every occurrence of the specified value with another value, either in a specified column or in all columns of a DataFrame.
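A small sketch of that replace() step, assuming the "empty" cells are empty or whitespace-only strings (the question does not say how the blanks look after read_csv, so this is only one possibility). The data is invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where Site contains an empty and a whitespace-only string.
df = pd.DataFrame({
    "ID": [1, 1, 2],
    "Site": ["a", "", "  "],
    "Value": [5, 6, 7],
})

# Regex replace: cells that are empty or contain only whitespace become NaN.
cleaned = df.replace(r"^\s*$", np.nan, regex=True)
n_missing = cleaned["Site"].isna().sum()
print(cleaned)
print(n_missing)
```

After this step the blanks are real NaN values, so they are handled like any other missing value by groupby.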
In pandas 1.1.0 and later, you can pass dropna=False to groupby to keep NaN values as group keys (see pandas.DataFrame.groupby).
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.__version__
Out[3]: '1.1.2'
In [4]: df = pd.DataFrame([[1, 2], [3, 4], [np.nan, 6]], columns=["A", "B"])
In [5]: df
Out[5]:
     A  B
0  1.0  2
1  3.0  4
2  NaN  6
In [6]: df.groupby("A").mean()
Out[6]:
     B
A
1.0  2
3.0  4
In [7]: df.groupby("A", dropna=False).mean()
Out[7]:
     B
A
1.0  2
3.0  4
NaN  6
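The same option should carry over to the two-key aggregation from the question. The sketch below assumes pandas 1.1 or later and uses a small invented frame with the question's column names, since the original CSV files are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical data using the question's column names; Site has missing values.
df = pd.DataFrame({
    "ID": ["a", "a", "a", "b"],
    "Site": [np.nan, "x", np.nan, "y"],
    "Start Date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04"]),
    "End Date": pd.to_datetime(["2017-11-11", "2017-11-12", "2017-11-13", "2017-11-14"]),
    "Value": [7, 3, 6, 9],
})

agg_spec = {"Start Date": "min", "End Date": "max", "Value": "sum"}

# Default behaviour: rows whose key contains NaN are left out of the result.
dropped = df.groupby(["ID", "Site"]).agg(agg_spec)

# dropna=False keeps (ID, NaN) as a group of its own.
kept = df.groupby(["ID", "Site"], dropna=False).agg(agg_spec)

print(len(dropped), len(kept))
print(kept)
```

Here the default call returns two groups and the dropna=False call returns three, the extra one being ID "a" with a missing Site.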
There is a problem if there are NaNs in the columns passed to the by parameter: those groups are removed.
So you need to replace NaN with some value that does not occur in the Site column, and after the groupby replace it back with NaN.
Thanks to Zero for simplifying the solution with fillna inside groupby:
df1= (df.groupby([df['ID'],df['Site'].fillna('tmp')])
.agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
.reset_index()
.replace({'Site':{'tmp': np.nan}}))
If you need NaNs in the MultiIndex:
s = (df.groupby([df['ID'],df['Site'].fillna('tmp')])
.agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
.rename(index={'tmp':np.nan}))
Sample:
df = pd.DataFrame({'A':list('abcdef'),
'Site':[np.nan,'a',np.nan,'b','b','a'],
'Start Date':pd.date_range('2017-01-01', periods=6),
'End Date':pd.date_range('2017-11-11', periods=6),
'Value':[7,3,6,9,2,1],
'ID':list('aaabbb')})
print (df)
   A   End Date ID Site Start Date  Value
0  a 2017-11-11  a  NaN 2017-01-01      7
1  b 2017-11-12  a    a 2017-01-02      3
2  c 2017-11-13  a  NaN 2017-01-03      6
3  d 2017-11-14  b    b 2017-01-04      9
4  e 2017-11-15  b    b 2017-01-05      2
5  f 2017-11-16  b    a 2017-01-06      1
df1= (df.groupby([df['ID'],df['Site'].fillna('tmp')])
.agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
.reset_index()
.replace({'Site':{'tmp': np.nan}}))
print (df1)
  ID Site   End Date Start Date  Value
0  a    a 2017-11-12 2017-01-02      3
1  a  NaN 2017-11-13 2017-01-01     13
2  b    a 2017-11-16 2017-01-06      1
3  b    b 2017-11-15 2017-01-04     11
s = (df.groupby([df['ID'],df['Site'].fillna('tmp')])
.agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
.rename(index={'tmp':np.nan}))
print (s)
          End Date Start Date  Value
ID Site
a  a    2017-11-12 2017-01-02      3
   NaN  2017-11-13 2017-01-01     13
b  a    2017-11-16 2017-01-06      1
   b    2017-11-15 2017-01-04     11
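Before grouping, it may also help to check how many rows have a missing key, since those are the rows a default groupby leaves out. This is only a sketch of one possible check, not part of the answer above. It reuses the ID, Site and Value columns of the sample, and the names keys, missing_per_key and rows_at_risk are my own.

```python
import numpy as np
import pandas as pd

# Same ID/Site/Value data as the sample above (date columns omitted for brevity).
df = pd.DataFrame({
    "ID": list("aaabbb"),
    "Site": [np.nan, "a", np.nan, "b", "b", "a"],
    "Value": [7, 3, 6, 9, 2, 1],
})
keys = ["ID", "Site"]

# Number of missing values in each grouping column.
missing_per_key = df[keys].isna().sum()

# Rows a default groupby would leave out because a key is NaN.
rows_at_risk = df[df[keys].isna().any(axis=1)]

# Compare the total before grouping with the total that survives the groupby.
total_before = df["Value"].sum()
total_grouped = df.groupby(keys)["Value"].sum().sum()

print(missing_per_key)
print(rows_at_risk)
print(total_before, total_grouped)  # 28 15
```

If the two totals differ, some rows were dropped because of NaN keys, and either dropna=False or the fillna approach is needed.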