Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas group by : Include all rows even the ones with empty column values

I am using Pandas and trying to test something to fully understand some functionalities.

I am grouping and aggregating my data after I load everything from a csv using the following code:

s = df.groupby(['ID','Site']).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
print(s)

and it works with the following file:

enter image description here

but it does not work with this file:

enter image description here

For the second file, I am getting the data only for the 56311 ID. The reason is that some columns have empty values. But that should not matter. I have not found anything relevant about that. I have only found how to exclude the null columns.

Except for this issue, what are the main things that I should take into account before grouping? Is there any chance that rows will be excluded because for example of a format (date or number)?

like image 617
Datacrawler Avatar asked Oct 22 '17 14:10

Datacrawler


People also ask

How do I group specific rows in pandas?

You can group DataFrame rows into a list by using pandas. DataFrame. groupby() function on the column of interest, select the column you want as a list from group and then use Series. apply(list) to get the list for every group.

What does Group_by do in pandas?

What is the GroupBy function? Pandas' GroupBy is a powerful and versatile function in Python. It allows you to split your data into separate groups to perform computations for better analysis.

How do I fill empty columns in pandas?

You can replace blank/empty values with DataFrame. replace() methods. The replace() method replaces the specified value with another specified value on a specified column or on all columns of a DataFrame; replaces every case of the specified value. Yields below output.


2 Answers

In Pandas versions > 1.1.0, you can pass dropna=False to keep NaN values (see pandas.DataFrame.groupby).

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: pd.__version__
Out[3]: '1.1.2'

In [4]: df = pd.DataFrame([[1, 2], [3, 4], [np.nan, 6]], columns=["A", "B"])

In [5]: df
Out[5]: 
     A  B
0  1.0  2
1  3.0  4
2  NaN  6

In [6]: df.groupby("A").mean()
Out[6]: 
     B
A     
1.0  2
3.0  4

In [7]: df.groupby("A", dropna=False).mean()
Out[7]: 
     B
A     
1.0  2
3.0  4
NaN  6
like image 158
ostrokach Avatar answered Sep 19 '22 15:09

ostrokach


There is problem if NaNs in columns in by parameter, then groups are removed.

So need replace NaN to some value not in Site column and after groupby replace back to NaNs:

Thanks Zero for simplifying solution with fillna in groupby:

df1= (df.groupby([df['ID'],df['Site'].fillna('tmp')])
        .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
        .reset_index()
        .replace({'Site':{'tmp': np.nan}}))

If need NaNs in MultiIndex:

s = (df.groupby([df['ID'],df['Site'].fillna('tmp')])
       .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
       .rename(index={'tmp':np.nan}))

Sample:

df = pd.DataFrame({'A':list('abcdef'),
                   'Site':[np.nan,'a',np.nan,'b','b','a'],
                   'Start Date':pd.date_range('2017-01-01', periods=6),
                   'End Date':pd.date_range('2017-11-11', periods=6),
                   'Value':[7,3,6,9,2,1],
                   'ID':list('aaabbb')})

print (df)
   A   End Date ID Site Start Date  Value
0  a 2017-11-11  a  NaN 2017-01-01      7
1  b 2017-11-12  a    a 2017-01-02      3
2  c 2017-11-13  a  NaN 2017-01-03      6
3  d 2017-11-14  b    b 2017-01-04      9
4  e 2017-11-15  b    b 2017-01-05      2
5  f 2017-11-16  b    a 2017-01-06      1

df1= (df.groupby([df['ID'],df['Site'].fillna('tmp')])
        .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
        .reset_index()
        .replace({'Site':{'tmp': np.nan}}))

print (df1)
  ID Site   End Date Start Date  Value
0  a    a 2017-11-12 2017-01-02      3
1  a  NaN 2017-11-13 2017-01-01     13
2  b    a 2017-11-16 2017-01-06      1
3  b    b 2017-11-15 2017-01-04     11

s = (df.groupby([df['ID'],df['Site'].fillna('tmp')])
       .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
       .rename(index={'tmp':np.nan}))

print (s)
          End Date Start Date  Value
ID Site                             
a  a    2017-11-12 2017-01-02      3
   NaN  2017-11-13 2017-01-01     13
b  a    2017-11-16 2017-01-06      1
   b    2017-11-15 2017-01-04     11
like image 34
jezrael Avatar answered Sep 16 '22 15:09

jezrael