Pandas groupby: include all rows, even the ones with empty column values

Tags: python, pandas, pandas-groupby

I am using Pandas and experimenting to fully understand some of its functionality.

I am grouping and aggregating my data after I load everything from a csv using the following code:

s = df.groupby(['ID','Site']).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
print(s)

It works with the following file:

[image: screenshot of the first CSV file]

but it does not work with this file:

[image: screenshot of the second CSV file]

For the second file, I get data only for ID 56311. The reason is that some columns have empty values, but that should not matter for grouping. I have not found anything relevant about this; I have only found how to exclude null values.

Apart from this issue, what are the main things I should take into account before grouping? Is there any chance that rows will be excluded because of, for example, a format (date or number)?

asked Oct 22 '17 14:10

Datacrawler

2 Answers

In pandas 1.1.0 and later, you can pass dropna=False to groupby to keep NaN values as group keys (see pandas.DataFrame.groupby).

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: pd.__version__
Out[3]: '1.1.2'

In [4]: df = pd.DataFrame([[1, 2], [3, 4], [np.nan, 6]], columns=["A", "B"])

In [5]: df
Out[5]: 
     A  B
0  1.0  2
1  3.0  4
2  NaN  6

In [6]: df.groupby("A").mean()
Out[6]: 
       B
A       
1.0  2.0
3.0  4.0

In [7]: df.groupby("A", dropna=False).mean()
Out[7]: 
       B
A       
1.0  2.0
3.0  4.0
NaN  6.0
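Applied to the grouping from the question, the same flag might be used as in the sketch below. The data here is made up for illustration (only the column names and the ID 56311 come from the question), and it assumes pandas 1.1.0 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that reuse the question's column names.
df = pd.DataFrame({
    'ID': [56311, 56311, 56312, 56312],
    'Site': ['X', 'X', np.nan, np.nan],
    'Start Date': pd.to_datetime(['2017-01-01', '2017-01-05',
                                  '2017-02-01', '2017-02-03']),
    'End Date': pd.to_datetime(['2017-01-10', '2017-01-20',
                                '2017-02-10', '2017-02-15']),
    'Value': [1, 2, 3, 4],
})

aggregations = {'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'}

# Default behaviour: rows whose Site key is NaN are left out of the result.
dropped = df.groupby(['ID', 'Site']).agg(aggregations)

# dropna=False keeps NaN as a group key of its own.
kept = df.groupby(['ID', 'Site'], dropna=False).agg(aggregations)

print(dropped)
print(kept)
```

With this made-up data, the default call returns a single group for ID 56311, while the dropna=False call also returns the group whose Site is NaN.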

answered Sep 19 '22 15:09

ostrokach

The problem is that if the columns passed to the by parameter contain NaNs, the corresponding groups are removed.

So you need to replace NaN with some value that does not occur in the Site column and, after the groupby, replace it back with NaN.

Thanks to Zero for simplifying the solution by using fillna inside groupby:

df1= (df.groupby([df['ID'],df['Site'].fillna('tmp')])
        .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
        .reset_index()
        .replace({'Site':{'tmp': np.nan}}))

If you need the NaNs in the MultiIndex:

s = (df.groupby([df['ID'],df['Site'].fillna('tmp')])
       .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
       .rename(index={'tmp':np.nan}))

Sample:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':list('abcdef'),
                   'Site':[np.nan,'a',np.nan,'b','b','a'],
                   'Start Date':pd.date_range('2017-01-01', periods=6),
                   'End Date':pd.date_range('2017-11-11', periods=6),
                   'Value':[7,3,6,9,2,1],
                   'ID':list('aaabbb')})

print (df)
   A   End Date ID Site Start Date  Value
0  a 2017-11-11  a  NaN 2017-01-01      7
1  b 2017-11-12  a    a 2017-01-02      3
2  c 2017-11-13  a  NaN 2017-01-03      6
3  d 2017-11-14  b    b 2017-01-04      9
4  e 2017-11-15  b    b 2017-01-05      2
5  f 2017-11-16  b    a 2017-01-06      1

df1= (df.groupby([df['ID'],df['Site'].fillna('tmp')])
        .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
        .reset_index()
        .replace({'Site':{'tmp': np.nan}}))

print (df1)
  ID Site   End Date Start Date  Value
0  a    a 2017-11-12 2017-01-02      3
1  a  NaN 2017-11-13 2017-01-01     13
2  b    a 2017-11-16 2017-01-06      1
3  b    b 2017-11-15 2017-01-04     11

s = (df.groupby([df['ID'],df['Site'].fillna('tmp')])
       .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
       .rename(index={'tmp':np.nan}))

print (s)
          End Date Start Date  Value
ID Site                             
a  a    2017-11-12 2017-01-02      3
   NaN  2017-11-13 2017-01-01     13
b  a    2017-11-16 2017-01-06      1
   b    2017-11-15 2017-01-04     11
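Before applying the workaround, it may help to check how many rows the default groupby is leaving out and whether the placeholder is safe to use. The following is only a sketch on a cut-down version of the sample above; the variable names (keys, rows_lost, placeholder_is_safe) are illustrative and not part of pandas.

```python
import numpy as np
import pandas as pd

# Same ID / Site / Value columns as the sample above.
df = pd.DataFrame({'ID': list('aaabbb'),
                   'Site': [np.nan, 'a', np.nan, 'b', 'b', 'a'],
                   'Value': [7, 3, 6, 9, 2, 1]})

keys = ['ID', 'Site']

# Number of missing values in each grouping column.
nan_per_key = df[keys].isna().sum()

# Rows that land in some group under the default behaviour,
# compared with the total number of rows.
rows_grouped = int(df.groupby(keys).size().sum())
rows_lost = len(df) - rows_grouped

# The placeholder must not collide with a real Site value.
placeholder = 'tmp'
placeholder_is_safe = placeholder not in set(df['Site'].dropna())

print(nan_per_key)
print(rows_lost, placeholder_is_safe)
```

If rows_lost is greater than zero, some rows have NaN in a grouping column, and if placeholder_is_safe is False, a different fill value should be chosen.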

answered Sep 16 '22 15:09

jezrael
