Grouping a Pandas DataFrame by n days starting at the beginning of the day

Tags:

pandas

I have just discovered the power of Pandas and I love it, but I can't figure out this problem:

I have a DataFrame; df.head() gives:

         lon        lat         h  filename                       time
0  19.961216  80.617627 -0.077165     60048 2002-05-15 12:59:31.717467
1  19.923916  80.614847 -0.018689     60048 2002-05-15 12:59:31.831467
2  19.849396  80.609257 -0.089205     60048 2002-05-15 12:59:32.059467
3  19.830776  80.607857  0.076485     60048 2002-05-15 12:59:32.116467
4  19.570708  80.588183  0.162943     60048 2002-05-15 12:59:32.888467

I would like to group my data into nine-day intervals:

gb = df.groupby(pd.TimeGrouper(key='time', freq='9D'))

The first group:

2002-05-15 12:59:31.717467       lon   lat  h filename                  time
0    19.961216  80.617627    -0.077165     60048 2002-05-15 12:59:31.717467
1    19.923916  80.614847    -0.018689     60048 2002-05-15 12:59:31.831467
2    19.849396  80.609257    -0.089205     60048 2002-05-15 12:59:32.059467
3    19.830776  80.607857     0.076485     60048 2002-05-15 12:59:32.116467
...

Next group:

2002-05-24 12:59:31.717467        lon   lat  h  filename                  time
815   18.309498  80.457024     0.187387     60309 2002-05-24 16:35:39.553563
816   18.291458  80.458514     0.061446     60309 2002-05-24 16:35:39.610563
817   18.273408  80.460014     0.129255     60309 2002-05-24 16:35:39.667563
818   18.255358  80.461504     0.046761     60309 2002-05-24 16:35:39.724563
...

So the data are grouped into nine-day intervals counting from the first timestamp (12:59:31.717467), not from the beginning of the day as I would like.

When grouping by one day:

gb = df.groupby(pd.TimeGrouper(key='time', freq='D'))

gives me:

2002-05-15 00:00:00       lon   lat  h  filename                  time
0    19.961216  80.617627    -0.077165     60048 2002-05-15 12:59:31.717467
1    19.923916  80.614847    -0.018689     60048 2002-05-15 12:59:31.831467
2    19.849396  80.609257    -0.089205     60048 2002-05-15 12:59:32.059467
3    19.830776  80.607857     0.076485     60048 2002-05-15 12:59:32.116467
...

I could loop over the days until I have a nine-day interval, but I suspect there is a smarter way. I am looking for a Grouper freq option equivalent to YS (start of year) but for days, or a way of setting the start time (maybe through the Grouper option convention : {‘start’, ‘end’, ‘e’, ‘s’}), or something else entirely.

I am running Python 3.5.2 and pandas 0.19.0.

asked Nov 11 '16 14:11

user1643523

1 Answer

Dropping the first row's time:

Your best bet would be to normalize the first row of the datetime column, so that its time is reset to 00:00:00 (midnight), and then group by the 9D interval:

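# Reset the first timestamp to midnight so the 9-day bins start there
df.loc[0, 'time'] = df['time'].iloc[0].normalize()
# pd.Grouper is the current spelling; pd.TimeGrouper was deprecated in pandas 0.21
for _, grp in df.groupby(pd.Grouper(key='time', freq='9D')):
    print(grp)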

#          lon        lat         h  filename                       time
# 0  19.961216  80.617627 -0.077165     60048 2002-05-15 00:00:00.000000
# 1  19.923916  80.614847 -0.018689     60048 2002-05-15 12:59:31.831467
# 2  19.849396  80.609257 -0.089205     60048 2002-05-15 12:59:32.059467
# 3  19.830776  80.607857  0.076485     60048 2002-05-15 12:59:32.116467
# 4  19.570708  80.588183  0.162943     60048 2002-05-15 12:59:32.888467
# ......................................................................

Only the first row's timestamp changes; every other row keeps its original time, so that information is not lost.
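
If you are on a newer pandas (1.1 or later, which is an assumption since the question targets 0.19.0), pd.Grouper accepts an origin argument that controls where the bins are anchored, so the first timestamp does not need to be overwritten at all. The sketch below uses made-up timestamps and a hypothetical helper, bin_sizes, to compare origin='start' (bins counted from the first timestamp) with origin='start_day' (bins counted from midnight of the first day):

```python
import pandas as pd

# Made-up timestamps standing in for the 'time' column of the question.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2002-05-15 12:59:31",
        "2002-05-20 08:00:00",
        "2002-05-24 06:00:00",  # nine days later, but before 12:59:31
        "2002-05-24 16:35:39",
    ]),
    "h": [0.1, 0.2, 0.3, 0.4],
})


def bin_sizes(origin):
    """Count rows per 9-day bin for a given bin origin (hypothetical helper)."""
    grouper = pd.Grouper(key="time", freq="9D", origin=origin)
    return df.groupby(grouper)["h"].count()


from_first_sample = bin_sizes("start")   # bins anchored at 12:59:31
from_midnight = bin_sizes("start_day")   # bins anchored at 00:00:00

print(from_first_sample)
print(from_midnight)
```

With origin='start_day' the row at 06:00 on 2002-05-24 falls into the second bin, which is the behaviour asked for, and no timestamp in df is modified.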

Keeping the first row's time:

If you want to keep the first row's time unchanged and only want the grouping to start at midnight, you could insert a placeholder row at midnight instead:

df_t_shift = df.shift()    # shift every row down by one; row 0 becomes NaN/NaT
# Give the empty row 0 a timestamp at midnight of the first day
df_t_shift.loc[0, 'time'] = df_t_shift['time'].iloc[1].normalize()
# Re-attach the last row of df, which shift() pushed off the end
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_t_shift = pd.concat([df_t_shift, df.iloc[[-1]]], ignore_index=True)

for _, grp in df_t_shift.groupby(pd.Grouper(key='time', freq='9D')):
    print(grp)

#          lon        lat         h  filename                       time
# 0        NaN        NaN       NaN       NaN 2002-05-15 00:00:00.000000
# 1  19.961216  80.617627 -0.077165   60048.0 2002-05-15 12:59:31.717467
# 2  19.923916  80.614847 -0.018689   60048.0 2002-05-15 12:59:31.831467
# 3  19.849396  80.609257 -0.089205   60048.0 2002-05-15 12:59:32.059467
# 4  19.830776  80.607857  0.076485   60048.0 2002-05-15 12:59:32.116467
# 5  19.570708  80.588183  0.162943   60048.0 2002-05-15 12:59:32.888467
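
If the placeholder row of NaN values is unwanted, another possible approach is to compute a bin label for every row and group on that, leaving df untouched. This is only a sketch on made-up timestamps; the names start, bin_number and bin_start are illustrative and not part of pandas:

```python
import pandas as pd

# Made-up timestamps standing in for the 'time' column of the question.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2002-05-15 12:59:31",
        "2002-05-20 08:00:00",
        "2002-05-24 06:00:00",
        "2002-05-24 16:35:39",
    ]),
    "h": [0.1, 0.2, 0.3, 0.4],
})

# Midnight of the earliest day opens bin 0.
start = df["time"].min().normalize()

# Whole days elapsed since that midnight, integer-divided into 9-day bins.
bin_number = (df["time"] - start).dt.days // 9

# Label each row with the midnight that opens its bin.
bin_start = (start + pd.to_timedelta(bin_number * 9, unit="D")).rename("bin_start")

for label, grp in df.groupby(bin_start):
    print(label)
    print(grp)
```

Unlike a frequency-based Grouper, this only yields bins that actually contain rows, so empty nine-day intervals are skipped rather than returned as empty groups.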

answered Nov 13 '22 09:11

Nickil Maveli
