pandas - groupby and filtering for consecutive values

Tags: python, pandas, dataframe, time-series

I have this dataframe df:

U,Datetime
01,2015-01-01 20:00:00
01,2015-02-01 20:05:00
01,2015-04-01 21:00:00
01,2015-05-01 22:00:00
01,2015-07-01 22:05:00
02,2015-08-01 20:00:00
02,2015-09-01 21:00:00
02,2014-01-01 23:00:00
02,2014-02-01 22:05:00
02,2015-01-01 20:00:00
02,2014-03-01 21:00:00
03,2015-10-01 20:00:00
03,2015-11-01 21:00:00
03,2015-12-01 23:00:00
03,2015-01-01 22:05:00
03,2015-02-01 20:00:00
03,2015-05-01 21:00:00
03,2014-01-01 20:00:00
03,2014-02-01 21:00:00

It is made up of a U column and a Datetime object. What I would like to do is filter the U values that have at least three consecutive occurrences in months/year. So far I have grouped by U, year and month as:

m = df.groupby(['U',df.index.year,df.index.month]).size()

obtaining:

U          
1  2015  1     1
         2     1
         4     1
         5     1
         7     1
2  2014  1     1
         2     1
         3     1
   2015  1     1
         8     1
         9     1
3  2014  1     1
         2     1
   2015  1     1
         2     1
         5     1
         10    1
         11    1
         12    1

The third column is related to the occurrences in different months/year. In this case only the U values 02 and 03 contain at least three consecutive values in months/year. Now I can't figure out how to select those users and get them out in a list, for instance, or just keep them in the original dataframe df and discard the others. I also tried:

g = m.groupby(level=[0,1]).diff()

But I can't get any useful information.

asked Nov 18 '15 14:11

Fabio Lamanna

1 Answer

Finally I could come up with a solution :)

To give you an idea of how the custom function works: it subtracts from each month value the one preceding it. For consecutive months the result is 1, and this has to happen twice in a row. For example, for the list [5, 6, 7] we get 7 - 6 = 1 and 6 - 5 = 1; the value 1 appears twice, so the condition is fulfilled.
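The subtraction step can be tried on its own with a small sketch. The Series below is a made-up toy input matching the [5, 6, 7] example, not data from the question.

```python
import pandas as pd

# Toy input: three consecutive months, as in the [5, 6, 7] example.
months = pd.Series([5, 6, 7])

# diff() subtracts the preceding value from each value; the first entry is NaN
# because it has no predecessor.
month_diff = months.diff().tolist()
print(month_diff)  # [nan, 1.0, 1.0]

# In this toy case both differences are 1, which corresponds to
# three consecutive months.
ones = sum(1 for d in month_diff if d == 1)
print(ones)  # 2
```

Counting the ones is enough here only because the list is so short; in general the two differences of 1 must be adjacent, which is what the custom function checks.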

In [80]:
df.reset_index(inplace=True)

In [281]:
df['month'] = df.Datetime.dt.month
df['year'] = df.Datetime.dt.year
df
Out[281]:
            Datetime    U   month   year
0   2015-01-01 20:00:00 1   1       2015
1   2015-02-01 20:05:00 1   2       2015
2   2015-04-01 21:00:00 1   4       2015
3   2015-05-01 22:00:00 1   5       2015
4   2015-07-01 22:05:00 1   7       2015
5   2015-08-01 20:00:00 2   8       2015
6   2015-09-01 21:00:00 2   9       2015
7   2014-01-01 23:00:00 2   1       2014
8   2014-02-01 22:05:00 2   2       2014
9   2015-01-01 20:00:00 2   1       2015
10  2014-03-01 21:00:00 2   3       2014
11  2015-10-01 20:00:00 3   10      2015
12  2015-11-01 21:00:00 3   11      2015
13  2015-12-01 23:00:00 3   12      2015
14  2015-01-01 22:05:00 3   1       2015
15  2015-02-01 20:00:00 3   2       2015
16  2015-05-01 21:00:00 3   5       2015
17  2014-01-01 20:00:00 3   1       2014
18  2014-02-01 21:00:00 3   2       2014

In [284]:
g = df.groupby([df['U'] , df.year])

In [86]:
res = g.filter(lambda x : is_at_least_three_consec(x['month'].diff().values.tolist()))
res
Out[86]:
      Datetime          U   month   year
7   2014-01-01 23:00:00 2   1       2014
8   2014-02-01 22:05:00 2   2       2014
10  2014-03-01 21:00:00 2   3       2014
11  2015-10-01 20:00:00 3   10      2015
12  2015-11-01 21:00:00 3   11      2015
13  2015-12-01 23:00:00 3   12      2015
14  2015-01-01 22:05:00 3   1       2015
15  2015-02-01 20:00:00 3   2       2015
16  2015-05-01 21:00:00 3   5       2015

If you want to see the result of the custom function for each group:

In [84]:
res = g['month'].agg(lambda x : is_at_least_three_consec(x.diff().values.tolist()))
res
Out[84]:
U  year
1  2015    False
2  2014     True
   2015    False
3  2014    False
   2015     True
Name: month, dtype: bool
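The question also asks for the matching users as a list, or for the original dataframe restricted to them. A minimal sketch of that step follows, assuming the sample data from the question is loaded from an inline CSV string and U is read as a string. The names flags, users and kept are illustrative, and the function body repeats the answer's logic so that the block runs on its own.

```python
import io
import pandas as pd

# Sample data copied from the question; U is kept as a string ("01", "02", "03").
csv_text = """U,Datetime
01,2015-01-01 20:00:00
01,2015-02-01 20:05:00
01,2015-04-01 21:00:00
01,2015-05-01 22:00:00
01,2015-07-01 22:05:00
02,2015-08-01 20:00:00
02,2015-09-01 21:00:00
02,2014-01-01 23:00:00
02,2014-02-01 22:05:00
02,2015-01-01 20:00:00
02,2014-03-01 21:00:00
03,2015-10-01 20:00:00
03,2015-11-01 21:00:00
03,2015-12-01 23:00:00
03,2015-01-01 22:05:00
03,2015-02-01 20:00:00
03,2015-05-01 21:00:00
03,2014-01-01 20:00:00
03,2014-02-01 21:00:00
"""
df = pd.read_csv(io.StringIO(csv_text), dtype={"U": str}, parse_dates=["Datetime"])
df["month"] = df["Datetime"].dt.month
df["year"] = df["Datetime"].dt.year


def is_at_least_three_consec(month_diff):
    # Same logic as the answer's function: True once two differences of 1
    # occur back to back.
    consec_count = 0
    for index, val in enumerate(month_diff):
        if index != 0 and val == 1:
            consec_count += 1
            if consec_count == 2:
                return True
        else:
            consec_count = 0
    return False


# One boolean per (U, year) group, as in the output above.
flags = df.groupby(["U", "year"])["month"].agg(
    lambda x: is_at_least_three_consec(x.diff().tolist())
)

# A user qualifies if any of its years qualifies.
users = sorted({u for (u, year), ok in flags.items() if ok})
print(users)  # ['02', '03']

# Keep every row of the qualifying users in the original dataframe.
kept = df[df["U"].isin(users)]
```

Unlike g.filter, which keeps only the rows of the qualifying (U, year) groups, the last line keeps all rows of each qualifying user.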

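This is how the custom function is implemented (it has to be defined before the cells above are run):

In [53]:
def is_at_least_three_consec(month_diff):
    # month_diff holds the month-to-month differences of one (U, year) group;
    # the first entry is NaN because diff() has no preceding value there.
    consec_count = 0
    for index, val in enumerate(month_diff):
        if index != 0 and val == 1:
            # Two differences of 1 in a row mean three consecutive months.
            consec_count += 1
            if consec_count == 2:
                return True
        else:
            consec_count = 0

    return False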
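Note that this function depends on the rows of each (U, year) group already being in chronological order, and because the grouping is per year it cannot see a run such as November, December, January. If that matters, one possible variation (not part of the original answer) is to turn each date into a month ordinal, year * 12 + month, and look for runs over the sorted unique ordinals of each user. The helper name has_consecutive_months, the min_run parameter and the users A and B are made up for this sketch.

```python
import pandas as pd


def has_consecutive_months(dates, min_run=3):
    """Return True if `dates` cover at least `min_run` consecutive calendar months."""
    # Month ordinal: December of one year and January of the next differ by 1.
    ordinals = sorted(set(dates.dt.year * 12 + dates.dt.month))
    run = 0
    prev = None
    for cur in ordinals:
        # Extend the run if this month directly follows the previous one,
        # otherwise start a new run of length 1.
        run = run + 1 if prev is not None and cur - prev == 1 else 1
        if run >= min_run:
            return True
        prev = cur
    return False


# Hypothetical data: user A has Nov 2014, Dec 2014, Jan 2015 (unsorted, across
# a year boundary); user B has Jan, Feb, Apr 2015.
df = pd.DataFrame({
    "U": ["A", "A", "A", "B", "B", "B"],
    "Datetime": pd.to_datetime([
        "2015-01-10", "2014-11-05", "2014-12-20",
        "2015-01-01", "2015-02-01", "2015-04-01",
    ]),
})

# One flag per user, grouping by U only so that runs may cross years.
flags = {u: has_consecutive_months(g["Datetime"]) for u, g in df.groupby("U")}
users = [u for u, ok in flags.items() if ok]
print(users)  # ['A']
```

Sorting and de-duplicating the ordinals first means that the row order and repeated entries within the same month do not affect the result.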

answered Oct 06 '22 13:10

Nader Hisham
