How to group DataFrame by a period of time?

Tags: python, pandas

I have some data from log files and would like to group entries by minute:

    from datetime import datetime
    from random import randint

    from pandas import DataFrame, DateOffset

    def gen(date, count=10):
        # Yield (time, event, source) rows spaced 0 to 40 seconds apart.
        while count > 0:
            yield date, "event{}".format(randint(1, 9)), "source{}".format(randint(1, 3))
            count -= 1
            date += DateOffset(seconds=randint(0, 40))

    df = DataFrame.from_records(list(gen(datetime(2012, 1, 1, 12, 30))),
                                index='Time', columns=['Time', 'Event', 'Source'])

df:

                         Event   Source
    2012-01-01 12:30:00  event3  source1
    2012-01-01 12:30:12  event2  source2
    2012-01-01 12:30:12  event2  source2
    2012-01-01 12:30:29  event6  source1
    2012-01-01 12:30:38  event1  source1
    2012-01-01 12:31:05  event4  source2
    2012-01-01 12:31:38  event4  source1
    2012-01-01 12:31:44  event5  source1
    2012-01-01 12:31:48  event5  source2
    2012-01-01 12:32:23  event6  source1

I tried these options:

1. df.resample('Min') is too high level and wants to aggregate.
2. df.groupby(date_range(datetime(2012,1,1,12, 30), freq='Min', periods=4)) fails with an exception.
3. df.groupby(TimeGrouper(freq='Min')) works fine and returns a DataFrameGroupBy object for further processing, e.g.:

       grouped = df.groupby(TimeGrouper(freq='Min'))
       grouped.Source.value_counts()

       2012-01-01 12:30:00  source1    1
       2012-01-01 12:31:00  source2    2
                            source1    2
       2012-01-01 12:32:00  source2    2
                            source1    2
       2012-01-01 12:33:00  source1    1

However, the TimeGrouper class is not documented.

What is the correct way to group by a period of time? How can I group the data by a minute AND by the Source column, e.g. groupby([TimeGrouper(freq='Min'), df.Source])?

asked Jun 17 '12 18:06

serguei

2 Answers

You can group on any array or Series of the same length as your DataFrame, even a computed factor that's not actually a column of the DataFrame. So to group by minute you can do:

df.groupby(df.index.map(lambda t: t.minute))

If you want to group by minute and something else, just mix the above with the column you want to use:

df.groupby([df.index.map(lambda t: t.minute), 'Source'])

Personally I find it useful to just add columns to the DataFrame to store some of these computed things (e.g., a "Minute" column) if I want to group by them often, since it makes the grouping code less verbose.
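As a minimal sketch of the "store it as a column" idea, assuming a small hypothetical sample rather than the log data from the question: the column names MinuteOfHour and Minute are made up for illustration. It also shows one thing worth knowing about the lambda above: t.minute is the minute of the hour (0 to 59), so entries from different hours can share a group, whereas flooring the timestamp keeps them apart.

```python
import pandas as pd

# Hypothetical sample: two entries share minute 30 but fall in different hours.
df = pd.DataFrame(
    {"Source": ["source1", "source2", "source1"]},
    index=pd.to_datetime(
        ["2012-01-01 12:30:00", "2012-01-01 12:31:05", "2012-01-01 13:30:10"]
    ),
)

# Minute of the hour (0-59): 12:30 and 13:30 end up in the same group.
df["MinuteOfHour"] = df.index.minute

# Full timestamp truncated to the minute: 12:30 and 13:30 stay separate.
df["Minute"] = df.index.floor("min")

by_minute_of_hour = df.groupby("MinuteOfHour").size()
by_minute = df.groupby(["Minute", "Source"]).size()

print(by_minute_of_hour)
print(by_minute)
```

Which of the two keys is appropriate depends on whether you want "all events in minute 30 of any hour" or "all events in one specific minute".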

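Or you could try something like this (pd.TimeGrouper has since been deprecated and removed from pandas; pd.Grouper with a freq argument is its replacement):

df.groupby([df['Source'], pd.Grouper(freq='Min')])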
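A minimal runnable sketch of this combined grouping, assuming a recent pandas where pd.Grouper(freq=...) is the time-based grouper and the DataFrame has a DatetimeIndex. The sample rows are hypothetical stand-ins for the log entries.

```python
import pandas as pd

# Hypothetical sample data with a DatetimeIndex, as in the question.
df = pd.DataFrame(
    {
        "Event": ["event3", "event2", "event6", "event4"],
        "Source": ["source1", "source2", "source1", "source2"],
    },
    index=pd.to_datetime(
        [
            "2012-01-01 12:30:00",
            "2012-01-01 12:30:12",
            "2012-01-01 12:30:29",
            "2012-01-01 12:31:05",
        ]
    ),
)

# pd.Grouper without a key bins the index; "Source" is an ordinary column key.
grouped = df.groupby([pd.Grouper(freq="min"), "Source"])

# Number of events per (minute, source) pair.
counts = grouped["Event"].count()
print(counts)
```

The result is a Series indexed by (minute, Source), and the grouped object can be reused for other aggregations.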

answered Oct 17 '22 02:10

BrenBarn

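Since the original answer is rather old and pandas has since introduced periods, a different solution is available nowadays:

df.groupby(df.index.to_period('min'))

Additionally, you can resample, which returns a Resampler object that still needs an aggregation such as .count():

df.resample('min')

('min' is the current spelling of the minute alias; the older 'T' is deprecated in recent pandas releases.)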
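To make the difference between the two approaches concrete, here is a small sketch on hypothetical sample data with a deliberate gap at 12:32. It assumes a pandas version that accepts the 'min' alias. Grouping by period yields only the minutes that actually occur, while resampling produces every minute bin in the range, including empty ones.

```python
import pandas as pd

# Hypothetical sample data; note that no entry falls into minute 12:32.
df = pd.DataFrame(
    {
        "Event": ["event3", "event2", "event4", "event6"],
        "Source": ["source1", "source2", "source2", "source1"],
    },
    index=pd.to_datetime(
        [
            "2012-01-01 12:30:00",
            "2012-01-01 12:30:12",
            "2012-01-01 12:31:05",
            "2012-01-01 12:33:23",
        ]
    ),
)

# groupby on minute periods: one group per minute that has data.
sizes = df.groupby(df.index.to_period("min")).size()

# resample: one bin per minute between the first and last entry.
per_minute = df.resample("min")["Event"].count()

print(sizes)
print(per_minute)
```

If empty minutes matter (for example when plotting event rates), resample is the more convenient choice; if only observed minutes matter, the groupby form avoids the padding.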

answered Oct 17 '22 03:10

Quickbeam2k1
