
A per-hour histogram of datetime using Pandas

Assume I have a timestamp column of datetime values in a pandas.DataFrame. For the sake of example, the timestamps have one-second resolution. I would like to bin the events into 10-minute [1] buckets. I understand that I can represent each datetime as an integer timestamp and then use a histogram. Is there a simpler approach, something built into pandas?

[1] 10 minutes is only an example. Ultimately, I would like to use different resolutions.

Dror asked Jan 15 '16 15:01




1 Answer

To use a custom frequency like "10min", you have to use pd.Grouper (named TimeGrouper when this answer was written, since deprecated and removed from pandas), as suggested by @johnchase. By default it operates on the index.

import numpy as np
import pandas as pd

# Generate 10000 consecutive timestamps and draw 500 of them at random
df = pd.DataFrame(np.random.choice(pd.date_range(start=pd.to_datetime('2015-01-14'), periods=10000, freq='s'), 500), columns=['date'])
# Set the date as the index, since the grouper works on the index; the date column is kept (drop=False) so there is something to count
df.set_index('date', drop=False, inplace=True)
# Plot the histogram
df.groupby(pd.Grouper(freq='10min')).count().plot(kind='bar')
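
Since the question asks for different resolutions, the same grouping can be wrapped in a small function that takes the frequency as a parameter. The following is only a sketch and not part of the original answer: the helper name count_per_bucket is hypothetical, and it assumes a pandas version that provides pd.Grouper (the current name for the time-based grouper), using its key= argument to bin a column directly instead of setting the index.

```python
import numpy as np
import pandas as pd

def count_per_bucket(frame, column, freq):
    """Count rows per time bucket of width `freq` (hypothetical helper)."""
    # key= tells pd.Grouper to bin the named column rather than the index
    return frame.groupby(pd.Grouper(key=column, freq=freq)).size()

# Reproducible sample: 500 timestamps drawn from a 10000-second range
rng = np.random.default_rng(0)
dates = pd.date_range(start='2015-01-14', periods=10000, freq='s')
df = pd.DataFrame({'date': dates[rng.integers(0, len(dates), size=500)]})

per_10min = count_per_bucket(df, 'date', '10min')
per_hour = count_per_bucket(df, 'date', '1h')
# per_10min.plot(kind='bar') would draw the histogram
```

Each result is a Series indexed by the start of the bucket, so changing the resolution only means passing another frequency string.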


Using to_period

It is also possible to use the to_period method, but as far as I know it does not work with custom periods like "10Min". This example adds an extra column to simulate the category of an item.

# Number of samples
nb_sample = 500
# Generate a range of timestamps and draw a random subset from it
df = pd.DataFrame({'date': np.random.choice(pd.date_range(start=pd.to_datetime('2015-01-14'), periods=nb_sample*30, freq='s'), nb_sample),
                  'type': np.random.choice(['foo','bar','xxx'], nb_sample)})

# Group per hour and type
df = df.groupby([df['date'].dt.to_period('h'), 'type']).count().unstack()
# Drop the unnecessary column level
df.columns = df.columns.droplevel()
df.plot(kind='bar')
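
If a custom bucket such as 10 minutes is needed together with a second grouping key, one possible alternative to to_period is to round each timestamp down with Series.dt.floor and group on the result. This is a sketch under that assumption, not part of the original answer; the names bucket and counts are illustrative only.

```python
import numpy as np
import pandas as pd

nb_sample = 500
rng = np.random.default_rng(0)
dates = pd.date_range(start='2015-01-14', periods=nb_sample * 30, freq='s')
df = pd.DataFrame({
    'date': dates[rng.integers(0, len(dates), size=nb_sample)],
    'type': rng.choice(['foo', 'bar', 'xxx'], nb_sample),
})

# Round every timestamp down to the start of its 10-minute bucket
bucket = df['date'].dt.floor('10min')
# Count rows per (bucket, type), then pivot the types into columns
counts = df.groupby([bucket, 'type']).size().unstack(fill_value=0)
# counts.plot(kind='bar') would draw one group of bars per bucket
```

Unlike grouping with a frequency-based grouper, buckets that contain no events at all do not appear in counts, because only observed bucket values become groups.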


Romain answered Oct 28 '22 13:10