I have a pandas data frame like this:
timestamp status
2019-01-01 09:00:00 FAILED
2019-01-01 09:00:00 FAILED
2019-01-01 09:00:00 UNKNOWN
2019-01-01 09:00:00 PASSED
2019-01-01 09:00:00 PASSED
2019-01-01 09:01:00 PASSED
2019-01-01 09:01:00 FAILED
How can I group the data per minute and count the number of each status per minute to get this data frame:
timestamp PASSED FAILED UNKNOWN
2019-01-01 09:00:00 2 2 1
2019-01-01 09:01:00 1 1 0
method 1:
pd.crosstab(df['timestamp'],df['status'])
status FAILED PASSED UNKNOWN
timestamp
2019-01-01-09:00:00 2 2 1
2019-01-01-09:01:00 1 1 0
If you want timestamp like a column :
pd.crosstab(df['timestamp'],df['status'],colnames=[None]).reset_index()
timestamp FAILED PASSED UNKNOWN
0 2019-01-01-09:00:00 2 2 1
1 2019-01-01-09:01:00 1 1 0
method 2:
df.groupby(['timestamp','status']).size().unstack(fill_value=0)
time comparision:
it seems that method 2 is the fastest.
%%timeit
new_df=pd.crosstab(df['timestamp'],df['status'])
21 ms ± 759 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
new_df=df.groupby(['timestamp','status']).size().unstack(fill_value=0)
4.65 ms ± 290 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df2 = (
df
.groupby(df['timestamp'].map(lambda x: x.replace(second=0)))['status']
.value_counts()
.unstack()
.fillna(0)
.astype(int)
.reset_index()
)
8.5 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In the event that the timestamps have seconds, you can first remove them to group on whole minutes.
df2 = (
df
.groupby(df['timestamp'].map(lambda x: x.replace(second=0)))['status']
.value_counts()
.unstack(fill_value=0)
.reset_index()
)
>>> df2
status timestamp FAILED PASSED UNKNOWN
0 2019-01-01 09:00:00 2 2 1
1 2019-01-01 09:01:00 1 1 0
You may also wish to fill in every minute in the range. Same code as above, but don't reset the index at the end. Then:
df2 = df2.reindex(pd.date_range(df2.index[0], df2.index[-1], freq='1min'), fill_value=0)
Timings
Timings will certainly vary based on the datasets (small vs large, heterogeneous data vs. homogenous, etc.). Given that the dataset is basically a log, one would expect a lot of data with high variation in the timestamp. To create a more suitable test data, lets make the sample dataframe 100k times larger and then make the timestamps unique (one each minute).
df_ = pd.concat([df] * 100000)
df_['timestamp'] = pd.date_range(df_.timestamp.iat[0], periods=len(df_), freq='1min')
And here are the new timings:
%timeit pd.crosstab(df_['timestamp'],df['status'])
# 4.27 s ± 150 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_.groupby(['timestamp','status']).size().unstack(fill_value=0)
# 567 ms ± 34.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(
df_
.groupby(['timestamp', 'status'])
.size()
.unstack(fill_value=0)
.reset_index()
)
# 614 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(
df_
.groupby(df['timestamp'].map(lambda x: x.replace(second=0)))['status']
.value_counts()
.unstack(fill_value=0)
.reset_index()
)
# 147 ms ± 6.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With