Pandas: Number of unique days in a timestamp Series

I have a Pandas DataFrame with nearly 3,000,000 rows. One of the columns, called TIMESTAMP, is of type datetime64. The timestamp format is shown below:

2015-03-31 22:56:45.510

My goal is to calculate the number of days over which data were collected. My initial approach was simple:

(df.TIMESTAMP.max() - df.TIMESTAMP.min()).days

However, it occurred to me that this may not always be correct, since there is no guarantee that data was collected every day. Instead, I tried counting the unique days in the timestamp series using map and apply; both take a considerable amount of time for 3,000,000 rows:

%timeit len(df['TIMESTAMP'].map(lambda t: t.date()).unique())
1 loops, best of 3: 41.3 s per loop

%timeit len(df['TIMESTAMP'].apply(lambda t: t.date()).unique())
1 loops, best of 3: 42.3 s per loop

Is there a way to speed up this computation, or an entirely different but better approach?

Thanks!
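For illustration, here is a small made-up example of the discrepancy I am worried about (the column name matches mine, but the values are arbitrary):

import pandas as pd

# Made-up data: only two calendar days actually contain rows,
# but the first and last timestamps are almost five days apart.
df = pd.DataFrame({
    "TIMESTAMP": pd.to_datetime([
        "2015-03-01 00:30:00.000",
        "2015-03-01 22:56:45.510",
        "2015-03-05 23:10:00.000",
    ])
})

span = (df.TIMESTAMP.max() - df.TIMESTAMP.min()).days        # 4 -> calendar span only
days = len(df.TIMESTAMP.map(lambda t: t.date()).unique())    # 2 -> days that actually have data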

asked Nov 10 '15 by marillion

2 Answers

To get the unique dates, first normalize (this sets each timestamp to midnight of its day, and it's fast), then use unique:

In [31]: df["Time"].dt.normalize().unique()
Out[31]:
array(['2014-12-31T16:00:00.000000000-0800',
       '2015-01-01T16:00:00.000000000-0800',
       '2015-01-02T16:00:00.000000000-0800',
       '2015-01-04T16:00:00.000000000-0800',
       '2015-01-05T16:00:00.000000000-0800'], dtype='datetime64[ns]')
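To turn that into a day count, take the length of that array, or (a minimal sketch, assuming your column is named "Time" as in this example) call nunique directly on the normalized Series:

# Number of distinct days with data
n_days = df["Time"].dt.normalize().nunique()
# equivalently: len(df["Time"].dt.normalize().unique())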

Original answer (I misread the question):

To get the per-day counts, you could use normalize and then value_counts:

In [11]: df
Out[11]:
        Time
0 2015-01-01
1 2015-01-02
2 2015-01-03
3 2015-01-03
4 2015-01-05
5 2015-01-06

In [12]: df['Time'].dt.normalize().value_counts()
Out[12]:
2015-01-03    2
2015-01-06    1
2015-01-02    1
2015-01-05    1
2015-01-01    1
Name: Time, dtype: int64

but perhaps the cleaner option is to resample (though I'm not sure if this is less efficient):

In [21]: pd.Series(1, df['Time']).resample("D", how="sum")
Out[21]:
Time
2015-01-01     1
2015-01-02     1
2015-01-03     2
2015-01-04   NaN
2015-01-05     1
2015-01-06     1
Freq: D, dtype: float64
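Side note on versions: the how= keyword was later deprecated and then dropped from resample, so in recent pandas the equivalent spelling is a method call on the resampler (and missing days then show up as 0 rather than NaN with the default sum):

# Equivalent in newer pandas versions (how= is no longer accepted)
pd.Series(1, index=df['Time']).resample("D").sum()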
answered by Andy Hayden

If your index is a DatetimeIndex, I think you can do something like this to get the number of distinct days:

print(df.groupby(df.index.date).ngroups)
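A quick sketch of how that behaves on a toy frame (the index values here are made up):

import pandas as pd

# Toy frame indexed by a DatetimeIndex
idx = pd.to_datetime(["2015-01-01 09:00", "2015-01-01 17:00",
                      "2015-01-03 08:00", "2015-01-06 12:00"])
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=idx)

print(df.groupby(df.index.date).ngroups)   # 3 distinct days
print(df.groupby(df.index.date).size())    # rows per day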
answered by reptilicus