Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extracting the first day of month of a datetime type column in pandas

I have the following dataframe:

user_id    purchase_date    1        2015-01-23 14:05:21   2        2015-02-05 05:07:30   3        2015-02-18 17:08:51   4        2015-03-21 17:07:30   5        2015-03-11 18:32:56   6        2015-03-03 11:02:30 

and purchase_date is a datetime64[ns] column. I need to add a new column df[month] that contains first day of the month of the purchase date:

df['month'] 2015-01-01 2015-02-01 2015-02-01 2015-03-01 2015-03-01 2015-03-01 

I'm looking for something like DATE_FORMAT(purchase_date, "%Y-%m-01") m in SQL. I have tried the following code:

     df['month']=df['purchase_date'].apply(lambda x : x.replace(day=1)) 

It works somehow but returns: 2015-01-01 14:05:21.

like image 465
chessosapiens Avatar asked Jul 25 '17 13:07

chessosapiens


People also ask

How do pandas get first date of the month?

As purchase_date is already in datetime64[ns] format, you can use strftime to format the date to always have the first day of month.

How do you access the components of a date in pandas?

When working with Pandas datetime values, we can use the . dt accessor to access different attributes from a Pandas series. This means that we can extract different parts from a datetime object, such as months, date, and more.


1 Answers

Simpliest and fastest is convert to numpy array by to_numpy and then cast:

df['month'] = df['purchase_date'].to_numpy().astype('datetime64[M]') print (df)    user_id       purchase_date      month 0        1 2015-01-23 14:05:21 2015-01-01 1        2 2015-02-05 05:07:30 2015-02-01 2        3 2015-02-18 17:08:51 2015-02-01 3        4 2015-03-21 17:07:30 2015-03-01 4        5 2015-03-11 18:32:56 2015-03-01 5        6 2015-03-03 11:02:30 2015-03-01 

Another solution with floor and pd.offsets.MonthBegin(1) and add pd.offsets.MonthEnd(0) for correct ouput if first day of month:

df['month'] = (df['purchase_date'].dt.floor('d') +                             pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1)) print (df)    user_id       purchase_date      month 0        1 2015-01-23 14:05:21 2015-01-01 1        2 2015-02-05 05:07:30 2015-02-01 2        3 2015-02-18 17:08:51 2015-02-01 3        4 2015-03-21 17:07:30 2015-03-01 4        5 2015-03-11 18:32:56 2015-03-01 5        6 2015-03-03 11:02:30 2015-03-01 

df['month'] = ((df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1))                          .dt.floor('d')) print (df)    user_id       purchase_date      month 0        1 2015-01-23 14:05:21 2015-01-01 1        2 2015-02-05 05:07:30 2015-02-01 2        3 2015-02-18 17:08:51 2015-02-01 3        4 2015-03-21 17:07:30 2015-03-01 4        5 2015-03-11 18:32:56 2015-03-01 5        6 2015-03-03 11:02:30 2015-03-01 

Last solution is create month period by to_period:

df['month'] = df['purchase_date'].dt.to_period('M') print (df)    user_id       purchase_date   month 0        1 2015-01-23 14:05:21 2015-01 1        2 2015-02-05 05:07:30 2015-02 2        3 2015-02-18 17:08:51 2015-02 3        4 2015-03-21 17:07:30 2015-03 4        5 2015-03-11 18:32:56 2015-03 5        6 2015-03-03 11:02:30 2015-03 

... and then to datetimes by to_timestamp, but it is a bit slowier:

df['month'] = df['purchase_date'].dt.to_period('M').dt.to_timestamp() print (df)    user_id       purchase_date      month 0        1 2015-01-23 14:05:21 2015-01-01 1        2 2015-02-05 05:07:30 2015-02-01 2        3 2015-02-18 17:08:51 2015-02-01 3        4 2015-03-21 17:07:30 2015-03-01 4        5 2015-03-11 18:32:56 2015-03-01 5        6 2015-03-03 11:02:30 2015-03-01 

There are many solutions, so:

Timings (in pandas 1.2.3):

rng = pd.date_range('1980-04-01 15:41:12', periods=100000, freq='20H') df = pd.DataFrame({'purchase_date': rng})   print (df.head())    In [70]: %timeit df['purchase_date'].to_numpy().astype('datetime64[M]') 8.6 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  In [71]: %timeit df['purchase_date'].dt.floor('d') + pd.offsets.MonthEnd(n=0) - pd.offsets.MonthBegin(n=1) 23 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  In [72]: %timeit (df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1)).dt.floor('d') 23.6 ms ± 97.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  In [73]: %timeit df['purchase_date'].dt.to_period('M') 9.25 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  In [74]: %timeit df['purchase_date'].dt.to_period('M').dt.to_timestamp() 17.6 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   In [76]: %timeit df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(normalize=True) 23.1 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  In [77]: %timeit df['purchase_date'].dt.normalize().map(MonthBegin().rollback) 1.66 s ± 7.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 
like image 179
jezrael Avatar answered Sep 22 '22 21:09

jezrael