I have the following dataframe:
user_id purchase_date 1 2015-01-23 14:05:21 2 2015-02-05 05:07:30 3 2015-02-18 17:08:51 4 2015-03-21 17:07:30 5 2015-03-11 18:32:56 6 2015-03-03 11:02:30
and purchase_date
is a datetime64[ns]
column. I need to add a new column df[month]
that contains first day of the month of the purchase date:
df['month'] 2015-01-01 2015-02-01 2015-02-01 2015-03-01 2015-03-01 2015-03-01
I'm looking for something like DATE_FORMAT(purchase_date, "%Y-%m-01") m
in SQL. I have tried the following code:
df['month']=df['purchase_date'].apply(lambda x : x.replace(day=1))
It works somehow but returns: 2015-01-01 14:05:21
.
As purchase_date is already in datetime64[ns] format, you can use strftime to format the date to always have the first day of month.
When working with Pandas datetime values, we can use the . dt accessor to access different attributes from a Pandas series. This means that we can extract different parts from a datetime object, such as months, date, and more.
Simpliest and fastest is convert to numpy array
by to_numpy
and then cast:
df['month'] = df['purchase_date'].to_numpy().astype('datetime64[M]') print (df) user_id purchase_date month 0 1 2015-01-23 14:05:21 2015-01-01 1 2 2015-02-05 05:07:30 2015-02-01 2 3 2015-02-18 17:08:51 2015-02-01 3 4 2015-03-21 17:07:30 2015-03-01 4 5 2015-03-11 18:32:56 2015-03-01 5 6 2015-03-03 11:02:30 2015-03-01
Another solution with floor
and pd.offsets.MonthBegin(1)
and add pd.offsets.MonthEnd(0)
for correct ouput if first day of month:
df['month'] = (df['purchase_date'].dt.floor('d') + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1)) print (df) user_id purchase_date month 0 1 2015-01-23 14:05:21 2015-01-01 1 2 2015-02-05 05:07:30 2015-02-01 2 3 2015-02-18 17:08:51 2015-02-01 3 4 2015-03-21 17:07:30 2015-03-01 4 5 2015-03-11 18:32:56 2015-03-01 5 6 2015-03-03 11:02:30 2015-03-01
df['month'] = ((df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1)) .dt.floor('d')) print (df) user_id purchase_date month 0 1 2015-01-23 14:05:21 2015-01-01 1 2 2015-02-05 05:07:30 2015-02-01 2 3 2015-02-18 17:08:51 2015-02-01 3 4 2015-03-21 17:07:30 2015-03-01 4 5 2015-03-11 18:32:56 2015-03-01 5 6 2015-03-03 11:02:30 2015-03-01
Last solution is create month period
by to_period
:
df['month'] = df['purchase_date'].dt.to_period('M') print (df) user_id purchase_date month 0 1 2015-01-23 14:05:21 2015-01 1 2 2015-02-05 05:07:30 2015-02 2 3 2015-02-18 17:08:51 2015-02 3 4 2015-03-21 17:07:30 2015-03 4 5 2015-03-11 18:32:56 2015-03 5 6 2015-03-03 11:02:30 2015-03
... and then to datetimes
by to_timestamp
, but it is a bit slowier:
df['month'] = df['purchase_date'].dt.to_period('M').dt.to_timestamp() print (df) user_id purchase_date month 0 1 2015-01-23 14:05:21 2015-01-01 1 2 2015-02-05 05:07:30 2015-02-01 2 3 2015-02-18 17:08:51 2015-02-01 3 4 2015-03-21 17:07:30 2015-03-01 4 5 2015-03-11 18:32:56 2015-03-01 5 6 2015-03-03 11:02:30 2015-03-01
There are many solutions, so:
Timings (in pandas 1.2.3):
rng = pd.date_range('1980-04-01 15:41:12', periods=100000, freq='20H') df = pd.DataFrame({'purchase_date': rng}) print (df.head()) In [70]: %timeit df['purchase_date'].to_numpy().astype('datetime64[M]') 8.6 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [71]: %timeit df['purchase_date'].dt.floor('d') + pd.offsets.MonthEnd(n=0) - pd.offsets.MonthBegin(n=1) 23 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) In [72]: %timeit (df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1)).dt.floor('d') 23.6 ms ± 97.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) In [73]: %timeit df['purchase_date'].dt.to_period('M') 9.25 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [74]: %timeit df['purchase_date'].dt.to_period('M').dt.to_timestamp() 17.6 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) In [76]: %timeit df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(normalize=True) 23.1 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) In [77]: %timeit df['purchase_date'].dt.normalize().map(MonthBegin().rollback) 1.66 s ± 7.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With