Pandas: Change day

I have a datetime series and need to change the day to 1 for each entry. I have thought of numerous simple solutions, but none of them works for me. So far, the only thing that actually works is to:

  • Set the series as the index
  • Query month and year from the index
  • Reconstruct a new time series using year, month and 1
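
A minimal sketch of this workaround, assuming a small hypothetical series named s (the sample dates and variable names are illustrative only):

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample data; any datetime64 Series would do.
s = pd.Series(pd.to_datetime(['2014-01-07', '2014-03-13', '2014-06-12']))

# Step 1: use the series as a DatetimeIndex.
idx = pd.DatetimeIndex(s)

# Step 2: query year and month from the index.
# Step 3: rebuild each date with the day fixed to 1.
first_of_month = pd.Series(
    [datetime(int(y), int(m), 1) for y, m in zip(idx.year, idx.month)],
    index=s.index,
)
print(first_of_month)
```

The list comprehension runs in plain Python, so this is likely to be slow on large series.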

It can't really be that complicated, can it? There is a month-start offset, but unfortunately an offset is of no use here. There seems to be no set() function for this, and even less functionality when the series is a column rather than (part of) the index itself.

The only related question was this, but the trick used there is not applicable here.

FooBar asked Mar 05 '15 22:03


2 Answers

The other answer works, but apply calls a Python function once per element, which slows your code down a lot. I was able to get an 8.5x speedup by writing a quick vectorized datetime replace for a series.

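import pandas as pd

def vec_dt_replace(series, year=None, month=None, day=None):
    # Any component left as None is taken from the series itself;
    # pd.to_datetime then assembles the year/month/day columns into datetimes.
    return pd.to_datetime(
        {'year': series.dt.year if year is None else year,
         'month': series.dt.month if month is None else month,
         'day': series.dt.day if day is None else day})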

Apply:

%timeit dtseries.apply(lambda dt: dt.replace(day=1))
# 4.17 s ± 38.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Vectorized:

%timeit vec_dt_replace(dtseries, day=1)
# 491 ms ± 6.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
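
%timeit is an IPython magic, so the two timings above only run in IPython or Jupyter. A rough sketch of the same comparison as a plain script, using the standard timeit module on a smaller hypothetical sample (the name sample, its size, and the number of repetitions are assumptions, and absolute timings will differ from those quoted):

```python
import timeit

import numpy as np
import pandas as pd


def vec_dt_replace(series, year=None, month=None, day=None):
    # Same function as defined earlier in this answer, repeated so the script is self-contained.
    return pd.to_datetime(
        {'year': series.dt.year if year is None else year,
         'month': series.dt.month if month is None else month,
         'day': series.dt.day if day is None else day})


# Hypothetical sample: 10,000 random days in a three-year window starting 2015-01-01.
rng = np.random.default_rng(0)
offsets = pd.to_timedelta(rng.integers(0, 3 * 365, size=10_000), unit='D')
sample = pd.Series(pd.Timestamp('2015-01-01') + offsets)

via_apply = sample.apply(lambda dt: dt.replace(day=1))
via_vec = vec_dt_replace(sample, day=1)
# Both approaches should produce the same dates.
assert (via_apply == via_vec).all()

# Time three runs of each approach and report the average per run.
t_apply = timeit.timeit(lambda: sample.apply(lambda dt: dt.replace(day=1)), number=3)
t_vec = timeit.timeit(lambda: vec_dt_replace(sample, day=1), number=3)
print(f'apply: {t_apply / 3:.4f} s per run, vectorized: {t_vec / 3:.4f} s per run')
```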

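Note that you can hit errors when the requested date does not exist, for example when changing 2012-02-29 to 2013-02-29. By default pd.to_datetime raises in that case. To handle such dates, pass the errors argument of pd.to_datetime inside vec_dt_replace: errors='coerce' turns them into NaT (errors='ignore' is deprecated in recent pandas releases).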
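
A minimal sketch of how that could look, assuming a hypothetical variant named vec_dt_replace_safe that simply forwards an errors parameter to pd.to_datetime (the name and the sample dates are illustrative):

```python
import pandas as pd


def vec_dt_replace_safe(series, year=None, month=None, day=None, errors='raise'):
    # Hypothetical variant of vec_dt_replace: `errors` is passed straight
    # through to pd.to_datetime ('raise' is the pandas default).
    return pd.to_datetime(
        {'year': series.dt.year if year is None else year,
         'month': series.dt.month if month is None else month,
         'day': series.dt.day if day is None else day},
        errors=errors)


dates = pd.Series(pd.to_datetime(['2012-02-29', '2012-03-15']))

# 2013-02-29 does not exist, so with errors='coerce' that entry becomes NaT,
# while 2012-03-15 becomes 2013-03-15.
shifted = vec_dt_replace_safe(dates, year=2013, errors='coerce')
print(shifted)
```

With the default errors='raise', the same call raises a ValueError instead of returning NaT.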

Data generation: generate a series with 1 million random dates:

import pandas as pd
import numpy as np

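# Generate random dates. Modified from: https://stackoverflow.com/a/50668285
def pp(start, end, n):
    # Timestamp.value is nanoseconds since the epoch; convert to whole seconds.
    start_u = start.value // 10 ** 9
    end_u = end.value // 10 ** 9

    # Draw int64 explicitly: on platforms where the default integer is 32-bit,
    # multiplying by 10 ** 9 would overflow and the view as 'M8[ns]' would fail.
    return pd.Series(
        (10 ** 9 * np.random.randint(start_u, end_u, n, dtype=np.int64)).view('M8[ns]'))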

start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
dtseries = pp(start, end, 1000000)
# Remove time component
dtseries = dtseries.dt.normalize()

Kyle Barron answered Sep 25 '22 00:09


You can use .apply and datetime.replace, e.g.:

import pandas as pd
from datetime import datetime

ps = pd.Series([datetime(2014, 1, 7), datetime(2014, 3, 13), datetime(2014, 6, 12)])
new = ps.apply(lambda dt: dt.replace(day=1))

Gives:

0   2014-01-01
1   2014-03-01
2   2014-06-01
dtype: datetime64[ns]

Jon Clements answered Sep 23 '22 00:09