I have a dataset in which the hour is recorded as [0100:2400] instead of [0000:2300].
For example
pd.to_datetime('201704102300', format='%Y%m%d%H%M')
returns
Timestamp('2017-04-10 23:00:00')
But
pd.to_datetime('201704102400', format='%Y%m%d%H%M')
gives me the error:
ValueError: unconverted data remains: 0
How can I fix this problem?
I could adjust the data manually, as mentioned in this SO post, but I would have expected pandas to handle this case already.
UPDATE:
How can I do this in a scalable way for a DataFrame? For example, the data look like this:

Pandas parses with strptime-style format codes, and %H only accepts hours 00 through 23, so if you need something non-standard like hour 24, you get to roll your own.
Code:
import datetime as dt
import pandas as pd
def my_to_datetime(date_str):
    # Expects a 'YYYYMMDDHHMM' string; characters 8-9 hold the hour.
    if date_str[8:10] != '24':
        return pd.to_datetime(date_str, format='%Y%m%d%H%M')
    # Hour 24 means midnight of the next day: parse it as hour 00, then add one day.
    date_str = date_str[0:8] + '00' + date_str[10:]
    return (pd.to_datetime(date_str, format='%Y%m%d%H%M')
            + dt.timedelta(days=1))
print(my_to_datetime('201704102400'))
Results:
2017-04-11 00:00:00
For a Column in a pandas.DataFrame:
df['time'] = df.time.apply(my_to_datetime)
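As background on the ValueError from the question, here is a small sketch using only the standard library's datetime.strptime, which reports the same message for this input. The helper name try_parse is made up for illustration, and the explanation in the comments reflects CPython's behaviour as I understand it. Pandas has its own parsing code, so treat this as an analogy rather than a description of pandas internals.

```python
from datetime import datetime

FMT = '%Y%m%d%H%M'

def try_parse(date_str):
    """Return the parsed datetime, or the error message if parsing fails."""
    try:
        return datetime.strptime(date_str, FMT)
    except ValueError as exc:
        return str(exc)

# Hour 23 is inside the 00-23 range that %H accepts.
print(try_parse('201704102300'))

# Hour 24 is out of range: %H appears to consume only the '2', %M then
# takes '40', and the trailing '0' is left over, which is why the message
# says that unconverted data remains.
print(try_parse('201704102400'))
```

This suggests the error is about the format code's accepted range, not about the data being malformed in some other way.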
Vectorized solution, which uses the pd.to_datetime(DataFrame) method:
Source DF
In [27]: df
Out[27]:
           time
0  201704102400
1  201602282400
2  201704102359
Solution
In [28]: pat = r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})(?P<hour>\d{2})(?P<minute>\d{2})'
In [29]: pd.to_datetime(df['time'].str.extract(pat, expand=True))
Out[29]:
0   2017-04-11 00:00:00
1   2016-02-29 00:00:00
2   2017-04-10 23:59:00
dtype: datetime64[ns]
Explanation:
In [30]: df['time'].str.extract(pat, expand=True)
Out[30]:
   year month day hour minute
0  2017    04  10   24     00
1  2016    02  28   24     00
2  2017    04  10   23     59
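pat is the regex pattern passed to Series.str.extract(). Its named groups become the columns year, month, day, hour and minute, which pd.to_datetime() then assembles into timestamps; as the output above shows, an hour of 24 rolls over to midnight of the next day.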
UPDATE: Timing
In [37]: df = pd.concat([df] * 10**4, ignore_index=True)
In [38]: df.shape
Out[38]: (30000, 1)
In [39]: %timeit df.time.apply(my_to_datetime)
1 loop, best of 3: 4.1 s per loop
In [40]: %timeit pd.to_datetime(df['time'].str.extract(pat, expand=True))
1 loop, best of 3: 475 ms per loop
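Another vectorized option, sketched here under the assumption that every value is a 12-digit YYYYMMDDHHMM string: parse only the date part, then add the hour and minute as timedeltas, so an hour of 24 simply lands on the next day. The function name parse_hour_1_to_24 is hypothetical, and I have not timed it against the two approaches above.

```python
import pandas as pd

def parse_hour_1_to_24(times):
    """Parse 'YYYYMMDDHHMM' strings whose hour field may be 24."""
    s = times.astype(str)
    # Parse the date part only, so the out-of-range hour never reaches %H.
    dates = pd.to_datetime(s.str[:8], format='%Y%m%d')
    # Add hour and minute as offsets; 24 hours rolls over to the next day.
    hours = pd.to_timedelta(s.str[8:10].astype(int), unit='h')
    minutes = pd.to_timedelta(s.str[10:12].astype(int), unit='min')
    return dates + hours + minutes

df = pd.DataFrame({'time': ['201704102400', '201602282400', '201704102359']})
df['parsed'] = parse_hour_1_to_24(df['time'])
print(df)
```

Because the hour is added as an offset, month and leap-year boundaries (such as 2016-02-28 24:00) are handled by ordinary datetime arithmetic.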