
Pandas: parsing 24:00 instead of 00:00

I have a dataset in which the hour is recorded in the range [0100:2400] instead of [0000:2300].

For example

pd.to_datetime('201704102300', format='%Y%m%d%H%M')

returns

Timestamp('2017-04-10 23:00:00')

But

pd.to_datetime('201704102400', format='%Y%m%d%H%M')

gives me the error:

ValueError: unconverted data remains: 0

How can I fix this problem?

I could adjust the data manually, as mentioned in this SO post, but shouldn't pandas already handle this case?

UPDATE:

And how can I do this in a scalable way for a DataFrame? For example, the data look like this: [image]

cqcn1991 asked Apr 12 '17 02:04



2 Answers

Pandas follows the standard strptime format directives, in which %H only accepts hours 00 through 23, so if you need something non-standard, you get to roll your own.
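
As a quick check of that limitation, here is a minimal sketch using only the standard library's datetime.strptime rather than pandas itself. The helper name parses is hypothetical. In CPython, %H cannot match 24, so it appears to consume only the "2", %M takes "40", and a trailing "0" is left over, which would explain the error message in the question.

```python
from datetime import datetime


def parses(text, fmt="%Y%m%d%H%M"):
    """Return True if strptime accepts `text` under `fmt`, else False."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        # Raised for hour 24, because %H only covers 00-23.
        return False


print(parses("201704102300"))  # hour 23 is within the %H range
print(parses("201704102400"))  # hour 24 is outside the %H range
```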

Code:

import pandas as pd
import datetime as dt

def my_to_datetime(date_str):
    # Hours 00-23 parse normally.
    if date_str[8:10] != '24':
        return pd.to_datetime(date_str, format='%Y%m%d%H%M')

    # Hour 24 means midnight of the next day: parse it as 00 and add one day.
    date_str = date_str[0:8] + '00' + date_str[10:]
    return pd.to_datetime(date_str, format='%Y%m%d%H%M') + \
           dt.timedelta(days=1)

print(my_to_datetime('201704102400'))

Results:

2017-04-11 00:00:00

For a column in a pandas.DataFrame:

df['time'] = df.time.apply(my_to_datetime)
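
The same rule (treat hour 24 as 00 and add one day) can also be applied to the whole column at once instead of row by row. This is a minimal sketch, assuming every value is a 12-character string in YYYYmmddHHMM form. The function name to_datetime_24h is hypothetical and not part of pandas.

```python
import pandas as pd


def to_datetime_24h(times: pd.Series) -> pd.Series:
    """Parse YYYYmmddHHMM strings, treating hour 24 as 00:00 of the next day."""
    # Flag the rows whose hour field is '24'.
    is_24 = times.str[8:10] == "24"
    # Rewrite the hour of those rows to '00'; leave the other rows untouched.
    fixed = times.where(~is_24, times.str[:8] + "00" + times.str[10:])
    parsed = pd.to_datetime(fixed, format="%Y%m%d%H%M")
    # Add one day to the flagged rows and zero days to the rest.
    return parsed + pd.to_timedelta(is_24.astype(int), unit="D")


df = pd.DataFrame({"time": ["201704102400", "201602282400", "201704102359"]})
df["time"] = to_datetime_24h(df["time"])
print(df)
```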
Stephen Rauch answered Oct 17 '22 22:10


A vectorized solution, which uses pd.to_datetime() on a DataFrame of date components:

Source DF

In [27]: df
Out[27]:
           time
0  201704102400
1  201602282400
2  201704102359

Solution

In [28]: pat = r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})(?P<hour>\d{2})(?P<minute>\d{2})'

In [29]: pd.to_datetime(df['time'].str.extract(pat, expand=True))
Out[29]:
0   2017-04-11 00:00:00
1   2016-02-29 00:00:00
2   2017-04-10 23:59:00
dtype: datetime64[ns]

Explanation:

In [30]: df['time'].str.extract(pat, expand=True)
Out[30]:
   year month day hour minute
0  2017    04  10   24     00
1  2016    02  28   24     00
2  2017    04  10   23     59

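pat is the regex pattern passed to Series.str.extract(). Its named groups become the columns year, month, day, hour and minute, from which pd.to_datetime() assembles the timestamps. An hour of 24 rolls over to 00:00 of the next day, as the output above shows.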
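
The rollover can also be made explicit without a regex. The sketch below is a variation on the approach above, not the method used in this answer, and it assumes fixed-width YYYYmmddHHMM strings. It parses only the date part, then adds the hour and minute fields as timedeltas, so 24 hours simply lands on the next day. The function name parse_with_offsets is hypothetical.

```python
import pandas as pd


def parse_with_offsets(times: pd.Series) -> pd.Series:
    """Parse the date part, then add the hour and minute fields as offsets."""
    days = pd.to_datetime(times.str[:8], format="%Y%m%d")
    hours = pd.to_timedelta(times.str[8:10].astype(int), unit="h")
    minutes = pd.to_timedelta(times.str[10:12].astype(int), unit="min")
    # An offset of 24 hours carries over into the following day.
    return days + hours + minutes


sample = pd.Series(["201704102400", "201602282400", "201704102359"])
print(parse_with_offsets(sample))
```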

UPDATE: Timing

In [37]: df = pd.concat([df] * 10**4, ignore_index=True)

In [38]: df.shape
Out[38]: (30000, 1)

In [39]: %timeit df.time.apply(my_to_datetime)
1 loop, best of 3: 4.1 s per loop

In [40]: %timeit pd.to_datetime(df['time'].str.extract(pat, expand=True))
1 loop, best of 3: 475 ms per loop
MaxU - stop WAR against UA answered Oct 17 '22 20:10