
Infer which columns are datetime

Tags: python, pandas

I have a huge dataframe with many columns, many of which hold datetime.datetime values. The problem is that many of these columns have mixed types, for instance datetime.datetime values alongside None (and potentially other invalid values):

0         2017-07-06 00:00:00
1         2018-02-27 21:30:05
2         2017-04-12 00:00:00
3         2017-05-21 22:05:00
4         2018-01-22 00:00:00
                 ...         
352867    2019-10-04 00:00:00
352868                   None
352869            some_string
Name: colx, Length: 352872, dtype: object

This results in an object-dtype column. It can be solved with df.colx.fillna(pd.NaT), but the dataframe has too many columns to find and fix them one by one.

Another approach is pd.to_datetime(col, errors='coerce'); however, this also casts to datetime many columns that contain numerical values.

I could also do df.fillna(float('nan'), inplace=True), but the columns containing dates would still be of object type, so the problem would remain.

What approach could I follow to cast to datetime only those columns that really do contain datetime values, even if they also contain None and potentially some invalid values? (I mention the invalid values because otherwise a pd.to_datetime in a try/except clause would do.) I am looking for something like a flexible version of pd.to_datetime(col).

asked Oct 28 '19 15:10 by yatu

1 Answer

The main problem I see is with parsing numerical values.

I'd propose converting everything to strings first.
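
To see why numeric columns are the problem, here is a minimal sketch (the sample values are made up for illustration, not taken from the question's data). pd.to_datetime reads integers as offsets from the Unix epoch, in nanoseconds by default, so errors='coerce' has nothing to coerce and the column silently becomes datetime:

```python
import pandas as pd

# Hypothetical numeric column, used only to illustrate the behaviour.
nums = pd.Series([1, 2, 3])

# Integers are treated as nanoseconds since 1970-01-01, so every value "parses".
as_dates = pd.to_datetime(nums, errors="coerce")

# No value was turned into NaT, so the column looks like valid datetimes.
print(as_dates.notna().all())

# The first value is the timestamp one nanosecond after the epoch.
print(as_dates.iloc[0] == pd.Timestamp(1))
```

Casting to str first sidesteps this epoch interpretation, because the values are then parsed as date text instead of as numbers.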


Setup

import pandas as pd

dat = {
    'index': [0, 1, 2, 3, 4, 352867, 352868, 352869],
    'columns': ['Mixed', 'Numeric Values', 'Strings'],
    'data': [
        ['2017-07-06 00:00:00', 1, 'HI'],
        ['2018-02-27 21:30:05', 1, 'HI'],
        ['2017-04-12 00:00:00', 1, 'HI'],
        ['2017-05-21 22:05:00', 1, 'HI'],
        ['2018-01-22 00:00:00', 1, 'HI'],
        ['2019-10-04 00:00:00', 1, 'HI'],
        ['None', 1, 'HI'],
        ['some_string', 1, 'HI']
    ]
}

df = pd.DataFrame(**dat)

df

                      Mixed  Numeric Values Strings
0       2017-07-06 00:00:00               1      HI
1       2018-02-27 21:30:05               1      HI
2       2017-04-12 00:00:00               1      HI
3       2017-05-21 22:05:00               1      HI
4       2018-01-22 00:00:00               1      HI
352867  2019-10-04 00:00:00               1      HI
352868                 None               1      HI
352869          some_string               1      HI

Solution

df.astype(str).apply(pd.to_datetime, errors='coerce')

                     Mixed Numeric Values Strings
0      2017-07-06 00:00:00            NaT     NaT
1      2018-02-27 21:30:05            NaT     NaT
2      2017-04-12 00:00:00            NaT     NaT
3      2017-05-21 22:05:00            NaT     NaT
4      2018-01-22 00:00:00            NaT     NaT
352867 2019-10-04 00:00:00            NaT     NaT
352868                 NaT            NaT     NaT
352869                 NaT            NaT     NaT
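
Note that this one-liner converts every column, so 'Numeric Values' and 'Strings' end up entirely NaT. If only the columns that really hold dates should be converted, one possible extension (not part of the original answer) is to parse each non-numeric column and keep the result only when enough values survive. The helper name convert_datetime_columns and the min_valid threshold below are assumptions made for this sketch:

```python
import warnings

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def convert_datetime_columns(df, min_valid=0.5):
    """Return a copy of df where mostly-parseable columns become datetime.

    min_valid is a hypothetical threshold: the share of values that must
    parse as dates for a column to be converted.
    """
    out = df.copy()
    for col in out.columns:
        # Leave numeric columns alone and skip columns that are already datetime.
        if is_numeric_dtype(out[col]) or is_datetime64_any_dtype(out[col]):
            continue
        with warnings.catch_warnings():
            # Silence format-inference warnings raised for non-date text.
            warnings.simplefilter("ignore")
            parsed = pd.to_datetime(out[col].astype(str), errors="coerce")
        # Convert only if a large enough share of the values parsed.
        if parsed.notna().mean() >= min_valid:
            out[col] = parsed
    return out


# Small sample in the spirit of the setup above.
sample = pd.DataFrame({
    "Mixed": [
        "2017-07-06 00:00:00",
        "2018-02-27 21:30:05",
        "2017-04-12 00:00:00",
        "None",
        "some_string",
    ],
    "Numeric Values": [1, 1, 1, 1, 1],
    "Strings": ["HI", "HI", "HI", "HI", "HI"],
})

result = convert_datetime_columns(sample)
print(result.dtypes)
```

The 0.5 threshold is arbitrary; tune it to the share of invalid values you are willing to tolerate in a date column.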

answered Oct 03 '22 00:10 by piRSquared