I have a huge dataframe with many columns, many of which are of type datetime.datetime. The problem is that many also have mixed types, including for instance datetime.datetime values and None values (and potentially other invalid values):
0 2017-07-06 00:00:00
1 2018-02-27 21:30:05
2 2017-04-12 00:00:00
3 2017-05-21 22:05:00
4 2018-01-22 00:00:00
...
352867 2019-10-04 00:00:00
352868 None
352869 some_string
Name: colx, Length: 352872, dtype: object
This results in an object type column. For a single column this can be solved with df.colx.fillna(pd.NaT). The problem is that the dataframe is too big to inspect individual columns.
Another approach is to use pd.to_datetime(col, errors='coerce'); however, this will also cast to datetime many columns that contain numerical values.
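To illustrate the concern about numerical columns, here is a minimal sketch, assuming a small made-up Series named numeric. Integers are interpreted as offsets from the Unix epoch (nanoseconds by default), so nothing is coerced to NaT and the column silently becomes a datetime column.

```python
import pandas as pd

# A purely numerical column (made-up data for illustration).
numeric = pd.Series([1, 2, 3])

# errors="coerce" does not protect numeric data: integers are read as
# nanoseconds since 1970-01-01, so every value parses "successfully".
converted = pd.to_datetime(numeric, errors="coerce")

print(converted.dtype)
print(converted.notna().all())
```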
I could also do df.fillna(float('nan'), inplace=True), though the columns containing dates would still be of object type, and I would still have the same problem.
What approach could I follow to cast to datetime those columns whose values really do contain datetime values, but could also contain None and potentially some invalid values (I mention this since otherwise a pd.to_datetime in a try/except clause would do)? Something like a flexible version of pd.to_datetime(col).
You should add parse_dates=True, or parse_dates=['column name'], when reading the data; that's usually enough to magically parse it.
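A minimal sketch of the parse_dates suggestion, assuming the data is read with pd.read_csv and using a small in-memory CSV (csv_text and the value column are made up for illustration). Note that, per the pandas documentation, a column that cannot be fully parsed is returned unaltered as object, so this alone may not handle the invalid values from the question.

```python
import io

import pandas as pd

# Made-up CSV content standing in for the real file.
csv_text = (
    "colx,value\n"
    "2017-07-06 00:00:00,1\n"
    "2018-02-27 21:30:05,2\n"
)

# Ask read_csv to parse the named column as datetimes while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["colx"])

print(df.dtypes)
```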
Timestamp is the pandas equivalent of Python's datetime.datetime and is interchangeable with it in most cases. It's the type used for the entries that make up a DatetimeIndex and other time-series-oriented data structures in pandas.
Use the Series.dt.date attribute to return the date part of the underlying data of the given Series object.
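The two notes above can be sketched as follows, using made-up values: a pd.Timestamp is also a datetime.datetime instance, and Series.dt.date yields plain datetime.date objects.

```python
import datetime

import pandas as pd

# A Timestamp can be used wherever a datetime.datetime is expected.
ts = pd.Timestamp("2018-02-27 21:30:05")
is_datetime = isinstance(ts, datetime.datetime)

# .dt.date drops the time part and returns datetime.date objects.
dates = pd.Series(pd.to_datetime(["2017-07-06 00:00:00", "2018-02-27 21:30:05"]))
only_dates = dates.dt.date

print(is_datetime)
print(only_dates.tolist())
```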
If utc=True is passed to pd.to_datetime, the function always returns a timezone-aware, UTC-localized Timestamp, Series or DatetimeIndex. To do this, timezone-naive inputs are localized as UTC, while timezone-aware inputs are converted to UTC. If utc=False (the default), inputs will not be coerced to UTC.
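A short sketch of the utc behaviour described above, with made-up input strings: a naive input is labelled as UTC without shifting the clock time, while an input carrying an offset is converted to UTC.

```python
import pandas as pd

# Timezone-naive input: localized as UTC, clock time unchanged.
naive = pd.to_datetime("2018-02-27 21:30:05", utc=True)

# Timezone-aware input (+02:00): converted to UTC, so the hour shifts.
aware = pd.to_datetime("2018-02-27 21:30:05+02:00", utc=True)

print(naive)
print(aware)
```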
The main problem I see is when parsing numerical values. I'd propose converting them to strings first:
import pandas as pd

# Example frame: one mixed date column, one numeric column, one string column.
dat = {
    'index': [0, 1, 2, 3, 4, 352867, 352868, 352869],
    'columns': ['Mixed', 'Numeric Values', 'Strings'],
    'data': [
        ['2017-07-06 00:00:00', 1, 'HI'],
        ['2018-02-27 21:30:05', 1, 'HI'],
        ['2017-04-12 00:00:00', 1, 'HI'],
        ['2017-05-21 22:05:00', 1, 'HI'],
        ['2018-01-22 00:00:00', 1, 'HI'],
        ['2019-10-04 00:00:00', 1, 'HI'],
        ['None', 1, 'HI'],
        ['some_string', 1, 'HI']
    ]
}
df = pd.DataFrame(**dat)
df
Mixed Numeric Values Strings
0 2017-07-06 00:00:00 1 HI
1 2018-02-27 21:30:05 1 HI
2 2017-04-12 00:00:00 1 HI
3 2017-05-21 22:05:00 1 HI
4 2018-01-22 00:00:00 1 HI
352867 2019-10-04 00:00:00 1 HI
352868 None 1 HI
352869 some_string 1 HI
df.astype(str).apply(pd.to_datetime, errors='coerce')
Mixed Numeric Values Strings
0 2017-07-06 00:00:00 NaT NaT
1 2018-02-27 21:30:05 NaT NaT
2 2017-04-12 00:00:00 NaT NaT
3 2017-05-21 22:05:00 NaT NaT
4 2018-01-22 00:00:00 NaT NaT
352867 2019-10-04 00:00:00 NaT NaT
352868 NaT NaT NaT
352869 NaT NaT NaT
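Since converting every column this way turns the non-date columns entirely into NaT, one possible refinement is to convert only the columns where enough values actually parse. The following is a minimal sketch under stated assumptions: coerce_datetime_columns and min_valid_fraction are hypothetical names introduced here (not pandas API), the 0.5 threshold is arbitrary, and numeric columns are skipped outright rather than stringified. Depending on the pandas version, parsing mixed strings may emit a format-inference warning.

```python
import pandas as pd


def coerce_datetime_columns(df, min_valid_fraction=0.5):
    """Return a copy in which non-numeric columns that mostly parse as
    dates are converted to datetime; other columns are left untouched.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        # Leave numeric (including bool) and already-datetime columns alone.
        if pd.api.types.is_numeric_dtype(col) or pd.api.types.is_datetime64_any_dtype(col):
            continue
        n_non_null = col.notna().sum()
        if n_non_null == 0:
            continue
        # Stringify first, then coerce anything unparseable to NaT.
        parsed = pd.to_datetime(col.astype(str), errors="coerce")
        # Convert only if enough of the non-null values parsed as dates.
        if parsed.notna().sum() / n_non_null >= min_valid_fraction:
            out[name] = parsed
    return out


example = pd.DataFrame({
    "Mixed": ["2017-07-06 00:00:00", "2018-02-27 21:30:05",
              "2017-04-12 00:00:00", None, "some_string"],
    "Numeric Values": [1, 1, 1, 1, 1],
    "Strings": ["HI", "HI", "HI", "HI", "HI"],
})
result = coerce_datetime_columns(example)
print(result.dtypes)
```

With this approach the threshold decides how many invalid entries a column may contain before it is treated as a non-date column, so it would need tuning against the real data.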