Infer which columns are datetime

I have a huge dataframe with many columns, many of which are of type datetime.datetime. The problem is that many also have mixed types, including for instance datetime.datetime values and None values (and potentially other invalid values):

0         2017-07-06 00:00:00
1         2018-02-27 21:30:05
2         2017-04-12 00:00:00
3         2017-05-21 22:05:00
4         2018-01-22 00:00:00
                 ...         
352867    2019-10-04 00:00:00
352868                   None
352869            some_string
Name: colx, Length: 352872, dtype: object

This results in an object-dtype column. For a single column this can be solved with df.colx.fillna(pd.NaT), but the dataframe is too big to inspect columns individually.

Another approach is to use pd.to_datetime(col, errors='coerce'). However, this would also cast to datetime many columns that contain numerical values.

I could also do df.fillna(float('nan'), inplace=True), but the columns containing dates would still be of object type, so the problem would remain.

What approach could I follow to cast to datetime only those columns whose values really are datetimes, but which may also contain None and potentially some invalid values? (I mention the invalid values because otherwise pd.to_datetime in a try/except clause would do.) I am looking for something like a flexible version of pd.to_datetime(col).


asked Oct 28 '19 15:10

yatu

1 Answer

The main problem I see is when parsing numerical values.

I'd propose converting them to strings first.
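To illustrate why the string conversion helps, here is a small sketch. The series `nums` is made up for this example. Integers passed directly to pd.to_datetime are interpreted as offsets from the Unix epoch, so errors='coerce' never rejects them, whereas the same values as strings should not be recognised as dates.

```python
import warnings

import pandas as pd

# Hypothetical numeric column, made up for illustration.
nums = pd.Series([1, 2, 3])

# Integers are read as offsets from the Unix epoch (nanoseconds by default),
# so every value "parses" and the column silently becomes datetime.
as_numbers = pd.to_datetime(nums, errors='coerce')

# As strings, the same small numbers do not look like dates and are coerced to NaT.
with warnings.catch_warnings():
    # Newer pandas versions may warn that no date format could be inferred.
    warnings.simplefilter('ignore')
    as_strings = pd.to_datetime(nums.astype(str), errors='coerce')

print(as_numbers.notna().all())  # every integer was accepted as a timestamp
print(as_strings.isna().all())   # every string was rejected
```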

Setup

import pandas as pd

dat = {
    'index': [0, 1, 2, 3, 4, 352867, 352868, 352869],
    'columns': ['Mixed', 'Numeric Values', 'Strings'],
    'data': [
        ['2017-07-06 00:00:00', 1, 'HI'],
        ['2018-02-27 21:30:05', 1, 'HI'],
        ['2017-04-12 00:00:00', 1, 'HI'],
        ['2017-05-21 22:05:00', 1, 'HI'],
        ['2018-01-22 00:00:00', 1, 'HI'],
        ['2019-10-04 00:00:00', 1, 'HI'],
        ['None', 1, 'HI'],
        ['some_string', 1, 'HI']
    ]
}

df = pd.DataFrame(**dat)

df

                      Mixed  Numeric Values Strings
0       2017-07-06 00:00:00               1      HI
1       2018-02-27 21:30:05               1      HI
2       2017-04-12 00:00:00               1      HI
3       2017-05-21 22:05:00               1      HI
4       2018-01-22 00:00:00               1      HI
352867  2019-10-04 00:00:00               1      HI
352868                 None               1      HI
352869          some_string               1      HI

Solution

df.astype(str).apply(pd.to_datetime, errors='coerce')

                     Mixed Numeric Values Strings
0      2017-07-06 00:00:00            NaT     NaT
1      2018-02-27 21:30:05            NaT     NaT
2      2017-04-12 00:00:00            NaT     NaT
3      2017-05-21 22:05:00            NaT     NaT
4      2018-01-22 00:00:00            NaT     NaT
352867 2019-10-04 00:00:00            NaT     NaT
352868                 NaT            NaT     NaT
352869                 NaT            NaT     NaT
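
Note that the one-liner above converts every column, so the non-date columns end up as all NaT. If the goal is to convert only the columns that look like dates, one possible extension (not part of the original answer) is to parse each column and keep the result only when enough values survive. In this sketch, the helper name `convert_datetime_columns` and the `min_valid_fraction` threshold are assumptions chosen for illustration, and numeric columns are skipped outright rather than parsed.

```python
import warnings

import pandas as pd


def convert_datetime_columns(df, min_valid_fraction=0.5):
    """Return a copy of df with date-like columns cast to datetime.

    Hypothetical helper: a column is converted only if at least
    `min_valid_fraction` of its values parse as datetimes.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        # Assumption: numeric and already-datetime columns are left untouched.
        if (pd.api.types.is_numeric_dtype(col)
                or pd.api.types.is_datetime64_any_dtype(col)):
            continue
        with warnings.catch_warnings():
            # Newer pandas versions may warn when no format can be inferred.
            warnings.simplefilter('ignore')
            parsed = pd.to_datetime(col.astype(str), errors='coerce')
        # Keep the parsed column only if enough values are valid datetimes.
        if parsed.notna().mean() >= min_valid_fraction:
            out[name] = parsed
    return out


# Small made-up frame mirroring the setup above.
df = pd.DataFrame({
    'Mixed': ['2017-07-06 00:00:00', '2018-02-27 21:30:05',
              '2017-04-12 00:00:00', None, 'some_string'],
    'Numeric Values': [1, 1, 1, 1, 1],
    'Strings': ['HI', 'HI', 'HI', 'HI', 'HI'],
})
result = convert_datetime_columns(df)
print(result.dtypes)
```

The threshold is a judgment call: a higher value avoids converting free-text columns that happen to contain a few date-like strings, while a lower value tolerates more invalid entries in genuine date columns.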

answered Oct 03 '22 00:10

piRSquared

Tags: python, pandas