Pandas - automatically detect date columns **at run time**

I was wondering if pandas is capable of automatically detecting which columns are datetime objects and reading those columns in as dates instead of strings.

I am looking at the API and related Stack Overflow posts, but I can't seem to figure it out.

This is a black-box system that takes in arbitrary CSV schemas in production, so I do not know what the column names will be.

This seems like it would work, but you have to know which columns are date fields:

import pandas as pd

#creating the test data
df = pd.DataFrame({'0': ['a', 'b', 'c'], '1': ['2015-12-27','2015-12-28', '2015-12-29'], '2': [11,12,13]})
df.to_csv('test.csv', index=False)

#loading the test data
df = pd.read_csv('test.csv', parse_dates=True)
print(df.dtypes)
# prints (object, object, int64) instead of (object, datetime64, int64)

I am thinking that if it cannot do this, then I can write something that:

  1. Finds columns with string type.
  2. Grabs a few unique values and tries to parse them.
  3. If successful, tries to parse the whole column.

Edit: I wrote a simple method, convertDateColumns, that does this:

import pandas as pd
from dateutil import parser

def convertDateColumns(df):
    object_cols = df.columns.values[df.dtypes.values == 'object']
    date_cols = [c for c in object_cols if testIfColumnIsDate(df[c], num_tries=3)]

    for col in date_cols:
        try:
            df[col] = pd.to_datetime(df[col], errors='coerce')
        except ValueError:
            pass

    return df

def testIfColumnIsDate(series, num_tries=4):
    """ Test if a column contains date values.

    This can try a few times for the scenario where a date column may have
    a couple of null or missing values but we still want to parse when
    possible (and convert those null/missing values to NaT).
    """
    if series.dtype != 'object':
        return False

    vals = set()
    for val in series:
        vals.add(val)
        if len(vals) > num_tries:
            break

    for val in list(vals):
        try:
            # skip non-string values (e.g. NaN) that dateutil cannot parse
            if not isinstance(val, str):
                continue

            parser.parse(val)
            return True
        except ValueError:
            pass

    return False
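As a quick sanity check, the same sampling heuristic can be restated compactly (a sketch, using a made-up looks_like_dates helper name):

```python
import pandas as pd
from dateutil import parser

def looks_like_dates(series, num_tries=4):
    """Sample a few unique non-null values and see if dateutil can parse any."""
    if series.dtype != 'object':
        return False
    for val in list(pd.unique(series.dropna()))[:num_tries]:
        try:
            parser.parse(str(val))
            return True
        except (ValueError, OverflowError):
            pass
    return False

dates = pd.Series(['2015-12-27', None, '2015-12-29'])
text = pd.Series(['apple', 'banana', 'grape'])
print(looks_like_dates(dates))  # True
print(looks_like_dates(text))   # False
```

Sampling only a handful of unique values keeps the check cheap even on large columns.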
asked Oct 18 '15 by anthonybell

People also ask

What does parse_dates do in pandas?

We can use the parse_dates parameter to convince pandas to turn things into real datetime types. parse_dates takes a list of columns (since you could want to parse multiple columns into datetimes).
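A minimal sketch (the CSV content and column names here are made up for illustration):

```python
import io
import pandas as pd

csv_data = io.StringIO("name,when,count\na,2015-12-27,11\nb,2015-12-28,12\n")

# parse_dates takes a list of column names (or positions) to convert
df = pd.read_csv(csv_data, parse_dates=['when'])
print(df.dtypes)
```

Only the listed column is parsed; the others keep their inferred types.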

How do I work with dates and times in pandas?

Pandas has a built-in function called to_datetime() that converts date and time in string format to a DateTime object. As you can see, the 'date' column in the DataFrame is currently of a string-type object. Thus, to_datetime() converts the column to a series of the appropriate datetime64 dtype.
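For instance (a small sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({'date': ['2015-12-27', '2015-12-28', '2015-12-29']})
print(df['date'].dtype)   # object -- plain strings

# convert the string column in place
df['date'] = pd.to_datetime(df['date'])
print(df['date'].dtype)   # datetime64[ns]
```

Once converted, the .dt accessor (e.g. df['date'].dt.day_name()) becomes available.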

How do you check if a date is between two dates in pandas?

You can use the pandas.Series.between() method to select DataFrame rows between two dates.
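A short sketch (the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({'when': pd.to_datetime(['2015-12-27', '2016-01-05', '2016-02-01']),
                   'value': [1, 2, 3]})

# a datetime64 column can be compared against date strings directly
mask = df['when'].between('2016-01-01', '2016-01-31')
print(df[mask])
```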

What does .values in pandas do?

The values property is used to get a Numpy representation of the DataFrame. Only the values in the DataFrame will be returned, the axes labels will be removed. The values of the DataFrame. A DataFrame where all columns are the same type (e.g., int64) results in an array of the same type.
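A small illustration with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

arr = df.values   # NumPy array; index and column labels are dropped
print(type(arr))
print(arr)
```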


2 Answers

I would use pd.to_datetime, and catch exceptions on columns that don't work. For example:

import pandas as pd

df = pd.read_csv('test.csv')

for col in df.columns:
    if df[col].dtype == 'object':
        try:
            df[col] = pd.to_datetime(df[col])
        except ValueError:
            pass

df.dtypes
# (object, datetime64[ns], int64)

I believe this is as close to "automatic" as you can get for this application.
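If a date-like column also contains a few bad or missing values (the scenario the question's edit worries about), passing errors='coerce' to pd.to_datetime keeps the conversion and marks those rows as NaT instead of raising; a small sketch:

```python
import pandas as pd

s = pd.Series(['2015-12-27', 'not a date', None, '2015-12-29'])

# bad/missing entries become NaT instead of failing the whole column
parsed = pd.to_datetime(s, errors='coerce')
print(parsed)
print(parsed.isna().sum())  # 2
```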

answered Sep 20 '22 by jakevdp


You can get rid of the for loop by using the parameter errors='ignore' to avoid modifying unwanted values. In the code below, we apply a to_datetime transformation (ignoring errors) to all object columns; other columns are returned as is. As the docs put it:

"If 'ignore', then invalid parsing will return the input"

df = df.apply(lambda col: pd.to_datetime(col, errors='ignore')
              if col.dtype == object
              else col,
              axis=0)

df.dtypes

# 0            object
# 1    datetime64[ns]
# 2             int64
answered Sep 21 '22 by Romain