Pandas - automatically detect date columns **at run time**

I was wondering if pandas is capable of automatically detecting which columns are datetime objects and reading those columns in as dates instead of strings.

I am looking at the API and related Stack Overflow posts, but I can't seem to figure it out.

This is a black-box system that takes in arbitrary CSV schemas in production, so I do not know what the column names will be.

This seems like it would work, but you have to know which columns are date fields:

import pandas as pd

#creating the test data
df = pd.DataFrame({'0': ['a', 'b', 'c'], '1': ['2015-12-27','2015-12-28', '2015-12-29'], '2': [11,12,13]})
df.to_csv('test.csv', index=False)

#loading the test data
df = pd.read_csv('test.csv', parse_dates=True)
print(df.dtypes)
# prints (object, object, int64) instead of (object, datetime64, int64)

I am thinking that if it cannot do this, then I can write something that:

  1. Finds columns with string type.
  2. Grabs a few unique values and tries to parse them.
  3. If successful, tries to parse the whole column.

Edit: I wrote a simple method, convertDateColumns, that does this:

import pandas as pd
from dateutil import parser

def convertDateColumns(df):
    object_cols = df.columns.values[df.dtypes.values == 'object']
    date_cols = [c for c in object_cols if testIfColumnIsDate(df[c], num_tries=3)]

    for col in date_cols:
        try:
            df[col] = pd.to_datetime(df[col], errors='coerce')
        except ValueError:
            pass

    return df

def testIfColumnIsDate(series, num_tries=4):
    """ Test if a column contains date values.

    This can try a few times for the scenario where a date column may have
    a couple of null or missing values but we still want to parse when
    possible (and convert those null/missing values to NaT).
    """
    if series.dtype != 'object':
        return False

    vals = set()
    for val in series:
        vals.add(val)
        if len(vals) > num_tries:
            break

    for val in list(vals):
        try:
            # skip non-string values (e.g. NaN) that dateutil cannot parse
            if not isinstance(val, str):
                continue

            parser.parse(val)
            return True
        except ValueError:
            pass

    return False
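As a quick sanity check, the same sampling heuristic can be restated compactly (a sketch, using a made-up looks_like_dates helper name):

```python
import pandas as pd
from dateutil import parser

def looks_like_dates(series, num_tries=4):
    """Sample a few unique non-null values and see if dateutil can parse any."""
    if series.dtype != 'object':
        return False
    for val in list(pd.unique(series.dropna()))[:num_tries]:
        try:
            parser.parse(str(val))
            return True
        except (ValueError, OverflowError):
            pass
    return False

dates = pd.Series(['2015-12-27', None, '2015-12-29'])
text = pd.Series(['apple', 'banana', 'grape'])
print(looks_like_dates(dates))  # True
print(looks_like_dates(text))   # False
```

Sampling only a handful of unique values keeps the check cheap even on large columns.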
asked Oct 18 '15 by anthonybell

People also ask

What does parse_dates do in pandas?

We can use the parse_dates parameter to convince pandas to turn things into real datetime types. parse_dates takes a list of columns (since you could want to parse multiple columns into datetimes).
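A minimal sketch (the CSV content and column names here are made up for illustration):

```python
import io
import pandas as pd

csv_data = io.StringIO("name,when,count\na,2015-12-27,11\nb,2015-12-28,12\n")

# parse_dates takes a list of column names (or positions) to convert
df = pd.read_csv(csv_data, parse_dates=['when'])
print(df.dtypes)
```

Only the listed column is parsed; the others keep their inferred types.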

How do I work with dates and times in pandas?

Pandas has a built-in function called to_datetime() that converts date and time in string format to a DateTime object. As you can see, the 'date' column in the DataFrame is currently of a string-type object. Thus, to_datetime() converts the column to a series of the appropriate datetime64 dtype.
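For instance (a small sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({'date': ['2015-12-27', '2015-12-28', '2015-12-29']})
print(df['date'].dtype)   # object -- plain strings

# convert the string column in place
df['date'] = pd.to_datetime(df['date'])
print(df['date'].dtype)   # datetime64[ns]
```

Once converted, the .dt accessor (e.g. df['date'].dt.day_name()) becomes available.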

How do you check if a date is between two dates in pandas?

You can use the pandas.Series.between() method to select DataFrame rows between two dates.
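A short sketch (the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({'when': pd.to_datetime(['2015-12-27', '2016-01-05', '2016-02-01']),
                   'value': [1, 2, 3]})

# a datetime64 column can be compared against date strings directly
mask = df['when'].between('2016-01-01', '2016-01-31')
print(df[mask])
```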

What does .values in pandas do?

The values property is used to get a Numpy representation of the DataFrame. Only the values in the DataFrame will be returned, the axes labels will be removed. The values of the DataFrame. A DataFrame where all columns are the same type (e.g., int64) results in an array of the same type.
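A small illustration with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

arr = df.values   # NumPy array; index and column labels are dropped
print(type(arr))
print(arr)
```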


2 Answers

I would use pd.to_datetime, and catch exceptions on columns that don't work. For example:

import pandas as pd

df = pd.read_csv('test.csv')

for col in df.columns:
    if df[col].dtype == 'object':
        try:
            df[col] = pd.to_datetime(df[col])
        except ValueError:
            pass

df.dtypes
# (object, datetime64[ns], int64)

I believe this is as close to "automatic" as you can get for this application.
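If a date-like column also contains a few bad or missing values (the scenario the question's edit worries about), passing errors='coerce' to pd.to_datetime keeps the conversion and marks those rows as NaT instead of raising; a small sketch:

```python
import pandas as pd

s = pd.Series(['2015-12-27', 'not a date', None, '2015-12-29'])

# bad/missing entries become NaT instead of failing the whole column
parsed = pd.to_datetime(s, errors='coerce')
print(parsed)
print(parsed.isna().sum())  # 2
```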

answered Sep 20 '22 by jakevdp


You can get rid of the for loop by using the parameter errors='ignore' to avoid modifying unwanted values. In the code below, we apply a to_datetime transformation (ignoring errors) to all object columns; other columns are returned as is. As the docs put it:

"If 'ignore', then invalid parsing will return the input"

df = df.apply(lambda col: pd.to_datetime(col, errors='ignore')
              if col.dtype == object
              else col,
              axis=0)

df.dtypes

# 0            object
# 1    datetime64[ns]
# 2             int64
answered Sep 21 '22 by Romain