Is pandas capable of automatically detecting which columns contain datetime values and reading those columns in as dates instead of strings?
I am looking at the API and related Stack Overflow posts, but I can't seem to figure it out.
This is a black-box system that takes in arbitrary CSV schemas in production, so I do not know what the column names will be.
This seems like it would work, but you have to know which columns are date fields:
import pandas as pd

# creating the test data
df = pd.DataFrame({'0': ['a', 'b', 'c'],
                   '1': ['2015-12-27', '2015-12-28', '2015-12-29'],
                   '2': [11, 12, 13]})
df.to_csv('test.csv', index=False)

# loading the test data
df = pd.read_csv('test.csv', parse_dates=True)
print(df.dtypes)
# prints (object, object, int64) instead of (object, datetime64[ns], int64)
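Note that parse_dates=True only tries to parse the index (per the pandas docs), which is why the dtypes above stay object. Naming the date column does work, though it requires knowing the schema up front:

df = pd.read_csv('test.csv', parse_dates=['1'])
print(df.dtypes)
# 0             object
# 1     datetime64[ns]
# 2              int64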
I am thinking that if it cannot do this, I can write something that:
- Finds columns with string type.
- Grabs a few unique values and tries to parse them.
- If successful, tries to parse the whole column.
Edit: I wrote a simple function, convertDateColumns, that will do this:
import pandas as pd
from dateutil import parser

def convertDateColumns(df):
    object_cols = df.columns.values[df.dtypes.values == 'object']
    date_cols = [c for c in object_cols if testIfColumnIsDate(df[c], num_tries=3)]
    for col in date_cols:
        try:
            # errors='coerce' turns unparseable values into NaT
            df[col] = pd.to_datetime(df[col], errors='coerce',
                                     infer_datetime_format=True)
        except ValueError:
            pass
    return df

def testIfColumnIsDate(series, num_tries=4):
    """Test if a column contains date values.

    This can try a few values for the scenario where a date column may
    have a couple of null or missing values but we still want to parse
    when possible (and convert those null/missing values to NaT).
    """
    if series.dtype != 'object':
        return False
    vals = set()
    for val in series:
        vals.add(val)
        if len(vals) > num_tries:
            break
    for val in list(vals):
        try:
            if type(val) is int:
                continue
            parser.parse(val)
            return True
        except (ValueError, TypeError):
            pass
    return False
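A minimal usage sketch for these helpers, assuming the test.csv created above:

df = pd.read_csv('test.csv')
df = convertDateColumns(df)
print(df.dtypes)
# 0             object
# 1     datetime64[ns]
# 2              int64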
We can use the parse_dates parameter to convince pandas to turn things into real datetime types. parse_dates takes a list of columns (since you could want to parse multiple columns into datetimes). Pandas also has a built-in function, to_datetime(), that converts dates and times in string format to DateTime objects. In the test data above, the '1' column is read in as a string-type object; to_datetime() converts such a column to a series of the appropriate datetime64 dtype.
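A small illustration of to_datetime() on its own, reusing the date strings from the test data above:

s = pd.Series(['2015-12-27', '2015-12-28', '2015-12-29'])
print(pd.to_datetime(s).dtype)
# datetime64[ns]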
I would use pd.to_datetime, and catch exceptions on columns that don't work. For example:
import pandas as pd

df = pd.read_csv('test.csv')

for col in df.columns:
    if df[col].dtype == 'object':
        try:
            df[col] = pd.to_datetime(df[col])
        except ValueError:
            pass

df.dtypes
# (object, datetime64[ns], int64)
I believe this is as close to "automatic" as you can get for this application.
You can get rid of the for loop by using the parameter errors='ignore' to avoid modifying unwanted values. In the code below we apply a to_datetime transformation (ignoring errors) on all object columns; other columns are returned as-is. As the docs put it: if 'ignore', then invalid parsing will return the input.
df = df.apply(lambda col: pd.to_datetime(col, errors='ignore')
              if col.dtypes == object
              else col,
              axis=0)
df.dtypes
# 0             object
# 1     datetime64[ns]
# 2              int64
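One caveat: in recent pandas releases (2.x) errors='ignore' is deprecated, so on newer installs you may need to reproduce the same behavior with an explicit try/except. A sketch, assuming the same DataFrame:

def try_datetime(col):
    # attempt conversion; fall back to the original column on failure
    if col.dtypes != object:
        return col
    try:
        return pd.to_datetime(col)
    except (ValueError, TypeError):
        return col

df = df.apply(try_datetime, axis=0)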