I'm using pandas' read_sql() function to read multiple SQL tables into DataFrames. To get datetime columns, read_sql() requires a pre-specified list of column names via its parse_dates parameter; it does not infer datetimes automatically from varchar columns on the server. Because of this, I get DataFrames in which all columns are of dtype object:
  col1                         col2
0    A  2017-02-04 10:41:00.0000000
1    B  2017-02-04 10:41:00.0000000
2    C  2017-02-04 10:41:00.0000000
3    D  2017-02-04 10:41:00.0000000
4    E  2017-02-03 06:13:00.0000000
Is there a built-in pandas function to automatically infer which columns should be datetime64[ns], WITHOUT having to specify the column names?
I've tried:
df.apply(pd.to_datetime(df, infer_datetime_format=True), axis=1)
which results in an error:
to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
I also tried:
pd.to_datetime(df.stack(), errors='ignore', format='%Y%m%d% H%M%S%f').unstack()
and
pd.to_datetime(df.stack(), errors='coerce', format='%Y%m%d% H%M%S%f').unstack()
But neither works.
Any suggestions about how to infer datetime columns automatically after the DataFrame is constructed?
Pandas has a built-in function, to_datetime(), that converts dates and times in string form to datetime objects: applied to a string-typed column, it returns a Series of the appropriate datetime64 dtype. When reading from SQL, the parse_dates parameter is what convinces pandas to turn such columns into real datetime types; it takes a list of column names, since you may want to parse several columns. Neither, however, discovers the datetime columns for you.
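For reference, the explicit route looks like this (a minimal sketch; the SQLite connection and the table/column names my_table and col2 are assumptions for illustration):

import sqlite3
import pandas as pd

# Hypothetical database and table; parse_dates names the columns that
# read_sql should convert to datetime64[ns] while reading.
conn = sqlite3.connect('example.db')
df = pd.read_sql('SELECT * FROM my_table', conn, parse_dates=['col2'])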
There is currently no built-in that converts object columns to datetime automatically. One simple approach is a list comprehension with a regex pattern matching the datetime varchar, i.e.
If you have a df (based on @Alexander's df):

df = pd.DataFrame({'col1': ['A', 'B', 'C', 'D', 'E'],
                   'col2': ['2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-03 14:13:00'],
                   'col3': [0, 1, 2, 3, 4],
                   'col4': ['2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-03 14:13:00']})
# Convert a column only when every value matches the YYYY-MM-DD HH:MM:SS pattern
data = [pd.to_datetime(df[x])
        if df[x].astype(str).str.match(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}').all()
        else df[x]
        for x in df.columns]
df = pd.concat(data, axis=1, keys=[s.name for s in data])
or with the help of a mask, i.e.
# Boolean mask of the columns whose values all match the datetime pattern
mask = df.astype(str).apply(lambda x: x.str.match(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}').all())
df.loc[:, mask] = df.loc[:, mask].apply(pd.to_datetime)
df.dtypes
Output:
col1            object
col2    datetime64[ns]
col3             int64
col4    datetime64[ns]
dtype: object
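As a quick check (this snippet is mine, not part of the answer above), the converted columns now support the usual .dt accessor:

df['col2'].dt.hour   # 18, 18, 18, 18, 14
df['col2'].dt.date   # plain datetime.date objects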
If you have mixed date formats, you can use a looser pattern such as r'(\d{2,4}-\d{2}-\d{2,4})+'. E.g.:
ndf = pd.DataFrame({'col3': [0, 1, 2, 3, 4],
                    'col4': ['2017-02-04 18:41:00',
                             '2017-02-04 18:41:00',
                             '2017-02-04 18:41:00',
                             '2017-02-04 18:41:00',
                             '2017-02-03 14:13:00'],
                    'col5': ['2017-02-04',
                             '2017-02-04',
                             '17-02-2004 14:13:00',
                             '17-02-2014',
                             '2017-02-03']})
mask = ndf.astype(str).apply(lambda x: x.str.match(r'(\d{2,4}-\d{2}-\d{2,4})+').all())
ndf.loc[:, mask] = ndf.loc[:, mask].apply(pd.to_datetime)
Output:

   col3                col4                col5
0     0 2017-02-04 18:41:00 2017-02-04 00:00:00
1     1 2017-02-04 18:41:00 2017-02-04 00:00:00
2     2 2017-02-04 18:41:00 2004-02-17 14:13:00
3     3 2017-02-04 18:41:00 2014-02-17 00:00:00
4     4 2017-02-03 14:13:00 2017-02-03 00:00:00
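If you'd rather avoid maintaining regex patterns altogether, here is a generic sketch (not a pandas built-in; the helper name infer_datetime_columns is my own): attempt the conversion on each object column and keep it only when every value parses.

def infer_datetime_columns(frame):
    # Try to parse each object-dtype column; errors='coerce' turns
    # unparseable values into NaT, so keep the conversion only when
    # nothing failed to parse.
    out = frame.copy()
    for col in out.select_dtypes(include='object').columns:
        converted = pd.to_datetime(out[col], errors='coerce')
        if converted.notna().all():
            out[col] = converted
    return out

df = infer_datetime_columns(df)   # col1 stays object; col2/col4 become datetime64[ns]

The trade-off is that this will convert any column whose strings merely look like dates, so the regex approaches above give you tighter control over what counts as a datetime column.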
Hope it helps