I have read-only access to a database that I query and read into a Pandas dataframe using pymssql. One of the variables contains dates, some of which are stored as midnight on 01 Jan 0001 (i.e. 0001-01-01 00:00:00.0000000). I've no idea why those dates should be included – as far as I know, they are not recognised as a valid date by SQL Server and they are probably due to some default data entry. Nevertheless, that's what I have to work with. This can be recreated as a dataframe as follows:
import numpy as np
import pandas as pd
tempDF = pd.DataFrame({ 'id': [0,1,2,3,4],
'date': ['0001-01-01 00:00:00.0000000',
'2015-05-22 00:00:00.0000000',
'0001-01-01 00:00:00.0000000',
'2015-05-06 00:00:00.0000000',
'2015-05-03 00:00:00.0000000']})
The dataframe looks like:
print(tempDF)
date id
0 0001-01-01 00:00:00.0000000 0
1 2015-05-22 00:00:00.0000000 1
2 0001-01-01 00:00:00.0000000 2
3 2015-05-06 00:00:00.0000000 3
4 2015-05-03 00:00:00.0000000 4
... with the following dtypes:
print(tempDF.dtypes)
date object
id int64
dtype: object
print(tempDF.dtypes)
However, I routinely convert date fields in the dataframe to datetime format using:
tempDF['date'] = pd.to_datetime(tempDF['date'])
However, by chance, I've noticed that the 0001-01-01 date is converted to 2001-01-01.
print(tempDF)
date id
0 2001-01-01 0
1 2015-05-22 1
2 2001-01-01 2
3 2015-05-06 3
4 2015-05-03 4
I realise that the dates in the original database are incorrect because SQL Server doesn't see 0001-01-01 as a valid date. But at least in the 0001-01-01 format, such missing data are easy to identify within my Pandas dataframe. However, when pandas.to_datetime() changes these dates so they lie within a feasible range, it is very easy to miss such outliers.
How can I make sure that pd.to_datetime doesn't interpret the outlier dates incorrectly?
If you provide a format
, these dates will not be recognized:
In [92]: pd.to_datetime(tempDF['date'], format="%Y-%m-%d %H:%M:%S.%f", errors='coerce')
Out[92]:
0 NaT
1 2015-05-22
2 NaT
3 2015-05-06
4 2015-05-03
Name: date, dtype: datetime64[ns]
By default it will error, but by passing errors='coerce'
, they are converted to NaT values (coerce=True
for older pandas versions).
The reason pandas converts these "0001-01-01" dates to "2001-01-01" without providing a format
, is because this is the behaviour of dateutil
:
In [32]: import dateutil
In [33]: dateutil.parser.parse("0001-01-01")
Out[33]: datetime.datetime(2001, 1, 1, 0, 0)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With