Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I prevent pandas.to_datetime() function from converting 0001-01-01 to 2001-01-01

I have read-only access to a database that I query and read into a Pandas dataframe using pymssql. One of the variables contains dates, some of which are stored as midnight on 01 Jan 0001 (i.e. 0001-01-01 00:00:00.0000000). I've no idea why those dates should be included – as far as I know, they are not recognised as a valid date by SQL Server and they are probably due to some default data entry. Nevertheless, that's what I have to work with. This can be recreated as a dataframe as follows:

import numpy as np
import pandas as pd

tempDF = pd.DataFrame({ 'id': [0,1,2,3,4],
                        'date': ['0001-01-01 00:00:00.0000000',
                                 '2015-05-22 00:00:00.0000000',
                                 '0001-01-01 00:00:00.0000000',
                                 '2015-05-06 00:00:00.0000000',
                                 '2015-05-03 00:00:00.0000000']})

The dataframe looks like:

print(tempDF)
                          date  id
0  0001-01-01 00:00:00.0000000   0
1  2015-05-22 00:00:00.0000000   1
2  0001-01-01 00:00:00.0000000   2
3  2015-05-06 00:00:00.0000000   3
4  2015-05-03 00:00:00.0000000   4

... with the following dtypes:

print(tempDF.dtypes)

date    object
id       int64
dtype: object
print(tempDF.dtypes)

However, I routinely convert date fields in the dataframe to datetime format using:

tempDF['date'] = pd.to_datetime(tempDF['date'])

However, by chance, I've noticed that the 0001-01-01 date is converted to 2001-01-01.

print(tempDF)

        date  id
0 2001-01-01   0
1 2015-05-22   1
2 2001-01-01   2
3 2015-05-06   3
4 2015-05-03   4

I realise that the dates in the original database are incorrect because SQL Server doesn't see 0001-01-01 as a valid date. But at least in the 0001-01-01 format, such missing data are easy to identify within my Pandas dataframe. However, when pandas.to_datetime() changes these dates so they lie within a feasible range, it is very easy to miss such outliers.

How can I make sure that pd.to_datetime doesn't interpret the outlier dates incorrectly?

like image 854
user1718097 Avatar asked Feb 14 '16 12:02

user1718097


1 Answers

If you provide a format, these dates will not be recognized:

In [92]: pd.to_datetime(tempDF['date'], format="%Y-%m-%d %H:%M:%S.%f", errors='coerce')
Out[92]:
0          NaT
1   2015-05-22
2          NaT
3   2015-05-06
4   2015-05-03
Name: date, dtype: datetime64[ns]

By default it will error, but by passing errors='coerce', they are converted to NaT values (coerce=True for older pandas versions).

The reason pandas converts these "0001-01-01" dates to "2001-01-01" without providing a format, is because this is the behaviour of dateutil:

In [32]: import dateutil

In [33]: dateutil.parser.parse("0001-01-01")
Out[33]: datetime.datetime(2001, 1, 1, 0, 0)
like image 99
joris Avatar answered Oct 30 '22 16:10

joris