
pandas to_datetime parsing wrong year

I'm coming across something that is almost certainly a stupid mistake on my part, but I can't seem to figure out what's going on.

Essentially, I have a series of dates as strings in the format "%d-%b-%y", such as 26-Sep-05. When I go to convert them to datetime, the year is sometimes correct, but sometimes it is not.

E.g.:

import pandas as pd

dates = ['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']

pd.to_datetime(dates, format="%d-%b-%y")
DatetimeIndex(['2005-09-26', '2005-09-26', '1970-06-15', '1994-12-05',
               '2061-01-09', '2055-02-08'],
              dtype='datetime64[ns]', freq=None)

The last two entries, which get returned as 2061 and 2055 for the years, are wrong. But this works fine for the 15-Jun-70 entry. What's going on here?

dan_g asked Jun 11 '16 17:06




2 Answers

That seems to be the behavior of Python's datetime library. I ran a test to find the cutoff, which falls between 68 and 69:

>>> import datetime
>>> datetime.datetime.strptime('31-Dec-68', '%d-%b-%y').date()
datetime.date(2068, 12, 31)
>>> datetime.datetime.strptime('1-Jan-69', '%d-%b-%y').date()
datetime.date(1969, 1, 1)

Two-digit year ambiguity

So any %y value below 69 is assigned to the 2000s (00 to 68 become 2000 to 2068), and values from 69 upwards are assigned to the 1900s (69 to 99 become 1969 to 1999).

A two-digit %y year can only range from 00 to 99, so it becomes ambiguous as soon as the data spans more than one century.

If there is no overlap, you can process the data manually and annotate the century (removing the ambiguity)

I suggest you process your data manually and specify the century, e.g. you can decide that any year between 17 and 68 in your data is attributed to 1917 to 1968 (instead of 2017 to 2068).
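As a rough sketch of that manual fix, applied to the dates from the question: parse as usual, then shift anything that lands after a cutoff back by one century. The cutoff (2016-06-11) and the names `cutoff` and `fixed` are illustrative assumptions, not part of the original data; pick whatever latest plausible date fits your data.

```python
import pandas as pd

dates = ['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']

# Default parsing: %y values 00-68 land in 2000-2068.
parsed = pd.to_datetime(dates, format="%d-%b-%y")

# Assumption: no date in this data set can be later than the cutoff.
cutoff = pd.Timestamp("2016-06-11")

# Keep dates on or before the cutoff; move later ones back 100 years.
fixed = parsed.where(parsed <= cutoff, parsed - pd.DateOffset(years=100))

print(fixed.year.tolist())  # [2005, 2005, 1970, 1994, 1961, 1955]
```

This only works when the data has no overlap, i.e. when a single cutoff separates the two centuries.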

If you have overlap, the two-digit year alone is not enough information, unless e.g. the data is ordered and you have a reference

If you have overlap, e.g. data from both 2016 and 1916 that were both logged as '16', the year is ambiguous and there isn't sufficient information to parse it, unless the data is ordered by date, in which case you can use heuristics to switch the century as you parse.
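One possible version of such a heuristic, assuming the strings are in chronological order and the century of the first entry is known: whenever a date would go backwards, roll the century forward. The function name `parse_ordered`, its `first_century` parameter, and the sample list `logged` are hypothetical, made up for illustration.

```python
from datetime import datetime

def parse_ordered(date_strings, first_century=1900):
    """Parse chronologically ordered 'd-Mon-yy' strings, bumping the
    century by 100 whenever a date would precede the previous one."""
    century = first_century
    previous = None
    result = []

    def build(day_month, yy):
        # Re-parse with an explicit four-digit year to avoid the %y pivot.
        return datetime.strptime(f"{day_month}-{century + int(yy):04d}", "%d-%b-%Y")

    for text in date_strings:
        day_month, yy = text.rsplit("-", 1)
        candidate = build(day_month, yy)
        if previous is not None and candidate < previous:
            century += 100  # the dates wrapped around, so a new century began
            candidate = build(day_month, yy)
        result.append(candidate.date())
        previous = candidate
    return result

# Hypothetical log, ordered by date, spanning two centuries.
logged = ['3-Mar-16', '9-Jan-61', '15-Jun-70', '26-Sep-05', '1-Feb-16']
print([d.year for d in parse_ordered(logged)])  # [1916, 1961, 1970, 2005, 2016]
```

The sketch breaks down if the data is not strictly ordered or if two consecutive entries are more than a century apart.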

bakkal answered Oct 02 '22 19:10


From the Python docs for the time module:

Year 2000 (Y2K) issues: Python depends on the platform’s C library, which generally doesn’t have year 2000 issues, since all dates and times are represented internally as seconds since the epoch. Function strptime() can parse 2-digit years when given %y format code. When 2-digit years are parsed, they are converted according to the POSIX and ISO C standards: values 69–99 are mapped to 1969–1999, and values 0–68 are mapped to 2000–2068.

MaxU - stop WAR against UA answered Oct 02 '22 19:10