I'm coming across something that is almost certainly a stupid mistake on my part, but I can't seem to figure out what's going on.
Essentially, I have a series of dates as strings in the format "%d-%b-%y", such as 26-Sep-05. When I go to convert them to datetime, the year is sometimes correct, but sometimes it is not.
E.g.:
import pandas as pd
dates = ['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']
pd.to_datetime(dates, format="%d-%b-%y")
DatetimeIndex(['2005-09-26', '2005-09-26', '1970-06-15', '1994-12-05',
               '2061-01-09', '2055-02-08'],
              dtype='datetime64[ns]', freq=None)
The last two entries, which get returned as 2061 and 2055 for the years, are wrong. But this works fine for the 15-Jun-70 entry. What's going on here?
That seems to be the behavior of Python's datetime library. I ran a test to find the cutoff, which falls between 68 and 69:
>>> import datetime
>>> datetime.datetime.strptime('31-Dec-68', '%d-%b-%y').date()
datetime.date(2068, 12, 31)
>>> datetime.datetime.strptime('1-Jan-69', '%d-%b-%y').date()
datetime.date(1969, 1, 1)
Two-digit year ambiguity
So it seems that any %y year below 69 is assigned to the 2000s, and 69 upwards is assigned to the 1900s.
The two digits of %y can only go from 00 to 99, which is going to be ambiguous if we start crossing centuries.
If there is no overlap, you can process the data manually and annotate the century (removing the ambiguity)
I suggest you process your data manually and specify the century, e.g. you can decide that anything in your data with a year between 17 and 68 is attributed to 1917 - 1968 (instead of 2017 - 2068).
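As a rough illustration of that manual fix, here is a minimal sketch using only the standard library. The helper name parse_two_digit_year and the max_year parameter are made up for this example; the assumption is that no date in the data can be later than 2016, so any parsed year from 2017 to 2068 really belongs to 1917 - 1968.

```python
import datetime

def parse_two_digit_year(text, max_year=2016, fmt="%d-%b-%y"):
    """Parse a date with a two-digit year (hypothetical helper).

    Assumption: no real date in the data is later than max_year, so a
    parsed year above it is moved back by one century.
    """
    parsed = datetime.datetime.strptime(text, fmt).date()
    if parsed.year > max_year:
        # strptime chose 20xx, but the data cannot be that recent: use 19xx.
        # Note: with other max_year values, 29 February could land on a
        # non-leap year (e.g. 1900) and replace() would raise ValueError.
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed

dates = ['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']
fixed = [parse_two_digit_year(d) for d in dates]
for original, date in zip(dates, fixed):
    print(original, "->", date.isoformat())
```

With the question's data, the last two entries should come out as 1961 and 1955 while the 2005 entries are left alone. The resulting list can then be handed to pandas if a DatetimeIndex is needed.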
If there is overlap, the year information is insufficient, unless e.g. you have ordered data and a reference
If you have overlap, e.g. you have data from both 2016 and 1916 and both were logged as '16', that's ambiguous and there isn't sufficient information to parse this, unless the data is ordered by date, in which case you can use heuristics to switch the century as you parse it.
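One possible heuristic for the ordered case could look like the sketch below. It is only an illustration: the function name parse_ordered and the start_century reference are hypothetical, and it assumes the entries are in ascending date order with less than 100 years between consecutive entries.

```python
import datetime

def parse_ordered(texts, start_century=1900, fmt="%d-%b-%y"):
    """Parse chronologically ordered two-digit-year dates (hypothetical helper).

    Assumptions: the input is sorted ascending, the first entry belongs to
    start_century, and consecutive entries are less than 100 years apart.
    """
    result = []
    century = start_century
    previous = None
    for text in texts:
        raw = datetime.datetime.strptime(text, fmt).date()
        yy = raw.year % 100  # keep only the two digits that were logged
        # replace() raises ValueError for 29 February in a non-leap year
        # such as 1900; that case is not handled in this sketch.
        candidate = raw.replace(year=century + yy)
        if previous is not None and candidate < previous:
            # The date appears to go backwards, so a century boundary was crossed.
            century += 100
            candidate = raw.replace(year=century + yy)
        result.append(candidate)
        previous = candidate
    return result

log = ['12-May-16', '1-Jan-70', '30-Jun-99', '4-Jul-16']
parsed = parse_ordered(log, start_century=1900)
for original, date in zip(log, parsed):
    print(original, "->", date.isoformat())
```

Here the first '16' is read as 1916 and the second as 2016, because it follows a 1999 entry. If the data is not strictly ordered, this heuristic will assign wrong centuries.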
From the docs:
Year 2000 (Y2K) issues: Python depends on the platform’s C library, which generally doesn’t have year 2000 issues, since all dates and times are represented internally as seconds since the epoch. Function strptime() can parse 2-digit years when given %y format code. When 2-digit years are parsed, they are converted according to the POSIX and ISO C standards: values 69–99 are mapped to 1969–1999, and values 0–68 are mapped to 2000–2068.