I'm getting some quite unexpected behaviour from pandas' pd.to_datetime. My pd.Series is as follows:
0 2017-01-06 14:37:16
1 2017-01-27 00:00:00
2 2017-01-18 00:00:00
3 2017-01-26 00:00:00
4 None
...
454823 2019-10-22 11:20:03
454824 None
454825 2019-07-11 00:00:00
454826 None
454827 2019-07-15 00:00:00
Name: colx, Length: 454828, dtype: object
And when casting to datetime I'm getting:
pd.to_datetime(df.colx, errors='coerce')
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
While this seems to indicate that there are duplicate values in the index, I ran the following tests to check if that was the cause:
all(df.colx.index == range(df.colx.shape[0]))
# True
df.index.duplicated().any()
# False
So apparently there aren't any duplicate indexes. What could be causing this error?
The error you are receiving isn't related to your Index. It's related to this issue, which should have been fixed as of 0.25.0. The underlying problem was the following inconsistent handling of multiple null types when using pd.to_datetime:
import pandas as pd
pd.core.algorithms.unique([pd.NaT, None])
#array([NaT, None], dtype=object) # That is, `pd.NaT` and `None` are unique
pd.Index([pd.NaT, None]).is_unique
#False # However, Index.unique considers them duplicated
This inconsistent behavior should have been fixed with 0.25.0. If it's still giving you a problem, the solution is to ensure you have a single representation of null values in the column you need to convert. In this case you can .fillna with the datetime null value, pd.NaT.
pd.to_datetime(df.colx.fillna(pd.NaT), errors='coerce')
pd.__version__
#'0.25.0'
pd.to_datetime([pd.NaT, None])
#DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)
I noticed an error in your check procedure: calling df.index.duplicated().any() checks only that the index has no duplicates; it says nothing about duplicate values in the colx column itself.
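To make the distinction concrete, here is a minimal sketch with toy data of my own (not the original file):
import pandas as pd

# The RangeIndex 0, 1, 2 is unique, but the *values* in colx are not
df = pd.DataFrame({'colx': ['None', '2019-07-11 00:00:00', 'None']})

print(df.index.duplicated().any())  # False -- no duplicate index labels
print(df.duplicated().any())        # True  -- the 'None' row appears twice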
To investigate the issue, I created my input file from your data (just 10 data rows):
colx
2017-01-06 14:37:16
2017-01-27 00:00:00
2017-01-18 00:00:00
2017-01-26 00:00:00
None
2019-10-22 11:20:03
None
2019-07-11 00:00:00
None
2019-07-15 00:00:00
I read it with read_csv, called df.duplicated().any(), and the result was True, so there are duplicates in the colx column. Run df.duplicated() and you will see that True is printed for the rows with index 6 and 8 (the second and third instance of the None string).
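A sketch of that check, with the 10 rows rebuilt inline as literal strings (an assumption for reproducibility: newer pandas versions treat 'None' as an NA marker in read_csv by default, so building the frame directly mirrors what an older read_csv produced):
import pandas as pd

df = pd.DataFrame({'colx': [
    '2017-01-06 14:37:16', '2017-01-27 00:00:00', '2017-01-18 00:00:00',
    '2017-01-26 00:00:00', 'None', '2019-10-22 11:20:03', 'None',
    '2019-07-11 00:00:00', 'None', '2019-07-15 00:00:00',
]})

print(df.duplicated().any())  # True
print(df.duplicated())        # True at index 6 and 8 (repeated 'None' rows)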
Another check: run df.info() and you will get:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 1 columns):
colx 10 non-null object
dtypes: object(1)
memory usage: 208.0+ bytes
This confirms that no element holds a "true" None value; there are only strings containing "None".
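Continuing with the df built above, two hypothetical quick checks make the same point:
print(df.colx.eq('None').sum())  # 3 -- entries equal to the *string* 'None'
print(df.colx.isna().sum())      # 0 -- no actual null values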
Another remark: all(df.colx.index == range(df.colx.shape[0])) checks only that the index contains consecutive numbers, which says nothing about the content of colx.
I suppose you read your DataFrame by calling e.g. read_csv without any conversion, so the colx column is of object (actually string) type. In that case an attempt to call pd.to_datetime fails on the first element containing None (a string), because it cannot be converted to a datetime.
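A minimal sketch of that failure, assuming the default errors='raise' (note that the question's errors='coerce' would instead turn the unparseable string into NaT):
import pandas as pd

s = pd.Series(['2019-07-15 00:00:00', 'None'])
pd.to_datetime(s)  # raises ValueError: 'None' cannot be parsed as a datetime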
Try the following approach:
When reading the DataFrame, pass the na_values=['None'] parameter. This ensures that elements containing None are not left as strings but are converted to NaN.
Print the DataFrame (read from my limited source): instead of None (a string) there will be NaN, a special case of float.
Run df.info(). This time the printout will be:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 1 columns):
colx 7 non-null object
dtypes: object(1)
memory usage: 208.0+ bytes
Note that there are only 7 non-null values out of a total of 10, so the 3 remaining are "true" null values, which pandas prints as NaN.
Run pd.to_datetime(df.colx). This time there should be no error; a full sketch of the whole approach follows below.
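Here is the approach end to end; the sample CSV is rebuilt inline via StringIO (an assumption -- the original was read from a file on disk):
import pandas as pd
from io import StringIO

data = """colx
2017-01-06 14:37:16
2017-01-27 00:00:00
2017-01-18 00:00:00
2017-01-26 00:00:00
None
2019-10-22 11:20:03
None
2019-07-11 00:00:00
None
2019-07-15 00:00:00
"""

# na_values=['None'] converts the 'None' strings to real NaN on read
df = pd.read_csv(StringIO(data), na_values=['None'])

df.info()                # colx: 7 non-null out of 10 entries
pd.to_datetime(df.colx)  # no error; NaN values become NaT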