pd.to_datetime producing "Reindexing only valid with uniquely valued Index objects"

Tags: python, pandas
I'm getting quite an unexpected behaviour using pandas' pd.to_datetime. My pd.Series is as follows:

0         2017-01-06 14:37:16
1         2017-01-27 00:00:00
2         2017-01-18 00:00:00
3         2017-01-26 00:00:00
4                        None
                 ...         
454823    2019-10-22 11:20:03
454824                   None
454825    2019-07-11 00:00:00
454826                   None
454827    2019-07-15 00:00:00
Name: colx, Length: 454828, dtype: object

And when casting to datetime I'm getting:

pd.to_datetime(df.colx, errors='coerce')

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

While this seems to indicate that there are duplicate values in the index, I ran the following tests to check if that was the cause:

all(df.colx.index == range(df.colx.shape[0]))
# True

df.index.duplicated().any()
# False

So apparently there aren't any duplicate indexes. What could be causing this error?

asked Oct 28 '19 by yatu

2 Answers

The error you are receiving isn't related to your Index. It's related to this issue, which should have been fixed as of 0.25.0. The issue stemmed from the following inconsistent handling of multiple null types when using pd.to_datetime:

import pandas as pd

pd.core.algorithms.unique([pd.NaT, None])
#array([NaT, None], dtype=object) # That is, `pd.NaT` and `None` are unique

pd.Index([pd.NaT, None]).is_unique
#False   # However, Index.unique considers them duplicated

This inconsistent behavior was fixed in 0.25.0. If it's still giving you a problem, the solution is to ensure you have a single representation of null values in the column you need to convert. In this case you can .fillna with the datetime null value:

pd.to_datetime(df.colx.fillna(pd.NaT), errors='coerce')

pd.__version__
#'0.25.0'

pd.to_datetime([pd.NaT, None])
#DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)
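To see the fix end to end, here is a small runnable sketch using synthetic data shaped like the question's column (date strings mixed with None):

```python
import pandas as pd

# Synthetic column mixing date strings with real None values, as in the question
colx = pd.Series(
    ['2017-01-06 14:37:16', None, '2019-07-11 00:00:00', None],
    name='colx',
)

# Normalize every null to the datetime null (NaT) before converting,
# so pd.to_datetime sees a single null representation
converted = pd.to_datetime(colx.fillna(pd.NaT), errors='coerce')

print(converted.dtype)         # datetime64[ns]
print(converted.isna().sum())  # 2
```

On 0.25.0 and later the fillna step is no longer strictly required, but it is harmless and makes the intent explicit.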
answered Oct 06 '22 by ALollz


Start by clearing up a misunderstanding

I noticed a flaw in your check procedure:

Calling df.index.duplicated().any() checks only that the index has no duplicates.

To investigate the issue, I created my input file from your data (just 10 data rows):

colx
2017-01-06 14:37:16
2017-01-27 00:00:00
2017-01-18 00:00:00
2017-01-26 00:00:00
None
2019-10-22 11:20:03
None
2019-07-11 00:00:00
None
2019-07-15 00:00:00

I read it with read_csv, called df.duplicated().any(), and the result was True, so there are duplicate values in the colx column.

Run df.duplicated() and you will see that True is printed for the rows with index 6 and 8 (the second and third instances of the None string).

Another check: Run df.info() and you will get:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 1 columns):
colx    10 non-null object
dtypes: object(1)
memory usage: 208.0+ bytes

This confirms that no element holds a "true" None value. There are only strings containing "None".
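A direct way to confirm the distinction is to count the Python types present in the column (a small sketch with synthetic data; the column name colx is taken from the question):

```python
import pandas as pd

# The string 'None' (what a plain read of the text file produces)
# versus a real Python None
colx_as_read = pd.Series(['2017-01-27 00:00:00', 'None', 'None'], name='colx')
colx_real_nulls = pd.Series(['2017-01-27 00:00:00', None, None], name='colx')

# Count each Python type in the column
print(colx_as_read.map(type).value_counts())     # only <class 'str'>
print(colx_real_nulls.map(type).value_counts())  # str and NoneType

# df.info()'s "non-null" count treats only real nulls as missing
print(colx_as_read.isna().sum())     # 0 -> all 3 elements look "non-null"
print(colx_real_nulls.isna().sum())  # 2
```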

Another remark: all(df.colx.index == range(df.colx.shape[0])) checks only that the index contains consecutive numbers, which says nothing about the content of colx.

How you read your DataFrame

I suppose you read your DataFrame with e.g. read_csv, without any conversion, so the colx column is of object (actually string) type.

In that case an attempt to call pd.to_datetime fails on the first element containing None (a string), because it cannot be converted to datetime.

What to do

Try the following approach:

  1. When reading the DataFrame, pass the na_values=['None'] parameter. This ensures that elements containing None are not left as strings but are converted to NaN.

  2. Print the DataFrame (read from my limited source). Instead of None (a string) there will be NaN - a special case of float.

  3. Run df.info(). This time the printout will be:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10 entries, 0 to 9
    Data columns (total 1 columns):
    colx    7 non-null object
    dtypes: object(1)
    memory usage: 208.0+ bytes
    

    Note that there are only 7 non-null values, out of total 10, so the 3 remaining are "true" None values, which Pandas prints as NaN.

  4. Run pd.to_datetime(df.colx). This time there should be no error.
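The steps above can be sketched end to end; io.StringIO stands in for the input file, using the 10-row sample from earlier:

```python
import io
import pandas as pd

# The 10-row sample file from above, inlined for a self-contained example
data = """colx
2017-01-06 14:37:16
2017-01-27 00:00:00
2017-01-18 00:00:00
2017-01-26 00:00:00
None
2019-10-22 11:20:03
None
2019-07-11 00:00:00
None
2019-07-15 00:00:00
"""

# na_values=['None'] turns the 'None' strings into real NaN on read
df = pd.read_csv(io.StringIO(data), na_values=['None'])

print(df['colx'].notna().sum())  # 7 non-null values out of 10

# With the nulls unified, the conversion goes through cleanly
converted = pd.to_datetime(df['colx'])
print(converted.dtype)           # datetime64[ns]
```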

answered Oct 05 '22 by Valdi_Bo