Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unexpected result using .ix indexing with list vs range

Tags:

python

pandas

Can someone explain this behavior to me?

import pandas as pd

dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

df.ix['2000-01-01':'2000-01-02', ['A', 'C']]

## Output:
                   A         C
2000-01-01  0.224944 -0.689382
2000-01-02 -0.824735 -0.805512

df.ix[['2000-01-01', '2000-01-02'], ['A', 'C']]

## Output:
             A   C
2000-01-01 NaN NaN
2000-01-02 NaN NaN

I was expecting both indexing operations to return the same (first) result.

Then I sort of got it:

from datetime import datetime

df.loc[[datetime(2000, 1, 1), datetime(2000, 1, 5)], ['A','C']]

## Output
                   A         C
2000-01-01  0.224944 -0.689382
2000-01-05 -0.393747  0.462126

Now, I don't know the internals of pandas and why it implicitly converts strings to dates when given a range but not when given a list, but my guess is that a range makes it clear that we mean an object with ordinal nature so pandas perhaps checks the index, sees that it is a datetime and so parses the strings as dates.

But then the question becomes, why does it do the right thing when we supply a single string?

df.loc['2000-01-01', ['A','C']]

## Output:
A    0.224944
C   -0.689382
Name: 2000-01-01 00:00:00, dtype: float64

Is it a performance issue of not trying to convert multiple values when given a list? Some other design decision?

like image 951
kliron Avatar asked Dec 25 '15 20:12

kliron


1 Answers

Accessing DatetimeIndex with strings is kind-of hacked in (because R does this it's in there, but easy to find some edge cases like this). That is to say:

  • it works for slices.
  • it works for single access.
  • it may work for some other cases, but I wouldn't count on it.

It's much better to use Timestamps rather than strings:

In [11]: df.ix[pd.Timestamp('2000-01-01'), ['A','C']]
Out[11]:
A    0.480959
C    0.468689
Name: 2000-01-01 00:00:00, dtype: float64

In [12]: df.ix[pd.Timestamp('2000-01-01'):pd.Timestamp('2000-01-02'), ['A','C']]
Out[12]:
                   A         C
2000-01-01  0.480959  0.468689
2000-01-02 -0.971965 -0.840954

In [13]: df.ix[[pd.Timestamp('2000-01-01'), pd.Timestamp('2000-01-02')], ['A', 'C']]
Out[13]:
                   A         C
2000-01-01  0.480959  0.468689
2000-01-02 -0.971965 -0.840954

In [14]: df.ix[pd.to_datetime(['2000-01-01', '2000-01-02']), ['A', 'C']]
Out[14]:
                   A         C
2000-01-01  0.480959  0.468689
2000-01-02 -0.971965 -0.840954

As mentioned in your answer, this is a little cleaner (though there is no ambiguity in this case) as .loc rather than .ix.

like image 101
Andy Hayden Avatar answered Oct 15 '22 18:10

Andy Hayden