
Start:stop slicing inconsistencies between numpy and Pandas?

I am a bit surprised/confused about the following difference between numpy and Pandas

import numpy as np
import pandas as pd
a = np.random.randn(10,10)

> a[:3, 0, np.newaxis]

array([[-1.91687144],
       [-0.6399471 ],
       [-0.10005721]])

However:

b = pd.DataFrame(a)

> b.ix[:3,0]

0   -1.916871
1   -0.639947
2   -0.100057
3    0.251988

In other words, numpy does not include the stop index in start:stop notation, but Pandas does. I thought Pandas was based on Numpy. Is this a bug? Intentional?

Amelio Vazquez-Reina asked Feb 28 '13 01:02


3 Answers

This is documented, and it's part of Advanced Indexing. The key here is that you're not using a stop index at all.

The ix attribute is a special indexer that supports several kinds of advanced indexing by label: choosing a list of labels, using an inclusive range of labels instead of a half-open range of indices, and various other things.

If you don't want that, just don't use it:

In [191]: b[:3][0]
Out[191]: 
0   -0.209386
1    0.050345
2    0.318414
Name: 0

If you play with this a bit more without reading the docs, you'll probably come up with a case where your labels are, say, 'A', 'B', 'C', 'D' instead of 0, 1, 2, 3, and suddenly b.ix[:3] will return only 3 rows instead of 4, and you'll be baffled all over again.

The difference is that in that case, b.ix[:3] is a slice on positional indices, not on labels.

What you've requested in your code is actually ambiguous between "all labels up to and including 3" and "all indices up to but not including 3", and labels always win with ix (because if you don't want label slicing, you don't have to use ix in the first place). And that's why I said the problem is that you're not using a stop index at all.
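A note for readers on current pandas: .ix was deprecated in 0.20 and removed in 1.0, so the snippets above no longer run as written. The two readings it mixed are now spelled out explicitly as .loc (labels) and .iloc (positions). Here is a minimal sketch of the same distinction, assuming a recent pandas and NumPy; the seed and variable names are just for illustration:

```python
import numpy as np
import pandas as pd

# Same setup as in the question, with a fixed seed so the run is repeatable.
rng = np.random.default_rng(0)
a = rng.standard_normal((10, 10))
b = pd.DataFrame(a)  # default integer labels 0..9 on both axes

by_position = b.iloc[:3, 0]  # positional, half-open: rows 0, 1, 2
by_label = b.loc[:3, 0]      # label-based, inclusive: rows 0, 1, 2, 3

print(len(by_position))  # 3
print(len(by_label))     # 4

# With string labels the two readings cannot be confused with each other.
c = pd.DataFrame(a[:4], index=list("ABCD"))
first_three = c.iloc[:3]  # positions 0, 1, 2 -> labels A, B, C
up_to_c = c.loc[:"C"]     # labels A through C, inclusive

print(list(first_three.index))  # ['A', 'B', 'C']
print(list(up_to_c.index))      # ['A', 'B', 'C']
```

With the default integer index, the label 3 and the position 3 happen to be the same number, which is exactly where the confusion in the question comes from.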

abarnert answered Sep 30 '22 18:09


When the index type is integer, DataFrame.ix uses label-based indexing only. According to the documentation, a label-based slice includes both start and stop.

http://pandas.pydata.org/pandas-docs/dev/indexing.html#advanced-indexing-with-labels

Slicing with labels is semantically slightly different because the slice start and stop are inclusive in the label-based case.
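To illustrate the quoted sentence with something runnable: the sketch below uses a Series whose integer labels are deliberately not 0, 1, 2, ..., so a label and a position can't be mistaken for each other. It uses .loc and .iloc rather than .ix, since .ix is gone from current pandas; the data and names are made up for the example.

```python
import pandas as pd

# Integer labels that do not coincide with positions.
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=[10, 20, 30, 40])

# Label slice: the stop value 30 is a label and is included.
by_label = s.loc[:30]

# Positional slice: 30 is a position; ordinary Python slicing clips it
# to the four rows that actually exist.
by_position = s.iloc[:30]

print(list(by_label.index))     # [10, 20, 30]
print(list(by_position.index))  # [10, 20, 30, 40]
```

The same number, 30, selects different rows depending on whether it is read as a label or as a position.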

Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .ix. The following code will generate exceptions

HYRY answered Sep 30 '22 18:09


From the docs:

Slicing has standard Python semantics for integer slices

...

Slicing with labels is semantically slightly different because the slice start and stop are inclusive in the label-based case.
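Both quoted statements can be checked side by side with plain [] slicing. This is a small sketch, assuming a DataFrame with string row labels so that an integer slice can only be read positionally; the array contents and the labels are arbitrary:

```python
import numpy as np
import pandas as pd

a = np.arange(20).reshape(5, 4)
b = pd.DataFrame(a, index=list("vwxyz"))

rows_np = a[:3]        # NumPy: stop index excluded, 3 rows
rows_int = b[:3]       # integer slice in plain []: positional, 3 rows
rows_lbl = b["v":"x"]  # label slice in plain []: stop label included

print(rows_int.shape[0])     # 3
print(list(rows_lbl.index))  # ['v', 'w', 'x']
```

So the integer slice matches NumPy exactly, and only the label slice is inclusive.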

wim answered Sep 30 '22 17:09