I am a bit surprised/confused about the following difference between NumPy and pandas:
import numpy as np
import pandas as pd
a = np.random.randn(10,10)
> a[:3, 0, np.newaxis]
array([[-1.91687144],
[-0.6399471 ],
[-0.10005721]])
However:
b = pd.DataFrame(a)
> b.ix[:3,0]
0 -1.916871
1 -0.639947
2 -0.100057
3 0.251988
In other words, numpy does not include the stop
index in start:stop
notation, but Pandas does. I thought Pandas was based on Numpy. Is this a bug? Intentional?
This is documented, and it's part of Advanced Indexing. The key here is that you're not using a stop index at all.
The ix
attribute is a special thing that lets you do various kinds of advanced indexing by label—choosing a list of labels, using an inclusive range of labels instead of a half-exclusive range of indices, and various other things.
If you don't want that, just don't use it:
In [191]: b[:3][0]
Out[191]:
0 -0.209386
1 0.050345
2 0.318414
Name: 0
If you play with this a bit more without reading the docs, you'll probably come up with a case where your labels are, say, 'A', 'B', 'C', 'D'
instead of 0, 1, 2, 3
, and suddenly, b.ix[:3]
will return only 3 rows instead of 4, and you'll be baffled all over again.
The difference is that in that case, b.ix[:3]
is a slice on indices, not on labels.
What you've requested in your code is actually ambiguous between "all labels up to and including 3" and "all indices up to but not including 3", and labels always win with ix
(because if you don't want label slicing, you don't have to use ix
in the first place). And that's why I said the problem is that you're not using a stop index at all.
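To make the ambiguity concrete, here is a minimal sketch of the two readings of `:3` on a frame with the default integer index. It assumes a recent pandas, where `.ix` is no longer available and the two behaviours are spelled explicitly as `.loc` (labels) and `.iloc` (positions). The array contents and the names `by_label` and `by_position` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random 10x10 array in the question.
a = np.arange(100, dtype=float).reshape(10, 10)
b = pd.DataFrame(a)  # default integer labels 0..9 on both axes

# Label-based slice: the stop label 3 is included, so 4 rows come back.
by_label = b.loc[:3, 0]

# Position-based slice: the stop position 3 is excluded, as in NumPy.
by_position = b.iloc[:3, 0]

print(len(by_label), len(by_position))
```

With the labels equal to the positions, the only visible difference is the extra row at label 3, which is the row that surprised the asker.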
When the index type is integer, DataFrame.ix
will use label-based indexing only. According to the documentation, a label-based slice includes both start and stop.
http://pandas.pydata.org/pandas-docs/dev/indexing.html#advanced-indexing-with-labels
Slicing with labels is semantically slightly different because the slice start and stop are inclusive in the label-based case.
Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .ix. The following code will generate exceptions
From (docs):
Slicing has standard Python semantics for integer slices
...
Slicing with labels is semantically slightly different because the slice start and stop are inclusive in the label-based case.
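The quoted distinction is easier to see when the labels are not integers. The following is a small sketch, not taken from the docs: the frame `df`, its column `x`, and the labels `A` to `D` are hypothetical, and it uses `.loc` and `.iloc` on the assumption that you are on a pandas version where those are the explicit label and position accessors.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=["A", "B", "C", "D"])

# Label slice: both endpoints are included, so "C" is part of the result.
label_slice = df.loc["A":"C", "x"]

# Integer (positional) slice: standard Python semantics, stop excluded.
pos_slice = df.iloc[0:2, 0]

print(list(label_slice.index))  # row labels selected by the label slice
print(list(pos_slice.index))    # row labels selected by the positional slice
```

One reason often given for the inclusive stop is that labels have no general notion of "the next one", so an exclusive end point would be awkward to express.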