How does indexing work in pandas?

Tags: python, pandas

I am new to Python. This seems like a basic question, but I really want to understand what is happening here.

import numpy as np 
import pandas as pd 
tempdata = np.random.random(5)
myseries_one = pd.Series(tempdata)
myseries_two = pd.Series(data = tempdata, index = ['a','b','c','d','e'])
myseries_three = pd.Series(data = tempdata, index = [10,11,12,13,14])


myseries_one
Out[1]: 
0    0.291293
1    0.381014
2    0.923360
3    0.271671
4    0.605989
dtype: float64

myseries_two
Out[2]: 
a    0.291293
b    0.381014
c    0.923360
d    0.271671
e    0.605989
dtype: float64

myseries_three
Out[3]: 
10    0.291293
11    0.381014
12    0.923360
13    0.271671
14    0.605989
dtype: float64

Indexing the first element of each Series:

myseries_one[0] #As expected
Out[74]: 0.29129291112626043

myseries_two[0] #As expected
Out[75]: 0.29129291112626043

myseries_three[0]
KeyError: 0

Doubt 1: Why is this happening? Why does myseries_three[0] give me a KeyError? What do we mean by calling myseries_one[0], myseries_two[0], or myseries_three[0]? Does calling it this way mean we are selecting by row names?

Doubt 2: Do row names and row numbers in Python work differently from row names and row numbers in R?

myseries_one[0:2]
Out[78]: 
0    0.291293
1    0.381014
dtype: float64

myseries_two[0:2]
Out[79]: 
a    0.291293
b    0.381014
dtype: float64

myseries_three[0:2]
Out[80]: 
10    0.291293
11    0.381014
dtype: float64

Doubt 3: If calling myseries_three[0] means selecting by row name, then how does myseries_three[0:2] produce this output? Does myseries_three[0:2] mean we are selecting by row number? Please explain and guide. I am migrating from R to Python, so it is a bit confusing for me.

learner asked Aug 12 '16 12:08


1 Answer

When you index with myseries[something], the something is often ambiguous: it could be a label or a position. You are highlighting a case of that ambiguity. In your case, pandas is trying to help you out by guessing what you mean.

myseries_one[0] #As expected
Out[74]: 0.29129291112626043

myseries_one has integer labels. It makes sense that when you index with an integer, you intend to get the element labeled with that integer. It turns out that you have an element labeled 0, and so that is returned to you.

myseries_two[0] #As expected
Out[75]: 0.29129291112626043

myseries_two has string labels. It's highly unlikely that you meant to look up a label of 0 when the labels are all strings. So pandas assumes that you meant position 0 and returns the first element (thanks pandas, that was helpful).

myseries_three[0]
KeyError: 0

myseries_three has integer labels and you are indexing with an integer... perfect. Let's just get that value for you... KeyError. Whoops, the label 0 does not exist. In this case, it is safer for pandas to fail than to guess that maybe you meant to index by position. The documentation suggests that, to remove the ambiguity, you use loc for label-based indexing and iloc for position-based indexing.

Let's try loc

myseries_one.loc[0]
0.29129291112626043

myseries_two.loc[0]
KeyError: 0

myseries_three.loc[0]
KeyError: 0

Only myseries_one has a label 0. The other two raise a KeyError.

Let's try iloc

myseries_one.iloc[0]
0.29129291112626043

myseries_two.iloc[0]
0.29129291112626043

myseries_three.iloc[0]
0.29129291112626043

They all have a position of 0 and return the first element accordingly.
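
The scalar lookups above can be reproduced with fixed values instead of random ones. This is a minimal sketch, assuming the values 0.1 to 0.5 and a helper named lookup (both invented here for illustration, not part of the question). It sticks to loc and iloc on purpose: recent pandas releases emit a deprecation warning when plain [] falls back to positions for an integer key, as myseries_two[0] does, so that shortcut should not be relied on.

```python
import pandas as pd

# Fixed stand-in for np.random.random(5), so the results are reproducible.
values = [0.1, 0.2, 0.3, 0.4, 0.5]
s_default = pd.Series(values)                            # labels 0..4
s_letters = pd.Series(values, index=list("abcde"))       # labels 'a'..'e'
s_offset = pd.Series(values, index=[10, 11, 12, 13, 14])  # labels 10..14


def lookup(series, key, by):
    """Hypothetical helper: look up `key` by label (loc) or position (iloc).

    Returns the value as a plain float, or the string 'KeyError' when the
    label does not exist.
    """
    indexer = series.loc if by == "label" else series.iloc
    try:
        return float(indexer[key])
    except KeyError:
        return "KeyError"


# For each Series: (result of .loc[0], result of .iloc[0])
results = {
    name: (lookup(s, 0, "label"), lookup(s, 0, "position"))
    for name, s in [
        ("default", s_default),
        ("letters", s_letters),
        ("offset", s_offset),
    ]
}
print(results)
```

Only the default-indexed Series has a label 0, so it is the only one where the label lookup succeeds; the positional lookup succeeds for all three.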


For range slicing, pandas decides to be less interpretive and sticks to positional slicing for the integer slice 0:2. Keep in mind that actual real people (the programmers writing pandas) are the ones making these decisions. When you attempt something ambiguous, you may get varying results. To remove the ambiguity, use loc and iloc.

iloc

myseries_one.iloc[0:2]

0    0.291293
1    0.381014
dtype: float64

myseries_two.iloc[0:2]

a    0.291293
b    0.381014
dtype: float64

myseries_three.iloc[0:2]

10    0.291293
11    0.381014
dtype: float64

loc

myseries_one.loc[0:2]

0    0.291293
1    0.381014
2    0.923360
dtype: float64

myseries_two.loc[0:2]

TypeError: cannot do slice indexing on <class 'pandas.indexes.base.Index'> with these indexers [0] of <type 'int'>

myseries_three.loc[0:2]

Series([], dtype: float64)
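
The reason myseries_one.loc[0:2] returns three rows while myseries_one.iloc[0:2] returns two is that a label-based slice includes its end label, whereas a positional slice excludes its end position, like a Python list. A minimal sketch with made-up values (the names s_letters and s_offset are illustrative stand-ins for myseries_two and myseries_three):

```python
import pandas as pd

values = [0.1, 0.2, 0.3, 0.4, 0.5]
s_letters = pd.Series(values, index=list("abcde"))
s_offset = pd.Series(values, index=[10, 11, 12, 13, 14])

# Positional slice: the end position (2) is excluded.
by_position = s_offset.iloc[0:2]

# Label slice: both endpoint labels (10 and 12) are included.
by_label = s_offset.loc[10:12]

# Label slices also work with string labels, again including the end.
by_letter = s_letters.loc["a":"c"]

# No labels fall between 0 and 2 in s_offset, so the result is empty.
empty = s_offset.loc[0:2]

print(by_position.index.tolist())  # [10, 11]
print(by_label.index.tolist())     # [10, 11, 12]
print(by_letter.index.tolist())    # ['a', 'b', 'c']
print(len(empty))                  # 0
```

So to slice myseries_two by label, use its string labels, for example myseries_two.loc['a':'c'], rather than integers.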
piRSquared answered Oct 31 '22 19:10