Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understanding pandas dataframe indexing

Summary: This doesn't work:

df[df.key==1]['D'] = 1

but this does:

df.D[df.key==1] = 1

Why?

Reproduction:

In [1]: import pandas as pd

In [2]: from numpy.random import randn

In [4]: df = pd.DataFrame(randn(6,3),columns=list('ABC'))

In [5]: df
Out[5]: 
          A         B         C
0  1.438161 -0.210454 -1.983704
1 -0.283780 -0.371773  0.017580
2  0.552564 -0.610548  0.257276
3  1.931332  0.649179 -1.349062
4  1.656010 -1.373263  1.333079
5  0.944862 -0.657849  1.526811

In [6]: df['D']=0.0

In [7]: df['key']=3*[1]+3*[2]

In [8]: df
Out[8]: 
          A         B         C  D  key
0  1.438161 -0.210454 -1.983704  0    1
1 -0.283780 -0.371773  0.017580  0    1
2  0.552564 -0.610548  0.257276  0    1
3  1.931332  0.649179 -1.349062  0    2
4  1.656010 -1.373263  1.333079  0    2
5  0.944862 -0.657849  1.526811  0    2

This doesn't work:

In [9]: df[df.key==1]['D'] = 1

In [10]: df
Out[10]: 
          A         B         C  D  key
0  1.438161 -0.210454 -1.983704  0    1
1 -0.283780 -0.371773  0.017580  0    1
2  0.552564 -0.610548  0.257276  0    1
3  1.931332  0.649179 -1.349062  0    2
4  1.656010 -1.373263  1.333079  0    2
5  0.944862 -0.657849  1.526811  0    2

but this does:

In [11]: df.D[df.key==1] = 3.4

In [12]: df
Out[12]: 
          A         B         C    D  key
0  1.438161 -0.210454 -1.983704  3.4    1
1 -0.283780 -0.371773  0.017580  3.4    1
2  0.552564 -0.610548  0.257276  3.4    1
3  1.931332  0.649179 -1.349062  0.0    2
4  1.656010 -1.373263  1.333079  0.0    2
5  0.944862 -0.657849  1.526811  0.0    2

Link to notebook

My question is:

Why does only the 2nd way work? I can't seem to see a difference in selection/indexing logic.

Version is 0.10.0

Edit: This should not be done like this anymore. Since version 0.11, there is .loc . See here: http://pandas.pydata.org/pandas-docs/stable/indexing.html

like image 690
K.-Michael Aye Avatar asked Jan 07 '13 09:01

K.-Michael Aye


People also ask

What are DataFrame indexes?

Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns. Indexing can also be known as Subset Selection.

Is pandas indexing fast?

Like a Python dictionary (or a relational database's index), Pandas indexing provides a fast way to turn a key into a value.

Does indexing in pandas start with 0?

pandas. reset_index in pandas is used to reset index of the dataframe object to default indexing (0 to number of rows minus 1) or to reset multi level index.


1 Answers

The pandas documentation says:

Returning a view versus a copy

The rules about when a view on the data is returned are entirely dependent on NumPy. Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy. With single label / scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned.

In df[df.key==1]['D'] you first do boolean slicing (leading to a copy of the Dataframe), then you choose a column ['D'].

In df.D[df.key==1] = 3.4, you first choose a column, then do boolean slicing on the resulting Series.

This seems to make the difference, although I must admit that it is a little counterintuitive.

Edit: The difference was identified by Dougal, see his comment: With version 1, the copy is made as the __getitem__ method is called for the boolean slicing. For version 2, only the __setitem__ method is accessed - thus not returning a copy but just assigning.

like image 150
Thorsten Kranz Avatar answered Oct 15 '22 07:10

Thorsten Kranz