Summary: This doesn't work:
df[df.key==1]['D'] = 1
but this does:
df.D[df.key==1] = 1
Why?
Reproduction:
In [1]: import pandas as pd
In [2]: from numpy.random import randn
In [4]: df = pd.DataFrame(randn(6,3),columns=list('ABC'))
In [5]: df
Out[5]:
A B C
0 1.438161 -0.210454 -1.983704
1 -0.283780 -0.371773 0.017580
2 0.552564 -0.610548 0.257276
3 1.931332 0.649179 -1.349062
4 1.656010 -1.373263 1.333079
5 0.944862 -0.657849 1.526811
In [6]: df['D']=0.0
In [7]: df['key']=3*[1]+3*[2]
In [8]: df
Out[8]:
A B C D key
0 1.438161 -0.210454 -1.983704 0 1
1 -0.283780 -0.371773 0.017580 0 1
2 0.552564 -0.610548 0.257276 0 1
3 1.931332 0.649179 -1.349062 0 2
4 1.656010 -1.373263 1.333079 0 2
5 0.944862 -0.657849 1.526811 0 2
This doesn't work:
In [9]: df[df.key==1]['D'] = 1
In [10]: df
Out[10]:
A B C D key
0 1.438161 -0.210454 -1.983704 0 1
1 -0.283780 -0.371773 0.017580 0 1
2 0.552564 -0.610548 0.257276 0 1
3 1.931332 0.649179 -1.349062 0 2
4 1.656010 -1.373263 1.333079 0 2
5 0.944862 -0.657849 1.526811 0 2
but this does:
In [11]: df.D[df.key==1] = 3.4
In [12]: df
Out[12]:
A B C D key
0 1.438161 -0.210454 -1.983704 3.4 1
1 -0.283780 -0.371773 0.017580 3.4 1
2 0.552564 -0.610548 0.257276 3.4 1
3 1.931332 0.649179 -1.349062 0.0 2
4 1.656010 -1.373263 1.333079 0.0 2
5 0.944862 -0.657849 1.526811 0.0 2
My question is:
Why does only the 2nd way work? I can't seem to see a difference in selection/indexing logic.
Version is 0.10.0
Edit: This should not be done like this anymore. Since version 0.11, there is .loc. See here: http://pandas.pydata.org/pandas-docs/stable/indexing.html
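For the frame in the question, the .loc form might look like the sketch below. The small hand-written frame is an assumption standing in for the random data above; only the D and key columns matter here.

```python
import pandas as pd

# Small stand-in for the question's frame: D starts at 0.0, key marks two groups.
df = pd.DataFrame({"D": [0.0] * 6, "key": 3 * [1] + 3 * [2]})

# One indexing call: the row selector (boolean mask) and the column label
# are passed together, so the assignment targets df itself rather than
# an intermediate object.
df.loc[df.key == 1, "D"] = 1

print(df)
```

Because the row mask and the column label go into a single call, there is no intermediate result that the assignment could land on by mistake.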
The pandas documentation says:
Returning a view versus a copy
The rules about when a view on the data is returned are entirely dependent on NumPy. Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy. With single label / scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned.
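Since the quoted rule defers to NumPy, the distinction can be checked on a plain array. This is a minimal sketch with made-up values, not taken from the pandas documentation:

```python
import numpy as np

a = np.arange(6.0)            # [0., 1., 2., 3., 4., 5.]

sliced = a[1:4]               # basic slicing
masked = a[a > 2]             # boolean indexing

# Does each result share its buffer with the original array?
slice_shares = np.shares_memory(a, sliced)
mask_shares = np.shares_memory(a, masked)

sliced[0] = 100.0             # writes through to a[1]
masked[0] = -1.0              # changes only the separate copy

print(slice_shares, mask_shares)
print(a)
```

Writing into the slice changes the original array, while writing into the boolean-indexed result does not.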
In df[df.key==1]['D'] = 1, you first do boolean slicing (leading to a copy of the DataFrame), then you choose a column ['D']. The assignment therefore lands on that temporary copy, and the original df is left unchanged.
In df.D[df.key==1] = 3.4, you first choose a column, then do boolean slicing on the resulting Series.
This seems to make the difference, although I must admit that it is a little counterintuitive.
Edit: The difference was identified by Dougal, see his comment: With version 1, the copy is made as the __getitem__ method is called for the boolean slicing. For version 2, only the __setitem__ method is accessed - thus not returning a copy but just assigning.
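The order of method calls described in that comment can be made visible with a small logging class. Traced is a hypothetical stand-in written for this illustration, not a pandas object; it only records which special method Python invokes on which object.

```python
class Traced:
    """Hypothetical stand-in that logs which special method Python calls on it."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getitem__(self, key):
        self.log.append((self.name, "__getitem__"))
        # Return a fresh object, mimicking the copy made by boolean indexing.
        return Traced(self.name + "_copy", self.log)

    def __setitem__(self, key, value):
        self.log.append((self.name, "__setitem__"))


log = []
frame = Traced("frame", log)
frame.D = Traced("column_D", log)   # stands in for the column reached via df.D

frame["mask"]["D"] = 1      # version 1: __getitem__ first, then __setitem__ on the result
frame.D["mask"] = 3.4       # version 2: __setitem__ directly on the column object

for entry in log:
    print(entry)
```

In the first statement the __setitem__ call is recorded on frame_copy, the object returned by __getitem__, whereas in the second it is recorded on column_D, the object that belongs to the frame.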