
Pandas: Subindexing dataframes: Copies vs views

Say I have a dataframe

import pandas as pd
import numpy as np
foo = pd.DataFrame(np.random.random((10,5)))

and I create another dataframe from a subset of my data:

bar = foo.iloc[3:5,1:4] 

Does bar hold a copy of those elements from foo? Is there any way to create a view of that data instead? If so, what would happen if I tried to modify data through this view? Does pandas provide any sort of copy-on-write mechanism?

Asked by Amelio Vazquez-Reina on Jul 31 '13 at 02:07




1 Answer

Your answer lies in the pandas docs, in the section "Returning a view versus a copy":

Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy. With single label / scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned.
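
One way to probe the rule quoted above is to ask NumPy whether two results share memory. This is only a sketch: the frame df and the variable names are made up for illustration, and it assumes a pandas version that has DataFrame.to_numpy. The boolean-mask case should report False; the slice case is printed without an expected value because it can depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical single-dtype frame, so the values are predictable.
df = pd.DataFrame(np.arange(50, dtype=float).reshape(10, 5))

# Boolean-vector indexing: the selected rows are gathered into new memory.
mask = df[0] > 20
by_mask = df[mask]
mask_shares = np.shares_memory(by_mask.to_numpy(), df.to_numpy())
print("boolean mask shares memory with df:", mask_shares)

# Plain slicing: whether the result shares memory can vary between
# pandas versions, so it is printed rather than assumed.
by_slice = df.iloc[3:6]
slice_shares = np.shares_memory(by_slice.to_numpy(), df.to_numpy())
print("slice shares memory with df:", slice_shares)
```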

In your example, bar is a view of a slice of foo, so modifying bar also modifies foo. If you want a copy, call the copy method on the result: foo.iloc[3:5,1:4].copy(). As of the pandas version used below (0.12), pandas does not appear to have a copy-on-write mechanism.
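
If you do not want to rely on view semantics at all, the two unambiguous options are an explicit copy (edits never reach foo) or a single indexing call on foo itself (edits always reach foo). A minimal sketch, using a deterministic frame instead of random data so the values can be checked:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for foo: row r, column c holds 5*r + c.
foo = pd.DataFrame(np.arange(50, dtype=float).reshape(10, 5))

# An explicit copy owns its data, so editing it leaves foo alone.
bar = foo.iloc[3:5, 1:4].copy()
bar.iloc[0, 0] = -1.0
print(foo.iloc[3, 1])  # 16.0, unchanged

# To change foo itself, assign through one indexing call on foo.
foo.iloc[3, 1] = -1.0
print(foo.iloc[3, 1])  # -1.0
```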

See my code example below to illustrate:

In [1]: import pandas as pd
   ...: import numpy as np
   ...: foo = pd.DataFrame(np.random.random((10,5)))
   ...:

In [2]: pd.__version__
Out[2]: '0.12.0.dev-35312e4'

In [3]: np.__version__
Out[3]: '1.7.1'

In [4]: # DataFrame has copy method
   ...: foo_copy = foo.copy()

In [5]: bar = foo.iloc[3:5,1:4]

In [6]: bar == foo.iloc[3:5,1:4] == foo_copy.iloc[3:5,1:4]
Out[6]:
      1     2     3
3  True  True  True
4  True  True  True

In [7]: # Changing the view
   ...: bar.ix[3,1] = 5

In [8]: # View and DataFrame still equal
   ...: bar == foo.iloc[3:5,1:4]
Out[8]:
      1     2     3
3  True  True  True
4  True  True  True

In [9]: # It is now different from a copy of original
   ...: bar == foo_copy.iloc[3:5,1:4]
Out[9]:
       1     2     3
3  False  True  True
4   True  True  True
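
The session above was run on pandas 0.12. The .ix indexer it uses was removed in pandas 1.0, and whether a write to bar reaches foo can differ in later releases (to my knowledge it does not when Copy-on-Write is enabled, and older releases may emit a SettingWithCopyWarning). The following is a hedged sketch of the same experiment with .loc in place of .ix; it reports what your installed version does rather than assuming an outcome. The deterministic frame and the name propagated are my own additions.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for foo: row r, column c holds 5*r + c.
foo = pd.DataFrame(np.arange(50, dtype=float).reshape(10, 5))
foo_copy = foo.copy()

bar = foo.iloc[3:5, 1:4]

# Same write as In [7] above, with .loc instead of the removed .ix.
bar.loc[3, 1] = 5

# Did the write reach foo? This depends on the installed pandas version.
propagated = bool(foo.loc[3, 1] == 5)
print("pandas", pd.__version__, "- write to bar reached foo:", propagated)

# The explicit copy taken beforehand is unaffected either way.
print(foo_copy.loc[3, 1])  # 16.0
```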
Answered by davidshinn on Sep 22 '22 at 13:09