pandas dataframe view vs copy, how do I tell?

Tags:

python

pandas

What's the difference between:

df.loc[:,('col_a','col_b')]

and

df.loc[:,['col_a','col_b']]

The link below doesn't mention the latter, though it works. Do both pull a view? Does the first pull a view and the second pull a copy? Love learning Pandas.

http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Thanks

asked Dec 08 '14 21:12 by user3659451




1 Answer

If your DataFrame has a simple column index, then there is no difference. For example,

In [8]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('ABC'))

In [9]: df.loc[:, ['A','B']]
Out[9]: 
   A   B
0  0   1
1  3   4
2  6   7
3  9  10

In [10]: df.loc[:, ('A','B')]
Out[10]: 
   A   B
0  0   1
1  3   4
2  6   7
3  9  10

But if the DataFrame has a MultiIndex, there can be a big difference:

df = pd.DataFrame(np.random.randint(10, size=(5,4)),
                  columns=pd.MultiIndex.from_arrays([['foo']*2+['bar']*2,
                                                     list('ABAB')]),
                  index=pd.MultiIndex.from_arrays([['baz']*2+['qux']*3,
                                                   list('CDCDC')]))

#       foo    bar   
#         A  B   A  B
# baz C   7  9   9  9
#     D   7  5   5  4
# qux C   5  0   5  1
#     D   1  7   7  4
#     C   6  4   3  5

In [27]: df.loc[:, ('foo','B')]
Out[27]: 
baz  C    9
     D    5
qux  C    0
     D    7
     C    4
Name: (foo, B), dtype: int64

In [28]: df.loc[:, ['foo','B']]
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'

The KeyError is saying that the MultiIndex has to be lexsorted. If we do that, then we still get a different result:

In [29]: df.sortlevel(axis=1).loc[:, ('foo','B')]
Out[29]: 
baz  C    9
     D    5
qux  C    0
     D    7
     C    4
Name: (foo, B), dtype: int64

In [30]: df.sortlevel(axis=1).loc[:, ['foo','B']]
Out[30]: 
      foo   
        A  B
baz C   7  9
    D   7  5
qux C   5  0
    D   1  7
    C   6  4

Why is that? df.sortlevel(axis=1).loc[:, ('foo','B')] is selecting the column where the first column level equals foo, and the second column level is B.

In contrast, df.sortlevel(axis=1).loc[:, ['foo','B']] is selecting the columns where the first column level is either foo or B. With respect to the first column level, there are no B columns, but there are two foo columns.
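The tuple-versus-list distinction can be reproduced with a small deterministic frame. This is only a sketch under a few assumptions: it uses sort_index(axis=1) because sortlevel is not available in recent pandas releases, it uses a default row index and np.arange data for brevity, and the list contains only 'foo' because recent pandas versions may reject list entries that match no label. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Two-level column index, same layout as in the example above
columns = pd.MultiIndex.from_arrays([['foo'] * 2 + ['bar'] * 2, list('ABAB')])
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=columns)

# Lexsort the columns first (modern replacement for sortlevel(axis=1))
df_sorted = df.sort_index(axis=1)

# A tuple addresses one column: first level 'foo' AND second level 'B'
by_tuple = df_sorted.loc[:, ('foo', 'B')]

# A list holds first-level labels: every column whose first level is 'foo'
by_list = df_sorted.loc[:, ['foo']]

print(type(by_tuple).__name__, by_tuple.name)
print(list(by_list.columns))
```

The tuple yields a Series for a single column, while the list yields a DataFrame holding both foo columns.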

I think the operating principle with Pandas is that if you use df.loc[...] as an expression, you should assume df.loc may be returning a copy or a view. The Pandas docs do not specify any rules about which you should expect. However, if you make an assignment of the form

df.loc[...] = value

then you can trust Pandas to alter df itself.
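As a minimal sketch of that guarantee, the following assigns through a single df.loc call and then checks that df itself changed. The mask and the replacement value are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('ABC'))

# One .loc call on the left-hand side: rows and column selected together
df.loc[df['A'] > 3, 'B'] = -1

print(df['B'].tolist())
```

Because the row mask and the column label are passed to the same df.loc call, pandas performs the write on df directly.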

The documentation warns about the distinction between views and copies so that you are aware of the pitfall of chained assignments of the form

df.loc[...][...] = value

Here, Pandas evaluates df.loc[...] first, which may be a view or a copy. Now if it is a copy, then

df.loc[...][...] = value

is altering a copy of some portion of df, and thus has no effect on df itself. To add insult to injury, the effect on the copy is lost as well: no references to the copy remain once the assignment statement completes, so there is no way to access it, and (at least in CPython) it is garbage collected soon afterwards.
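A small sketch of the pitfall, assuming a boolean mask as the first indexer (a case where df.loc hands back a copy). Depending on the pandas version, the chained line emits a warning, which is silenced here only to keep the output clean.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('ABC'))

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas may warn about chained assignment
    # Chained form: df.loc[mask] is evaluated first and yields a copy,
    # so the assignment lands on that temporary object, not on df
    df.loc[df['A'] > 3]['B'] = -1

print(df['B'].tolist())  # df is unchanged
```

Rewriting the statement as a single df.loc[mask, 'B'] = -1 avoids the intermediate object altogether.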


I do not know of a practical fool-proof a priori way to determine if df.loc[...] is going to return a view or a copy.

However, there are some rules of thumb which may help guide your intuition (but note that we are talking about implementation details here, so there is no guarantee that Pandas needs to behave this way in the future):

  • If the resultant NDFrame cannot be expressed as a basic slice of the underlying NumPy array, then it will probably be a copy. Thus, a selection of arbitrary rows or columns will lead to a copy. A selection of sequential rows and/or sequential columns (which may be expressed as a slice) may return a view.
  • If the resultant NDFrame has columns of different dtypes, then df.loc will again probably return a copy.
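The first rule of thumb mirrors how NumPy itself behaves, which can be checked directly on a plain array. This sketch involves NumPy only and says nothing definitive about what pandas does on top of it.

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)

sliced = arr[:, 0:2]     # basic slice: sequential columns
picked = arr[:, [0, 2]]  # fancy indexing: arbitrary columns

slice_shares = np.shares_memory(arr, sliced)  # basic slices are views
fancy_shares = np.shares_memory(arr, picked)  # fancy indexing copies

print(slice_shares, fancy_shares)
```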

However, there is an easy way to determine a posteriori whether x = df.loc[..] is a view: simply see if changing a value in x affects df. If it does, x is a view; if not, x is a copy.
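That check might be sketched as follows. The helper name write_propagates is hypothetical, and the test is destructive: it modifies x, and df as well if x turns out to be a view. Note also that in pandas versions running with Copy-on-Write, a write to x is never expected to reach df, so the helper would report False there regardless of the selection.

```python
import warnings

import numpy as np
import pandas as pd

def write_propagates(parent, child):
    """Hypothetical helper: write into child and report whether parent changed.

    Mutates child, and parent too if child is a view of it.
    """
    snapshot = parent.copy(deep=True)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # older pandas may warn on this write
        child.iloc[0, 0] = child.iloc[0, 0] + 1
    return not parent.equals(snapshot)

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('ABC'))
x = df.loc[:, ['A', 'B']]  # arbitrary column list

x_is_view = write_propagates(df, x)
print(x_is_view)
```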

answered Oct 07 '22 22:10 by unutbu