What's the difference between pandas df.loc[:,('col_a','col_b')] and df.loc[:,['col_a','col_b']]? The link below doesn't mention the latter, though it works. Do both pull a view? Does the first pull a view and the second pull a copy? Love learning Pandas.

http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Thanks


pandas dataframe view vs copy, how do I tell?

1 Answer

If your DataFrame has a simple column index, then there is no difference. For example,

import numpy as np
import pandas as pd

In [8]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('ABC'))

In [9]: df.loc[:, ['A','B']]
Out[9]: 
   A   B
0  0   1
1  3   4
2  6   7
3  9  10

In [10]: df.loc[:, ('A','B')]
Out[10]: 
   A   B
0  0   1
1  3   4
2  6   7
3  9  10

But if the DataFrame has a MultiIndex, there can be a big difference:

df = pd.DataFrame(np.random.randint(10, size=(5,4)),
                  columns=pd.MultiIndex.from_arrays([['foo']*2+['bar']*2,
                                                     list('ABAB')]),
                  index=pd.MultiIndex.from_arrays([['baz']*2+['qux']*3,
                                                   list('CDCDC')]))

#       foo    bar   
#         A  B   A  B
# baz C   7  9   9  9
#     D   7  5   5  4
# qux C   5  0   5  1
#     D   1  7   7  4
#     C   6  4   3  5

In [27]: df.loc[:, ('foo','B')]
Out[27]: 
baz  C    9
     D    5
qux  C    0
     D    7
     C    4
Name: (foo, B), dtype: int64

In [28]: df.loc[:, ['foo','B']]
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'

The KeyError says that the column MultiIndex has to be lexsorted. If we sort it (here with sortlevel, which newer pandas releases replace with sort_index), the two forms still give different results:

In [29]: df.sortlevel(axis=1).loc[:, ('foo','B')]
Out[29]: 
baz  C    9
     D    5
qux  C    0
     D    7
     C    4
Name: (foo, B), dtype: int64

In [30]: df.sortlevel(axis=1).loc[:, ['foo','B']]
Out[30]: 
      foo   
        A  B
baz C   7  9
    D   7  5
qux C   5  0
    D   1  7
    C   6  4

Why is that? df.sortlevel(axis=1).loc[:, ('foo','B')] is selecting the column where the first column level equals foo, and the second column level is B.

In contrast, df.sortlevel(axis=1).loc[:, ['foo','B']] is selecting the columns where the first column level is either foo or B. With respect to the first column level, there are no B columns, but there are two foo columns.
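The tuple-versus-list distinction can be reproduced with a small sketch. It makes a few assumptions that are not in the session above: it uses sort_index(axis=1) instead of sortlevel, fixed values instead of random ones, a default row index, and a list containing only 'foo', because newer pandas releases may raise a KeyError for a list label such as 'B' that is missing from the first level.

```python
import numpy as np
import pandas as pd

# Two-level columns as in the answer; values are fixed so the result is reproducible.
columns = pd.MultiIndex.from_arrays([["foo"] * 2 + ["bar"] * 2, list("ABAB")])
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=columns)
df = df.sort_index(axis=1)  # lexsort the columns: bar/A, bar/B, foo/A, foo/B

# Tuple: one key per column level, so a single column comes back as a Series.
one_column = df.loc[:, ("foo", "B")]

# List: each element is looked up in the first column level only,
# so every column under 'foo' comes back as a DataFrame.
foo_block = df.loc[:, ["foo"]]

print(type(one_column).__name__, one_column.name)
print(foo_block.columns.tolist())
```

The tuple addresses one column by its full key, while the list is a collection of separate first-level labels.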

I think the operating principle with Pandas is that if you use df.loc[...] as an expression, you should assume the result may be either a copy or a view. The Pandas docs do not specify any rules about which one to expect. However, if you make an assignment of the form

df.loc[...] = value

then you can trust Pandas to alter df itself.
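As a minimal sketch of that assignment form (the mask and the value -1 are made up for illustration), selecting rows and a column in one df.loc call on the left-hand side writes into df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))

# One indexing call on the left-hand side: rows and column are chosen together,
# so pandas performs the write on df itself.
df.loc[df["A"] > 3, "B"] = -1

print(df)
```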

The documentation warns about the distinction between views and copies so that you are aware of the pitfall of chained assignments of the form

df.loc[...][...] = value

Here, Pandas evaluates df.loc[...] first, which may be a view or a copy. Now if it is a copy, then

df.loc[...][...] = value

is altering a copy of some portion of df, and thus has no effect on df itself. To add insult to injury, the effect on the copy is lost as well: nothing references the copy, so there is no way to access it after the assignment statement completes, and (at least in CPython) it is soon garbage collected.
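A small sketch of the pitfall, using a boolean mask (an assumption chosen because that kind of selection produces a copy). The warning filter is only there because pandas typically warns about this pattern; the exact warning class depends on the pandas version.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))
mask = df["A"] > 3

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the chained-assignment warning
    df.loc[mask]["B"] = -1           # chained: the write lands in a temporary copy

chained_result = df["B"].tolist()    # df still holds the original values

df.loc[mask, "B"] = -1               # single call: the write lands in df
direct_result = df["B"].tolist()

print(chained_result)
print(direct_result)
```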

I do not know of a practical fool-proof a priori way to determine if df.loc[...] is going to return a view or a copy.

However, there are some rules of thumb which may help guide your intuition (but note that we are talking about implementation details here, so there is no guarantee that Pandas needs to behave this way in the future):

- If the resultant NDFrame cannot be expressed as a basic slice of the underlying NumPy array, then it probably will be a copy. Thus, a selection of arbitrary rows or columns will lead to a copy. A selection of sequential rows and/or sequential columns (which may be expressed as a slice) may return a view.
- If the resultant NDFrame has columns of different dtypes, then df.loc will again probably return a copy.

However, there is an easy way to determine a posteriori whether x = df.loc[..] is a view: simply see if changing a value in x affects df. If it does, x is a view; if not, x is a copy.
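That check can be wrapped in a small helper. This is only a sketch: affects_parent is a hypothetical name, it runs against a scratch copy so the real df is never modified, and it assumes the selection returns a DataFrame with at least one numeric cell. The answer for a slice selection depends on the pandas version; with Copy-on-Write enabled I would expect False for every selection.

```python
import warnings

import numpy as np
import pandas as pd

def affects_parent(df, select):
    """Return True if writing into select(df) also changes the parent frame.

    Hypothetical helper: the experiment is run on a scratch copy of df.
    """
    work = df.copy()
    before = work.copy()
    x = select(work)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # pandas may warn about writing to x
        x.iloc[0, 0] = x.iloc[0, 0] + 1000
    return not work.equals(before)

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))

# Arbitrary column list: expected to behave as a copy.
list_is_view = affects_parent(df, lambda d: d.loc[:, ["A", "B"]])

# Contiguous slice: may behave as a view, depending on the pandas version.
slice_is_view = affects_parent(df, lambda d: d.loc[:, "A":"B"])

print(list_is_view, slice_is_view)
```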

answered Oct 07 '22 22:10

unutbu


Tags:

python

pandas

user3659451
