Python: Pandas Series - Why use loc?

Tags: python, pandas, series, pandas-loc

Why do we use 'loc' for pandas dataframes? It seems the following code, with or without loc, both compiles and runs at a similar speed:

    %timeit df_user1 = df.loc[df.user_id=='5561']
    100 loops, best of 3: 11.9 ms per loop

or

    %timeit df_user1_noloc = df[df.user_id=='5561']
    100 loops, best of 3: 12 ms per loop

So why use loc?

Edit: This has been flagged as a duplicate question. But although "pandas iloc vs ix vs loc explanation?" does mention that

"you can do column retrieval just by using the data frame's getitem:"

    df['time']    # equivalent to df.loc[:, 'time']

it does not say why we use loc. It does explain lots of features of loc, but my specific question is 'why not just omit loc altogether?', for which I have accepted a very detailed answer below.

Also, in that other post the answer (which I do not think is an answer) is buried in the discussion. Anyone searching for what I was looking for would find it hard to locate and would be much better served by the answer provided to my question.

asked Aug 11 '16 01:08

Runner Bean

1 Answer

Explicit is better than implicit.

df[boolean_mask] selects rows where boolean_mask is True, but there is a corner case when you might not want it to: when df has boolean-valued column labels:

```
In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df
Out[229]:
   False  True
0      3      1
1      4      2
2      5      3
```
You might want to use df[[True]] to select the True column. Instead it raises a ValueError:

```
In [230]: df[[True]]
ValueError: Item wrong length 1 instead of 3.
```
Versus using loc:

```
In [231]: df.loc[[True]]
Out[231]:
   False  True
0      3      1
```
In contrast, the following does not raise ValueError even though the structure of df2 is almost the same as that of df above:

```
In [258]: df2 = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]}); df2
Out[258]:
   A  B
0  1  3
1  2  4
2  3  5

In [259]: df2[['B']]
Out[259]:
   B
0  3
1  4
2  5
```
Thus, df[boolean_mask] does not always behave the same as df.loc[boolean_mask]. Even though this is arguably an unlikely use case, I would recommend always using df.loc[boolean_mask] instead of df[boolean_mask] because the meaning of df.loc's syntax is explicit. With df.loc[indexer] you know immediately that rows are being selected. In contrast, it is not clear whether df[indexer] will select rows or columns (or raise ValueError) without knowing details about indexer and df.

df.loc[row_indexer, column_indexer] can select rows and columns in one call. df[indexer] can select only rows or only columns, depending on the type of values in indexer and the type of column labels df has (again, are they boolean?).

```
In [237]: df2.loc[[True,False,True], 'B']
Out[237]:
0    3
2    5
Name: B, dtype: int64
```
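
As a rough sketch of the two points so far, the snippet below rebuilds df2 and checks that a full-length boolean mask returns the same rows with or without loc, while only loc can also pick a column in the same call. The mask condition and the names `mask`, `rows_plain`, `rows_loc` and `b_values` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]})

# A boolean mask with one entry per row (illustrative condition).
mask = df2['A'] != 2

rows_plain = df2[mask]     # rows selected implicitly
rows_loc = df2.loc[mask]   # rows selected explicitly

# With ordinary string column labels the two spellings agree.
same = rows_plain.equals(rows_loc)

# Only loc accepts a column indexer alongside the row mask.
b_values = df2.loc[mask, 'B']

print(same)               # True
print(b_values.tolist())  # [3, 5]
```

This matches the question's observation: for a plain row mask both forms do the same work, so the case for loc is about clarity and the extra column indexer, not speed.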
When a slice is passed to df.loc the end-points are included in the range. When a slice is passed to df[...], the slice is interpreted as a half-open interval:

```
In [239]: df2.loc[1:2]
Out[239]:
   A  B
1  2  4
2  3  5

In [271]: df2[1:2]
Out[271]:
   A  B
1  2  4
```
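
The difference is easier to see when the index labels are not 0, 1, 2. Here is a small sketch, assuming a hypothetical frame `df3` with integer labels 10, 20, 30 (not in the original answer): loc slices by label and includes both end-points, while a plain integer slice in df[...] counts positions and excludes the stop.

```python
import pandas as pd

# Hypothetical frame: same data as df2, but with labels 10, 20, 30.
df3 = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]}, index=[10, 20, 30])

by_label = df3.loc[10:20]   # label slice, both end-points included
by_position = df3[0:1]      # positional slice, stop excluded
no_rows = df3[10:20]        # positions 10..19 do not exist, so no rows

print(list(by_label.index))     # [10, 20]
print(list(by_position.index))  # [10]
print(len(no_rows))             # 0
```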

answered Oct 07 '22 12:10

unutbu
