
Python: Pandas Series - Why use loc?

Why do we use 'loc' for pandas dataframes? It seems the following code, with or without loc, runs at a similar speed:

%timeit df_user1 = df.loc[df.user_id=='5561']
100 loops, best of 3: 11.9 ms per loop

or

%timeit df_user1_noloc = df[df.user_id=='5561']
100 loops, best of 3: 12 ms per loop

So why use loc?

Edit: This has been flagged as a duplicate question. But although "pandas iloc vs ix vs loc explanation?" does mention that

    you can do column retrieval just by using the data frame's getitem:

df['time']    # equivalent to df.loc[:, 'time'] 

it does not say why we use loc, although it does explain many of loc's features. My specific question is 'why not just omit loc altogether?', for which I have accepted a very detailed answer below.

Also, in that other post the answer (which I do not think is an answer) is buried in the discussion. Anyone searching for what I was looking for would find it hard to locate there and would be much better served by the answer to this question.

Runner Bean asked Aug 11 '16 01:08





1 Answer

  • Explicit is better than implicit.

    df[boolean_mask] selects rows where boolean_mask is True, but there is a corner case when you might not want it to: when df has boolean-valued column labels:

    In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df
    Out[229]:
       False  True
    0      3      1
    1      4      2
    2      5      3

    You might want to use df[[True]] to select the True column. Instead it raises a ValueError:

    In [230]: df[[True]]
    ValueError: Item wrong length 1 instead of 3.

    Versus using loc:

    In [231]: df.loc[[True]]
    Out[231]:
       False  True
    0      3      1

    In contrast, the following does not raise ValueError, even though the structure of df2 is almost the same as that of df above:

    In [258]: df2 = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]}); df2
    Out[258]:
       A  B
    0  1  3
    1  2  4
    2  3  5

    In [259]: df2[['B']]
    Out[259]:
       B
    0  3
    1  4
    2  5

    Thus, df[boolean_mask] does not always behave the same as df.loc[boolean_mask]. Even though this is arguably an unlikely use case, I would recommend always using df.loc[boolean_mask] instead of df[boolean_mask] because the meaning of df.loc's syntax is explicit. With df.loc[indexer] you know automatically that df.loc is selecting rows. In contrast, it is not clear if df[indexer] will select rows or columns (or raise ValueError) without knowing details about indexer and df.

  • df.loc[row_indexer, column_indexer] can select rows and columns at the same time. df[indexer] can select only rows or only columns, depending on the type of values in indexer and the type of column labels df has (again, are they boolean?).

    In [237]: df2.loc[[True,False,True], 'B']
    Out[237]:
    0    3
    2    5
    Name: B, dtype: int64

  • When a slice is passed to df.loc the end-points are included in the range. When a slice is passed to df[...], the slice is interpreted as a half-open interval:

    In [239]: df2.loc[1:2]
    Out[239]:
       A  B
    1  2  4
    2  3  5

    In [271]: df2[1:2]
    Out[271]:
       A  B
    1  2  4
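
The three points above can be checked in one short script. This is only a sketch: the variable names (mask, df_bool, same_rows, and so on) are made up for illustration, and it assumes a reasonably recent pandas. The df.loc[[True]] call with a too-short mask is deliberately left out, since its behaviour may differ between pandas versions.

```python
import pandas as pd

# Same toy frame as df2 in the answer above.
df2 = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]})
mask = [True, False, True]  # hypothetical full-length boolean mask

# 1. With string column labels and a full-length mask,
#    df[mask] and df.loc[mask] select the same rows.
same_rows = df2[mask].equals(df2.loc[mask])

# 2. .loc can filter rows and pick a column in a single call.
picked = df2.loc[mask, "B"]

# 3. A label slice in .loc includes its end point;
#    a slice in df[...] is a half-open positional interval.
loc_slice = df2.loc[1:2]
plain_slice = df2[1:2]

# Corner case: with boolean column labels, df[[True]] is treated
# as a (too short) row mask instead of a column selection.
df_bool = pd.DataFrame({True: [1, 2, 3], False: [3, 4, 5]})
try:
    df_bool[[True]]
    raised_value_error = False
except ValueError:
    raised_value_error = True

print(same_rows)
print(picked.tolist())
print(len(loc_slice), len(plain_slice))
print(raised_value_error)
```

With .loc, the row and column roles are fixed by position in the brackets, so none of these results depend on what kind of labels the frame happens to have.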
unutbu answered Oct 07 '22 12:10
