Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas, loc vs non loc for boolean indexing

All the research I do point to using loc as the way to filter a dataframe by a col(s) value(s), today I was reading this and I discovered by the examples I tested, that loc isn't really needed when filtering cols by it's values:

EX:

df = pd.DataFrame(np.arange(0, 20, 0.5).reshape(8, 5), columns=['a', 'b', 'c', 'd', 'e'])    

df.loc[df['a'] >= 15]

      a     b     c     d     e
6  15.0  15.5  16.0  16.5  17.0
7  17.5  18.0  18.5  19.0  19.5

df[df['a'] >= 15]

      a     b     c     d     e
6  15.0  15.5  16.0  16.5  17.0
7  17.5  18.0  18.5  19.0  19.5

Note: I do know that doing loc or iloc return the rows by it's index and the position. I'm not comparing based on this functionality.

But when filtering, doing "where" clauses what's the difference between using or not using loc? If any. And why do all the examples I come across regarding this subject use loc?

like image 213
Miguel Avatar asked Nov 14 '18 09:11

Miguel


People also ask

Is boolean indexing possible in DataFrame?

Boolean indexing helps us to select the data from the DataFrames using a boolean vector. We need a DataFrame with a boolean index to use the boolean indexing.

Is Pandas query faster than loc?

The query function seams more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.

Why do we use loc in Pandas?

loc is used to select data by label. The labels of columns are the column names. For example, customerID, gender, SeniorCitizen are the first three column names (i.e. labels). We need to be careful about row labels.

What is the difference between loc [] and ILOC []?

The main difference between pandas loc[] vs iloc[] is loc gets DataFrame rows & columns by labels/names and iloc[] gets by integer Index/position. For loc[], if the label is not present it gives a key error. For iloc[], if the position is not present it gives an index error.


1 Answers

As per the docs, loc accepts a boolean array for selecting rows, and in your case

>>> df['a'] >= 15
>>> 
0    False
1    False
2    False
3    False
4    False
5    False
6     True
7     True
Name: a, dtype: bool

is treated as a boolean array.

The fact that you can omit loc here and issue df[df['a'] >= 15] is a special case convenience according to Wes McKinney, the author of pandas.

Quoting directly from his book, Python for Data Analysis, p. 144, df[val] is used to...

Select single column or sequence of columns from the DataFrame; special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion)

like image 78
timgeb Avatar answered Oct 16 '22 23:10

timgeb