Everything I've read points to using loc as the way to filter a DataFrame by the value(s) of one or more columns. Today I was reading this and found, from the examples I tested, that loc isn't really needed when filtering columns by their values:
Example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(0, 20, 0.5).reshape(8, 5), columns=['a', 'b', 'c', 'd', 'e'])
df.loc[df['a'] >= 15]
      a     b     c     d     e
6  15.0  15.5  16.0  16.5  17.0
7  17.5  18.0  18.5  19.0  19.5
df[df['a'] >= 15]
      a     b     c     d     e
6  15.0  15.5  16.0  16.5  17.0
7  17.5  18.0  18.5  19.0  19.5
Note: I know that loc and iloc return rows by index label and by position, respectively. I'm not comparing based on that functionality.
But when filtering with "where"-style clauses, what's the difference, if any, between using and not using loc? And why do all the examples I come across on this subject use loc?
Boolean indexing selects data from a DataFrame using a boolean vector. The vector needs one True/False entry per row, such as the result of a comparison on a column.
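To make the idea of a boolean vector concrete, here is a minimal sketch that reuses the example DataFrame from the question. The variable names (mask, filtered, both) are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20, 0.5).reshape(8, 5), columns=['a', 'b', 'c', 'd', 'e'])

# A comparison on a column yields a boolean Series aligned with df's index.
mask = df['a'] >= 15

# Indexing with that Series keeps only the rows where it is True.
filtered = df[mask]

# Conditions are combined element-wise with & and |, each wrapped in parentheses.
both = df[(df['a'] >= 5) & (df['e'] < 15)]

print(filtered)
print(both)
```

Note that Python's plain `and` / `or` do not work element-wise on a Series, which is why `&` and `|` are used here.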
The query function seems more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seems much more efficient than the query function.
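Which of the two is faster seems to depend on the data, so it is probably best measured on your own DataFrame. The sketch below, which assumes the small example frame from the question rather than the frames mentioned above, only checks that query and loc select the same rows and shows one way to time them with timeit. It makes no claim about which one wins.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20, 0.5).reshape(8, 5), columns=['a', 'b', 'c', 'd', 'e'])

# Two spellings of the same row filter.
via_loc = df.loc[df['a'] >= 15]
via_query = df.query('a >= 15')
same_rows = via_loc.equals(via_query)

# Time each spelling; the numbers depend on machine, pandas version and data size.
loc_seconds = timeit.timeit(lambda: df.loc[df['a'] >= 15], number=200)
query_seconds = timeit.timeit(lambda: df.query('a >= 15'), number=200)

print(same_rows)
print(f"loc: {loc_seconds:.4f}s  query: {query_seconds:.4f}s")
```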
loc is used to select data by label. The labels of columns are the column names. For example, customerID, gender, SeniorCitizen are the first three column names (i.e. labels). We need to be careful about row labels.
The main difference between pandas loc[] and iloc[] is that loc[] gets DataFrame rows and columns by labels/names, while iloc[] gets them by integer index/position. With loc[], a label that is not present raises a KeyError. With iloc[], a position that is out of range raises an IndexError.
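The label versus position distinction, and the two error types, can be sketched on a tiny frame. The frame and its labels ('x', 'y', 'z') are hypothetical and chosen so that labels and positions cannot be confused.

```python
import pandas as pd

# Hypothetical frame with string row labels.
small = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])

by_label = small.loc['y', 'a']    # row label 'y', column label 'a'
by_position = small.iloc[1, 0]    # second row, first column

# A missing label makes loc raise KeyError.
try:
    small.loc['missing']
except KeyError as err:
    loc_error = type(err).__name__

# An out-of-range position makes iloc raise IndexError.
try:
    small.iloc[10]
except IndexError as err:
    iloc_error = type(err).__name__

print(by_label, by_position, loc_error, iloc_error)
```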
As per the docs, loc accepts a boolean array for selecting rows, and in your case
>>> df['a'] >= 15
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 True
Name: a, dtype: bool
is treated as a boolean array.
The fact that you can omit loc here and issue df[df['a'] >= 15] is a special-case convenience according to Wes McKinney, the author of pandas.
Quoting directly from his book, Python for Data Analysis, p. 144, df[val] is used to...
Select single column or sequence of columns from the DataFrame; special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion)
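The cases listed in that quote can be tried out on the DataFrame from the question. This is only an illustrative sketch with made-up variable names. The last line is an addition of mine rather than something from the book: it shows that loc can take a column selector in the same call, which may be one reason so many examples reach for it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20, 0.5).reshape(8, 5), columns=['a', 'b', 'c', 'd', 'e'])

# Default behaviour: select a column or a sequence of columns.
two_cols = df[['a', 'b']]

# Special case: a boolean array filters rows, same result as df.loc[mask].
mask = df['a'] >= 15
filtered = df[mask]
same_as_loc = filtered.equals(df.loc[mask])

# Special case: a slice selects rows, not columns.
last_two = df[6:8]

# Special case: a boolean DataFrame sets values that match a criterion.
zeroed = df.copy()
zeroed[zeroed < 5] = 0

# loc can filter rows and pick columns in one call.
subset = df.loc[mask, ['a', 'b']]

print(same_as_loc)
print(subset)
```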