I am a little confused as to why the syntax for referring to a column within a pandas data frame differs depending on which method is being called. Take the following sample method chain
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.columns = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
(iris
.loc[:, ['SepalLength', 'PetalWidth', 'Species']]
.where(iris['SepalLength'] > 4.6)
.assign(PetalWidthx2 = lambda x_iris: x_iris['PetalWidth'] * 2)
.groupby('Species')
.agg({'SepalLength': 'mean', 'PetalWidthx2': 'std'}))
Here, there are three different kinds of syntax used to refer to columns within the iris data frame:
loc
, groupby
, and agg
all understand that a string refers to a column in the data frame.where
needs the data frame to be explicitly referenced.assign
method would cause the operation to be performed on the original iris data frame, and not the copy that has been modified by the calls to loc
and where
. Here, lambda
is needed to refer to the current state of the modified data frame copy.In addition to the above, there is also query
, which takes the entire method input as a string: iris.query('SepalLength > 4.6')
, but here the pandas documentation explicilty states that this is for special use cases:
A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you’re interested in querying
To provide an example of what I mean by consistent data frame column reference syntax, a comparison could be made to the R-package dplyr
, where columns in the data frame are referenced with the same syntax for all the piped function calls.
library(dplyr)
# The iris data set is preloaded in R
colnames(iris) = c('SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species')
iris %>%
select(SepalLength, PetalWidth, Species) %>%
filter(SepalLength > 4.6) %>%
mutate(PetalWidth2x = PetalWidth * 2) %>%
group_by(Species) %>%
summarise(SepalLength = mean(SepalLength), PetalWidth2x = sd(PetalWidth2x))
Are there advantages that pandas gains from having these different ways of referring to data frame columns, instead of applying the simplistic syntax used by loc
, groupby
and agg
to all the methods (if so, which are these benefits)? Or is this more of a workaround for some underlying issue with using strings for data frame column names in the assign
and where
methods?
To quote from Marius's comment:
I think the biggest difference between pandas and dplyr is that pandas works within Python's existing syntax rules, which are pretty strict with respect to what unquoted symbols can represent (basically objects in the current scope)...
I believe that this correct, so let's expand on it a bit.
loc
,groupby
, andagg
all understand that a string refers to a column in the data frame.
.loc[:, ['SepalLength', 'PetalWidth', 'Species']]
.groupby('Species')
.agg({'SepalLength': 'mean', 'PetalWidthx2': 'std'}))
In all three cases, a string is a valid entity in that context. That is, the string alone gives enough information to perform the operation. Unlike...
where
needs the data frame to be explicitly referenced.
.where(iris['SepalLength'] > 4.6)
In the case of where
, Python requires the operator >
to operate against something. By selecting a specific data frame column, an object was referenced for that column, and the __gt__
method will be called against that object.
If instead we wanted the syntax to look like:
.where('SepalLength' > 4.6)
We would need some way to tell Python what the >
operator meant in this context. The evaluation happens before being passed to where
. The existing language feature for this is to provide our own object, with appropriate methods defined, and this is what the pandas designers did. The default >
operation for a string is just not useful is this context.
Explicitly referring to the data frame in the
assign
method would cause the operation to be performed on the original iris data frame, and not the copy that has been modified by the calls toloc
andwhere
. Here,lambda
is needed to refer to the current state of the modified data frame copy.
.assign(PetalWidthx2 = lambda x_iris: x_iris['PetalWidth'] * 2)
If .assign
would have been used as the first method of the data frame, prior to any filtering taking place, we could simply have written this as
.assign(PetalWidthx2 = iris['PetalWidth'] * 2)
since the variable iris
already exists and is identical to the data frame we want to operate on.
However, since the previous calls to .loc
and .where
changes the dataframe we wish to call .assign
on, it is no longer identical to the iris
dataframe and there is no defined variable referencing the modified data frame. As pandas is using Python's existing syntax rules, it can utilize lambda
, which in this context essentially allows for operations on self
: the current state of the object which .assign
is called on. There is an example of this in the docs.
This uses the **kwargs of a method, which allows to specify an arbitrary number of parameters (new column names) and their arguments (the value for the new column). The **kwargs parameter=argument
pairs are interpreted internally as a dictionary key:value
pair, as can be seen in the source.
In addition to the above, there is also
query
, which takes the entire method input as a string:iris.query('SepalLength > 4.6')
, but here the pandas documentation explicitly states that this is for special use cases
In the case of query
, the string passed is an expression that will be compiled and executed by a backend that is generally much faster than executing python code. It is a special case because the available operations are quite limited, and the setup time for the backend engine is large, so it is really only useful for fairly large data sets.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With