
Seemingly inconsistent column reference syntax when chaining methods on pandas data frames

I am a little confused as to why the syntax for referring to a column within a pandas data frame differs depending on which method is being called. Take the following sample method chain

import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.columns = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
(iris
    .loc[:, ['SepalLength', 'PetalWidth', 'Species']]
    .where(iris['SepalLength'] > 4.6)
    .assign(PetalWidthx2 = lambda x_iris: x_iris['PetalWidth'] * 2)
    .groupby('Species')
    .agg({'SepalLength': 'mean', 'PetalWidthx2': 'std'}))

Here, there are three different kinds of syntax used to refer to columns within the iris data frame:

  • loc, groupby, and agg all understand that a string refers to a column in the data frame.
  • where needs the data frame to be explicitly referenced.
  • Explicitly referring to the data frame in the assign method would cause the operation to be performed on the original iris data frame, and not the copy that has been modified by the calls to loc and where. Here, lambda is needed to refer to the current state of the modified data frame copy.
  • In addition to the above, there is also query, which takes the entire method input as a string: iris.query('SepalLength > 4.6'), but here the pandas documentation explicitly states that this is for special use cases:

    A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you’re interested in querying
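As a minimal sketch of that use case (the frame names and values here are invented for illustration), the same query string can be reused across frames that merely share column names:

```python
import pandas as pd

# Two unrelated frames that happen to share the column 'a'
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [3, 2, 1], 'b': [6, 5, 4]})

# The same expression string works against either frame
expr = 'a > 1'
res1 = df1.query(expr)
res2 = df2.query(expr)
```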

To provide an example of what I mean by consistent data frame column reference syntax, a comparison could be made to the R-package dplyr, where columns in the data frame are referenced with the same syntax for all the piped function calls.

library(dplyr)

# The iris data set is preloaded in R
colnames(iris) = c('SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species')
iris %>% 
    select(SepalLength, PetalWidth, Species) %>% 
    filter(SepalLength > 4.6) %>%  
    mutate(PetalWidth2x = PetalWidth * 2) %>% 
    group_by(Species) %>% 
    summarise(SepalLength = mean(SepalLength), PetalWidth2x = sd(PetalWidth2x))

Are there advantages that pandas gains from having these different ways of referring to data frame columns, instead of applying the simpler syntax used by loc, groupby, and agg to all the methods (if so, what are these benefits)? Or is this more of a workaround for some underlying issue with using strings for data frame column names in the assign and where methods?

asked May 19 '17 by joelostblom


1 Answer

To quote from Marius's comment:

I think the biggest difference between pandas and dplyr is that pandas works within Python's existing syntax rules, which are pretty strict with respect to what unquoted symbols can represent (basically objects in the current scope)...

I believe this is correct, so let's expand on it a bit.


loc, groupby, and agg all understand that a string refers to a column in the data frame.

.loc[:, ['SepalLength', 'PetalWidth', 'Species']]
.groupby('Species')
.agg({'SepalLength': 'mean', 'PetalWidthx2': 'std'})

In all three cases, a string is a valid entity in that context. That is, the string alone gives enough information to perform the operation. Unlike...


where needs the data frame to be explicitly referenced.

.where(iris['SepalLength'] > 4.6)

In the case of where, Python requires the operator > to operate on something. Selecting a specific column gives us a Series object, and that object's __gt__ method is called with 4.6 as its argument.
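Concretely (with a small made-up frame), the comparison is just sugar for calling __gt__ on the Series:

```python
import pandas as pd

df = pd.DataFrame({'SepalLength': [4.5, 5.0, 4.7]})

# The > operator dispatches to Series.__gt__, yielding a boolean Series
mask = df['SepalLength'] > 4.6
same = df['SepalLength'].__gt__(4.6)  # equivalent explicit call
```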

If instead we wanted the syntax to look like:

.where('SepalLength' > 4.6)

We would need some way to tell Python what the > operator means in this context, because the expression is evaluated before it is passed to where. The existing language feature for this is to provide our own object with the appropriate comparison methods defined, which is what the pandas designers did. The default > operation for a string is simply not useful in this context.
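To illustrate what such an object might look like, here is a hypothetical Col helper (not part of pandas, purely an illustration) whose __gt__ builds a deferred predicate instead of returning a plain boolean:

```python
import pandas as pd

class Col:
    """Hypothetical deferred column reference (illustration only)."""
    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        # Instead of comparing now, return a function that performs
        # the comparison against whatever frame it is handed later
        return lambda df: df[self.name] > value

df = pd.DataFrame({'SepalLength': [4.5, 5.0, 4.7]})
predicate = Col('SepalLength') > 4.6  # no frame referenced yet
mask = predicate(df)                  # evaluated against a frame on demand
```

This deferred-expression pattern is the general idea behind expression-style data frame APIs; pandas instead chose to evaluate eagerly against an explicitly referenced frame.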


Explicitly referring to the data frame in the assign method would cause the operation to be performed on the original iris data frame, and not the copy that has been modified by the calls to loc and where. Here, lambda is needed to refer to the current state of the modified data frame copy.

.assign(PetalWidthx2 = lambda x_iris: x_iris['PetalWidth'] * 2)

If .assign had been used as the first method called on the data frame, prior to any filtering, we could simply have written this as

.assign(PetalWidthx2 = iris['PetalWidth'] * 2)

since the variable iris already exists and is identical to the data frame we want to operate on.

However, since the previous calls to .loc and .where change the data frame we wish to call .assign on, it is no longer identical to the iris data frame, and there is no variable referencing the modified copy. Because pandas works within Python's existing syntax rules, it can use lambda, which in this context essentially allows operations on self: the current state of the object that .assign is called on. There is an example of this in the docs.
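A minimal sketch (with a tiny invented frame) makes the difference visible: the direct reference sees the original values, while the lambda sees the intermediate frame produced by where:

```python
import pandas as pd

iris = pd.DataFrame({'SepalLength': [4.5, 5.0], 'PetalWidth': [0.2, 0.4]})

# Direct reference: computed from the *original* iris, ignoring where()
direct = (iris
    .where(iris['SepalLength'] > 4.6)
    .assign(PetalWidthx2=iris['PetalWidth'] * 2))

# Lambda: receives the intermediate frame, so the masked row stays NaN
chained = (iris
    .where(iris['SepalLength'] > 4.6)
    .assign(PetalWidthx2=lambda x: x['PetalWidth'] * 2))
```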

This uses the **kwargs of a method, which allows specifying an arbitrary number of parameters (new column names) and their arguments (the values for the new columns). The **kwargs parameter=argument pairs are interpreted internally as dictionary key:value pairs, as can be seen in the source.
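As a simplified sketch of that mechanism (my_assign is an invented toy, not the pandas source), the keyword arguments arrive as a dict, and each callable value is applied to the frame being built:

```python
import pandas as pd

def my_assign(frame, **kwargs):
    """Toy reimplementation of the assign idea (illustration only)."""
    out = frame.copy()
    for name, value in kwargs.items():  # kwargs is a dict of name: value
        # Callables are evaluated against the frame under construction
        out[name] = value(out) if callable(value) else value
    return out

df = pd.DataFrame({'PetalWidth': [0.2, 0.4]})
result = my_assign(df, PetalWidthx2=lambda x: x['PetalWidth'] * 2)
```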


In addition to the above, there is also query, which takes the entire method input as a string: iris.query('SepalLength > 4.6'), but here the pandas documentation explicitly states that this is for special use cases

In the case of query, the string passed is an expression that will be compiled and executed by a backend that is generally much faster than executing Python code. It is a special case because the available operations are quite limited and the setup time for the backend engine is large, so it is really only useful for fairly large data sets.
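For a small frame the result is identical to an ordinary boolean mask, which is an easy way to see what query computes (tiny invented frame for illustration):

```python
import pandas as pd

df = pd.DataFrame({'SepalLength': [4.5, 5.0, 4.7]})

# query() compiles the string; the result matches the plain boolean mask
via_query = df.query('SepalLength > 4.6')
via_mask = df[df['SepalLength'] > 4.6]
```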

answered Oct 11 '22 by Stephen Rauch