plyr or dplyr in Python

This is more of a conceptual question; I do not have a specific problem.

I am learning Python for data analysis, but I am very familiar with R. One of the great things about R is plyr (and of course ggplot2), and even better, dplyr. Pandas has split-apply-combine as well; however, in R I can do things like the following (in dplyr; plyr is a bit different, and I can see now how dplyr mimics the . notation from object-oriented programming):

   data %.% group_by(c(.....)) %.% summarise(new1 = ...., new2 = ...., ..... newn=....)

in which I create multiple summary calculations at the same time.

How do I do that in Python? As far as I can tell,

df[...].groupby(.....).sum() only sums columns,

while in R I can have one mean, one sum, one special function, etc. in one call.

I realize I can do all my operations separately and merge them, and that is fine if I am using Python. But when it comes to choosing a tool, every line of code you do not have to type, check, and validate saves time.

In addition, dplyr lets you add mutate statements to the same chain, so it seems to me it is far more powerful. What am I missing about pandas or Python?

My goal is to learn. I have spent a lot of effort learning Python and it is a worthy investment, but the question remains.

user1617979 asked Nov 12 '14 02:11


2 Answers

I'm also a big fan of dplyr for R and am working to improve my knowledge of Pandas. Since you don't have a specific problem, I'd suggest checking out the post below that breaks down the entire introductory dplyr vignette and shows how all of it can be done with Pandas.

For example, the author demonstrates chaining with the pipe operator in R:

 flights %>%
   group_by(year, month, day) %>%
   select(arr_delay, dep_delay) %>%
   summarise(
      arr = mean(arr_delay, na.rm = TRUE),
      dep = mean(dep_delay, na.rm = TRUE)
       ) %>%
   filter(arr > 30 | dep > 30)

And here is the Pandas implementation:

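# flights is assumed to be a pandas DataFrame with the columns used below;
# the outer parentheses let the chain span several lines
(flights.groupby(['year', 'month', 'day'])
   [['arr_delay', 'dep_delay']]
   .mean()
   .query('arr_delay > 30 | dep_delay > 30'))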

The original post has many more comparisons of how to implement dplyr-like operations with Pandas: http://nbviewer.ipython.org/gist/TomAugspurger/6e052140eaa5fdb6e8c0
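
On the specific point of computing several different summaries in one call: pandas can do this with named aggregation in `.agg()` (available in reasonably recent pandas versions), and `.assign()` plays roughly the role of `mutate`. Here is a minimal sketch, assuming a small made-up frame; the column names and the `value_range` helper are invented for illustration and are not from the linked post.

```python
import pandas as pd

# Made-up data standing in for a real table; column names are illustrative.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [10, 20, 30, 40, 50],
})


def value_range(s):
    """Hypothetical custom summary: largest value minus smallest value."""
    return s.max() - s.min()


summary = (
    df.groupby("group")
      .agg(
          x_mean=("x", "mean"),         # like new1 = mean(x)
          y_sum=("y", "sum"),           # like new2 = sum(y)
          x_range=("x", value_range),   # any callable that takes a Series
      )
      # assign adds a derived column, roughly what mutate does in dplyr
      .assign(y_per_x=lambda d: d["y_sum"] / d["x_mean"])
      .reset_index()
)
print(summary)
```

Each keyword in `.agg()` names an output column and pairs an input column with a function, so a mean, a sum, and a custom function can sit side by side much like in `summarise`.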

lgallen answered Oct 04 '22 06:10


One could simply use dplyr from Python.

There is an interface to dplyr in rpy2 (introduced with rpy2-2.7.0) that lets you write things like:

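from rpy2.robjects.lib.dplyr import DataFrame

# mtcars is assumed to be an R data frame already available through rpy2
dataf = (DataFrame(mtcars).
         filter('gear>3').
         mutate(powertoweight='hp*36/wt').
         group_by('gear').
         summarize(mean_ptw='mean(powertoweight)'))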

There is an example in the documentation. This part of the documentation is also available as a Jupyter notebook; look for the links near the top of the page.

Another answer to the question compares R's dplyr and pandas (see @lgallen). The same R one-liner chaining dplyr statements is written essentially the same way in rpy2's interface to dplyr.

R:

flights %>%
   group_by(year, month, day) %>%
   select(arr_delay, dep_delay) %>%
   summarise(
      arr = mean(arr_delay, na.rm = TRUE),
      dep = mean(dep_delay, na.rm = TRUE)
      ) %>%
   filter(arr > 30 | dep > 30)

Python+rpy2:

(DataFrame(flights).
 group_by('year', 'month', 'day').
 select('arr_delay', 'dep_delay').
 summarize(arr = 'mean(arr_delay, na.rm=TRUE)',
           dep = 'mean(dep_delay, na.rm=TRUE)').
 filter('arr > 30 | dep > 30'))

lgautier answered Oct 04 '22 04:10