plyr or dplyr in Python

This is more of a conceptual question; I do not have a specific problem.

I am learning Python for data analysis, but I am very familiar with R. One of the great things about R is plyr (and of course ggplot2) and, even better, dplyr. Pandas has split-apply as well, of course, but in R I can do things like this (in dplyr; it is a bit different in plyr, and I can see now how dplyr mimics the . notation from object-oriented programming):

   data %.% group_by(c(.....)) %.% summarise(new1 = ...., new2 = ...., ..... newn=....)

in which I create multiple summary calculations at the same time.

How do I do that in Python, because

df[...].groupby(.....).sum()

only sums columns, while in R I can have one mean, one sum, one special function, etc. in one call.

I realize I can do all my operations separately and merge them, and that is fine if I am using Python, but when it comes down to choosing a tool, every line of code you do not have to type, check, and validate saves time that adds up.

In addition, in dplyr you can also add mutate statements, so it seems to me it is way more powerful. So what am I missing about pandas or Python?

My goal is to learn. I have spent a lot of effort learning Python and it is a worthy investment, but still the question remains.

641

asked Nov 12 '14 02:11

user1617979

2 Answers

I'm also a big fan of dplyr for R and am working to improve my knowledge of Pandas. Since you don't have a specific problem, I'd suggest checking out the post below that breaks down the entire introductory dplyr vignette and shows how all of it can be done with Pandas.

For example, the author demonstrates chaining with the pipe operator in R:

 flights %>%
   group_by(year, month, day) %>%
   select(arr_delay, dep_delay) %>%
   summarise(
      arr = mean(arr_delay, na.rm = TRUE),
      dep = mean(dep_delay, na.rm = TRUE)
       ) %>%
   filter(arr > 30 | dep > 30)

And here is the Pandas implementation:

# The outer parentheses let the chain span several lines.
(flights.groupby(['year', 'month', 'day'])
   [['arr_delay', 'dep_delay']]
   .mean()
   .query('arr_delay > 30 | dep_delay > 30'))
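
To try that chain without the full flights dataset, here is a minimal sketch on a tiny made-up frame. The values are invented for illustration and are not taken from the original post; only the column names match the example above.

```python
import pandas as pd

# Hypothetical stand-in for the flights data; all values are invented.
flights = pd.DataFrame({
    "year": [2013, 2013, 2013, 2013],
    "month": [1, 1, 1, 1],
    "day": [1, 1, 2, 2],
    "arr_delay": [10.0, 70.0, 5.0, None],
    "dep_delay": [20.0, 60.0, 15.0, 25.0],
})

# Parentheses around the whole expression allow the multi-line chain.
delayed = (
    flights.groupby(["year", "month", "day"])
    [["arr_delay", "dep_delay"]]
    .mean()  # skips missing values by default, similar to na.rm = TRUE
    .query("arr_delay > 30 | dep_delay > 30")
)
print(delayed)
```

Only the group for day 1 should survive the filter here, since both of its mean delays are above 30.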

There are many more comparisons of how to implement dplyr-like operations with Pandas in the original post: http://nbviewer.ipython.org/gist/TomAugspurger/6e052140eaa5fdb6e8c0
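
On the question's specific point (a different summary per column in one call, plus something like mutate), pandas offers `agg` with named aggregation (available since pandas 0.25, as far as I know) and `assign`. A rough sketch, where the frame, the column names, and the `ratio_range` summary are all made up for illustration:

```python
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "x": [1.0, 3.0, 5.0, 7.0],
    "y": [10, 20, 30, 40],
})

summary = (
    df.assign(ratio=lambda d: d["y"] / d["x"])  # roughly dplyr's mutate
    .groupby("group")
    .agg(                                       # roughly dplyr's summarise
        x_mean=("x", "mean"),                   # one mean
        y_sum=("y", "sum"),                     # one sum
        # one custom function: spread of the derived column within a group
        ratio_range=("ratio", lambda s: s.max() - s.min()),
    )
)
print(summary)
```

Each keyword in `agg` names an output column and pairs an input column with a function, so there is no need to compute the summaries separately and merge them afterwards.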

117

answered Oct 04 '22 06:10

lgallen

One could simply use dplyr from Python.

There is an interface to dplyr in rpy2 (introduced with rpy2-2.7.0) that lets you write things like:

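# DataFrame here is rpy2's dplyr wrapper, not pandas.DataFrame;
# mtcars is assumed to be already loaded as an R data frame.
from rpy2.robjects.lib.dplyr import DataFrame

dataf = (DataFrame(mtcars).
         filter('gear>3').
         mutate(powertoweight='hp*36/wt').
         group_by('gear').
         summarize(mean_ptw='mean(powertoweight)'))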

There is an example in the documentation. This part of the doc is (also) a Jupyter notebook. Look for the links near the top of the page.

Another answer to the question compares R's dplyr and pandas (see @lgallen). That same R one-liner chaining dplyr statements is written essentially the same way in rpy2's interface to dplyr.

flights %>%
   group_by(year, month, day) %>%
   select(arr_delay, dep_delay) %>%
   summarise(
      arr = mean(arr_delay, na.rm = TRUE),
      dep = mean(dep_delay, na.rm = TRUE)
      ) %>%
   filter(arr > 30 | dep > 30)

Python+rpy2:

(DataFrame(flights).
 group_by('year', 'month', 'day').
 select('arr_delay', 'dep_delay').
 summarize(arr = 'mean(arr_delay, na.rm=TRUE)',
           dep = 'mean(dep_delay, na.rm=TRUE)').
 filter('arr > 30 | dep > 30'))

answered Oct 04 '22 04:10

lgautier
