This is more of a conceptual question; I do not have a specific problem.
I am learning Python for data analysis, but I am very familiar with R. One of the great things about R is plyr (and of course ggplot2), and dplyr is even better. Pandas has split-apply-combine as well, but in R I can do things like the following (this is dplyr; it is a bit different in plyr, and I can now see how dplyr mimics the . notation from object-oriented programming):
data %.% group_by(c(.....)) %.% summarise(new1 = ...., new2 = ...., ..... newn=....)
in which I create multiple summary calculations in the same call.
How do I do that in Python? df[...].groupby(.....).sum() only sums columns,
while in R I can have one mean, one sum, one special function, and so on, all in a single call.
I realize I can do all my operations separately and merge the results, and that is fine if I am using Python, but when it comes down to choosing a tool, every line of code you do not have to type, check, and validate adds up in time.
In addition, dplyr also lets you add mutate statements in the same chain, so it seems far more powerful. So what am I missing about pandas or Python?
My goal is to learn. I have spent a lot of effort learning Python and it is a worthy investment, but the question still remains.
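As a point of reference for the answers that follow, pandas can express this directly. Here is a minimal, hedged sketch (df, key, a, and b are made-up names for illustration; the named-aggregation form of agg assumes pandas 0.25 or later):

import pandas as pd

df = pd.DataFrame({'key': ['x', 'x', 'y', 'y'],
                   'a': [1, 2, 3, 4],
                   'b': [10, 20, 30, 40]})

summary = (df.groupby('key')
             .agg(mean_a=('a', 'mean'),                          # one mean
                  sum_b=('b', 'sum'),                            # one sum
                  range_b=('b', lambda s: s.max() - s.min()))    # one custom function
             .reset_index()
             .assign(ratio=lambda d: d['sum_b'] / d['mean_a']))  # mutate-style derived column

agg accepts a different function for each output column in a single call, and assign plays the role of mutate.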
It was a mostly correct assumption, but both dplyr and plyr have summarize functions, while dplyr has group_by and plyr does not. If you load plyr in a later chunk and then rerun the expression shown in the question, summarize is assumed to be the one from the plyr namespace.
Dplython. The dplython package is dplyr for Python users. It provides extensive functionality for data preprocessing.
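A minimal sketch of what that looks like, based on the dplython README; the >> pipe, the X placeholder, and sift (dplython's filter) are the package's own names, but treat the exact API as an assumption since the package sees little maintenance:

from dplython import DplyFrame, X, diamonds, sift, group_by, summarize

# diamonds ships with dplython as a DplyFrame
(diamonds >>
  sift(X.carat > 1) >>                    # filter rows
  group_by(X.cut) >>
  summarize(mean_price=X.price.mean(),    # multiple summaries in one call
            mean_carat=X.carat.mean()))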
Winner: dplyr. This is another easy call in a pandas vs. dplyr comparison; the dplyr syntax is much cleaner and easier to read.
One of the great things about the R world has been a collection of R packages called tidyverse that are easy for beginners to learn and provide a consistent data manipulation and visualisation space. The value of these tools has been so great that many of them have been ported to Python.
I'm also a big fan of dplyr for R and am working to improve my knowledge of Pandas. Since you don't have a specific problem, I'd suggest checking out the post below that breaks down the entire introductory dplyr vignette and shows how all of it can be done with Pandas.
For example, the author demonstrates chaining with the pipe operator in R:
flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
And here is the Pandas implementation:
(flights.groupby(['year', 'month', 'day'])
        [['arr_delay', 'dep_delay']]
        .mean()
        .query('arr_delay > 30 | dep_delay > 30'))
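If you also want the summarised columns renamed to arr and dep, as in the dplyr version, one option is pandas named aggregation; a hedged sketch, assuming pandas 0.25 or later (mean skips NaN by default, matching na.rm = TRUE):

(flights
 .groupby(['year', 'month', 'day'])
 .agg(arr=('arr_delay', 'mean'),   # name the output columns directly
      dep=('dep_delay', 'mean'))
 .query('arr > 30 | dep > 30'))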
There are many more comparisons of how to implement dplyr-like operations with pandas in the original post: http://nbviewer.ipython.org/gist/TomAugspurger/6e052140eaa5fdb6e8c0
One could simply use dplyr from Python.
There is an interface to dplyr
in rpy2 (introduced with rpy2-2.7.0) that lets you write things like:
from rpy2.robjects.lib.dplyr import DataFrame   # requires rpy2 >= 2.7.0 and R with dplyr installed
from rpy2.robjects import r
mtcars = r('mtcars')   # R's built-in mtcars data set as an R data frame
dataf = (DataFrame(mtcars).
         filter('gear>3').
         mutate(powertoweight='hp*36/wt').
         group_by('gear').
         summarize(mean_ptw='mean(powertoweight)'))
There is an example in the documentation. That part of the documentation is (also) a Jupyter notebook; look for the links near the top of the page.
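If you want that result back as a pandas data frame, rpy2's pandas conversion layer can do it. A minimal sketch using the rpy2 2.x style API that matches the version mentioned above (rpy2 3.x replaces ri2py with rpy2py and converter contexts):

from rpy2.robjects import pandas2ri
pandas2ri.activate()          # enable R <-> pandas conversion (rpy2 2.x style)
pdf = pandas2ri.ri2py(dataf)  # dataf is the dplyr result built above
print(pdf.head())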
Another answer to the question compares R's dplyr and pandas (see @lgallen). That same R one-liner chaining dplyr statements is written in essentially the same way with rpy2's interface to dplyr.
R:
flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
Python+rpy2:
(DataFrame(flights).
 group_by('year', 'month', 'day').
 select('arr_delay', 'dep_delay').
 summarize(arr='mean(arr_delay, na.rm=TRUE)',
           dep='mean(dep_delay, na.rm=TRUE)').
 filter('arr > 30 | dep > 30'))