In R, when I have a data frame with, say, 4 columns (call it df)
and I want to compute the ratio of each value to the sum product of its group, I can do it like this:
    # generate data
    df = data.frame(a=c(1,1,0,1,0), b=c(1,0,0,1,0), c=c(10,5,1,5,10), d=c(3,1,2,1,2))

    | a  b  c   d |
    | 1  1  10  3 |
    | 1  0  5   1 |
    | 0  0  1   2 |
    | 1  1  5   1 |
    | 0  0  10  2 |

    # compute sum product ratio
    df = df %>% group_by(a, b) %>% mutate(ratio = c / sum(c * d))

    | a  b  c   d  ratio |
    | 1  1  10  3  0.286 |
    | 1  1  5   1  0.143 |
    | 1  0  5   1  1     |
    | 0  0  1   2  0.045 |
    | 0  0  10  2  0.454 |
But in Python I have to resort to loops. I know there should be a more elegant way than raw loops in Python. Does anyone have any ideas?
mutate() allows you to create new columns in the DataFrame. The new columns can be composed from existing columns. For example, let's create two new columns: one by dividing the distance column by 1000, and the other by concatenating the carrier and origin columns.
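In plain pandas, a rough analogue of mutate() is DataFrame.assign(). The sketch below builds the two columns described above. The flights data is made up for illustration, and the new column names distance_k and carrier_origin are hypothetical.

```python
import pandas as pd

# Hypothetical flight data; only the column names come from the text above.
flights = pd.DataFrame({
    "carrier": ["UA", "AA", "DL"],
    "origin": ["EWR", "JFK", "LGA"],
    "distance": [1400, 1089, 762],
})

# assign() returns a new DataFrame with the extra columns appended,
# much like mutate() does in dplyr.
flights = flights.assign(
    distance_k=lambda x: x["distance"] / 1000,            # distance in thousands
    carrier_origin=lambda x: x["carrier"] + x["origin"],  # string concatenation
)
```

The lambdas receive the data frame itself, so the same pattern works in the middle of a method chain.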
Dplython. The dplython package is dplyr for Python users. It provides dplyr-style functions for data preprocessing.
Great post, but pandas has functions very similar to dplyr's. If you use those instead, you get statements very similar to your dplyr statements, with the same readability.
Pandas definitely takes longer to get used to than the Tidyverse, but the payoff is that you get to use Python, which is a somewhat "deeper" language than R. R is great for interactive work, and for data munging jobs that don't interact too much with non-R libraries. However, Python is simply more versatile end-to-end.
It can be done with similar syntax using groupby() and apply():
df['ratio'] = df.groupby(['a','b'], group_keys=False).apply(lambda g: g.c/(g.c * g.d).sum())
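If apply() is slow on a large frame, one possible alternative is to compute the row-wise product first and broadcast the group sum back with transform(). This is a minimal sketch using the data from the question; the intermediate name group_total is only illustrative.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    "a": [1, 1, 0, 1, 0],
    "b": [1, 0, 0, 1, 0],
    "c": [10, 5, 1, 5, 10],
    "d": [3, 1, 2, 1, 2],
})

# Row-wise product c*d, summed within each (a, b) group.
# transform("sum") returns a Series aligned with the original index,
# so every row carries its own group's total.
group_total = (df["c"] * df["d"]).groupby([df["a"], df["b"]]).transform("sum")

df["ratio"] = df["c"] / group_total
```

Unlike the R output above, the rows stay in their original order, because transform() preserves the index.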