Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Julia DataFrames.jl, Groupby and summing multiple columns

I am wondering how to use the "by" function to group by and sum multiple columns. If I wanted to groupby one column I can do it by this

someData = DataFrame(:Countries => ["Afganistan","Albainia","Albainia","Andorra","Angola","Angola"],:population => rand(100:1000,6), :GDP => rand(1:100,6))

by(someData, :Countries, df ->DataFrame(pop_sum = sum(df[:population])))

However, I wanted to get the sum of both population and GDP. I tried something like below which is of course incorrect. Any ideas?

by(someData, :Countries, df ->DataFrame(pop_sum, GDP_sum = sum(df[[:population,:GDP]])))
like image 515
imantha Avatar asked Mar 06 '26 03:03

imantha


1 Answers

Do not use by function as it is deprecated. Instead use this (you do not see the warning as probably you are starting Julia with --depwarn set to no which is the default):

julia> someData = DataFrame(:Countries => ["Afganistan","Albainia","Albainia","Andorra","Angola","Angola"],
                            :population => rand(100:1000,6),
                            :GDP => rand(1:100,6))
6×3 DataFrame
│ Row │ Countries  │ population │ GDP   │
│     │ String     │ Int64      │ Int64 │
├─────┼────────────┼────────────┼───────┤
│ 1   │ Afganistan │ 543        │ 29    │
│ 2   │ Albainia   │ 853        │ 71    │
│ 3   │ Albainia   │ 438        │ 81    │
│ 4   │ Andorra    │ 860        │ 88    │
│ 5   │ Angola     │ 940        │ 64    │
│ 6   │ Angola     │ 688        │ 40    │

julia> combine(groupby(someData, :Countries), [:population, :GDP] .=> sum)
4×3 DataFrame
│ Row │ Countries  │ population_sum │ GDP_sum │
│     │ String     │ Int64          │ Int64   │
├─────┼────────────┼────────────────┼─────────┤
│ 1   │ Afganistan │ 543            │ 29      │
│ 2   │ Albainia   │ 1291           │ 152     │
│ 3   │ Andorra    │ 860            │ 88      │
│ 4   │ Angola     │ 1628           │ 104     │

The alternative way to write it would be:

julia> combine(groupby(someData, :Countries)) do sdf
       return (population_sum = sum(sdf.population), GDP_sum=sum(sdf.GDP))
       end
4×3 DataFrame
│ Row │ Countries  │ population_sum │ GDP_sum │
│     │ String     │ Int64          │ Int64   │
├─────┼────────────┼────────────────┼─────────┤
│ 1   │ Afganistan │ 543            │ 29      │
│ 2   │ Albainia   │ 1291           │ 152     │
│ 3   │ Andorra    │ 860            │ 88      │
│ 4   │ Angola     │ 1628           │ 104     │

but it is more verbose in this case (it would be useful if you wanted to do more complex preprocessing of the data before returning the value).

like image 106
Bogumił Kamiński Avatar answered Mar 08 '26 02:03

Bogumił Kamiński



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!