Groupby with sum on a Julia DataFrame

I am trying to do a groupby + sum on a Julia DataFrame that has both Int and String columns.

For instance, df:

│ Row │ A      │ B      │ C     │ D      │
│     │ String │ String │ Int64 │ String │
├─────┼────────┼────────┼───────┼────────┤
│ 1   │ x1     │ a      │ 12    │ green  │
│ 2   │ x2     │ a      │ 7     │ blue   │
│ 3   │ x1     │ b      │ 5     │ red    │
│ 4   │ x2     │ a      │ 4     │ blue   │
│ 5   │ x1     │ b      │ 9     │ yellow │

In Python (pandas), the command would be:

df_group = df.groupby(['A', 'B']).sum().reset_index()

This gives the following result, keeping the original column labels:

    A  B   C
0  x1  a  12
1  x1  b  14
2  x2  a  11
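
To make explicit what that one-liner is expected to do, here is a minimal sketch of the same group-and-sum using only the Python standard library. The helper name group_sum and the list-of-dicts layout are assumptions made for illustration, not part of pandas. Note that, depending on the pandas version, the string column D may not be dropped automatically and may have to be excluded explicitly (for example with numeric_only=True).

```python
from collections import defaultdict
from numbers import Number

# Rows mirror the example table: columns A, B, C, D.
rows = [
    {"A": "x1", "B": "a", "C": 12, "D": "green"},
    {"A": "x2", "B": "a", "C": 7, "D": "blue"},
    {"A": "x1", "B": "b", "C": 5, "D": "red"},
    {"A": "x2", "B": "a", "C": 4, "D": "blue"},
    {"A": "x1", "B": "b", "C": 9, "D": "yellow"},
]


def group_sum(rows, keys):
    """Group rows by `keys` and sum every remaining numeric column.

    Hypothetical helper for illustration only.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        group = tuple(row[k] for k in keys)
        for col, value in row.items():
            # Skip the grouping columns and non-numeric columns such as D.
            if col not in keys and isinstance(value, Number):
                totals[group][col] += value
    # Flatten back to one record per group, sorted by the group key.
    return [dict(zip(keys, group), **sums) for group, sums in sorted(totals.items())]


result = group_sum(rows, ["A", "B"])
for record in result:
    print(record)
```

The important detail is the numeric check: only C is summed, and D is left out of the result instead of being added up.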

I would like to do the same thing in Julia. I tried the following, without success:

df_group = aggregate(df, ["A", "B"], sum)

MethodError: no method matching +(::String, ::String)

Do you have any idea how to do this in Julia?

asked Oct 06 '20 13:10 by Bebio



2 Answers

Try the following. Rather than summing every non-string column, you probably want to sum only the columns that are numeric:

numcols = names(df, findall(x -> eltype(x) <: Number, eachcol(df)))
combine(groupby(df, ["A", "B"]), numcols .=> sum .=> numcols)

If you also want to allow missing values (and skip them when summing), use:

numcols = names(df, findall(x -> eltype(x) <: Union{Missing,Number}, eachcol(df)))
combine(groupby(df, ["A", "B"]), numcols .=> sum∘skipmissing .=> numcols)
answered Oct 07 '22 12:10 by Bogumił Kamiński


DataFrames.jl supports split-apply-combine logic, similar to pandas, so the aggregation looks like this:

using DataFrames

df = DataFrame(:A => ["x1", "x2", "x1", "x2", "x1"], 
               :B => ["a", "a", "b", "a", "b"],
               :C => [12, 7, 5, 4, 9],
               :D => ["green", "blue", "red", "blue", "yellow"])

gdf = groupby(df, [:A, :B])
combine(gdf, :C => sum)

with the result:

julia> combine(gdf, :C => sum)
3×3 DataFrame
│ Row │ A      │ B      │ C_sum │
│     │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1   │ x1     │ a      │ 12    │
│ 2   │ x2     │ a      │ 11    │
│ 3   │ x1     │ b      │ 14    │

You can skip creating the intermediate gdf with the help of Pipe.jl or Underscores.jl:

using Underscores

@_ groupby(df, [:A, :B]) |> combine(__, :C => sum)

You can name the new column with the following syntax:

julia> @_ groupby(df, [:A, :B]) |> combine(__, :C => sum => :C)
3×3 DataFrame
│ Row │ A      │ B      │ C     │
│     │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1   │ x1     │ a      │ 12    │
│ 2   │ x2     │ a      │ 11    │
│ 3   │ x1     │ b      │ 14    │
answered Oct 07 '22 12:10 by Andrej Oskin