Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find mean value of rows of a data frame based on values in a column in Julia?

Tags:

julia

I have the following data frame in Julia.

using DataFrames 
data = DataFrame(Value = [23, 56, 10, 48, 51], Type = ["A", "B", "A", "B", "B"])

5×2 DataFrame
│ Row │ Value │ Type   │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 23    │ A      │
│ 2   │ 56    │ B      │
│ 3   │ 10    │ A      │
│ 4   │ 48    │ B      │
│ 5   │ 51    │ B      │

How to get the mean of the column Value based on the column Type?

like image 982
Qwerty Avatar asked Nov 23 '25 14:11

Qwerty


1 Answers

If you want performance consider the following options

julia> using DataFrames

julia> using Statistics

julia> using BenchmarkTools

julia> data = DataFrame(Value = rand(1:10, 10^6),
                        Type = categorical(rand(["A", "B"], 10^6)));

Note that I generate :Type column as categorical, as this will be much faster to aggregate later.

First a timing from the answer above:

julia> @benchmark by($data, [:Type], df -> mean(df[:, :Value]))
BenchmarkTools.Trial: 
  memory estimate:  30.53 MiB
  allocs estimate:  212
  --------------
  minimum time:     12.173 ms (0.00% GC)
  median time:      13.305 ms (3.63% GC)
  mean time:        14.229 ms (4.30% GC)
  maximum time:     20.491 ms (2.98% GC)
  --------------
  samples:          352
  evals/sample:     1

Here is a timing where I change df[:, :Value] to df.Value. The difference is that df.Value does not copy data unnecessarily. You can see that already you save over 10% of run time:

julia> @benchmark by($data, :Type, df -> mean(df.Value))
BenchmarkTools.Trial: 
  memory estimate:  22.90 MiB
  allocs estimate:  203
  --------------
  minimum time:     10.926 ms (0.00% GC)
  median time:      13.151 ms (1.92% GC)
  mean time:        13.093 ms (3.53% GC)
  maximum time:     16.933 ms (3.25% GC)
  --------------
  samples:          382
  evals/sample:     1

And here is an efficient way to write it. This statement means that we pass column :Value to a function mean:

julia> @benchmark by($data, :Type, :Value => mean)
BenchmarkTools.Trial: 
  memory estimate:  15.27 MiB
  allocs estimate:  190
  --------------
  minimum time:     8.326 ms (0.00% GC)
  median time:      8.667 ms (0.00% GC)
  mean time:        9.599 ms (2.74% GC)
  maximum time:     17.364 ms (3.57% GC)
  --------------
  samples:          521
  evals/sample:     1

To finalize let us just check the difference if :Value is a Vector{String} (the approach that is given in the other answer):

julia> data.Type = String.(data.Type);

julia> @benchmark by($data, [:Type], df -> mean(df[:, :Value]))
BenchmarkTools.Trial: 
  memory estimate:  46.16 MiB
  allocs estimate:  197
  --------------
  minimum time:     26.664 ms (2.08% GC)
  median time:      27.197 ms (2.11% GC)
  mean time:        27.486 ms (2.11% GC)
  maximum time:     35.740 ms (1.64% GC)
  --------------
  samples:          182
  evals/sample:     1

And you can see that it is around three times slower than the recommended answer. Also note that:

julia> by(data, :Type, :Value => mean)
2×2 DataFrame
│ Row │ Type   │ Value_mean │
│     │ String │ Float64    │
├─────┼────────┼────────────┤
│ 1   │ B      │ 5.50175    │
│ 2   │ A      │ 5.49524    │

produces a nicer default name for the generated column (as it knows the source column name and the transformation function name).

like image 197
Bogumił Kamiński Avatar answered Nov 25 '25 09:11

Bogumił Kamiński



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!