I have the following data frame in Julia.
using DataFrames
data = DataFrame(Value = [23, 56, 10, 48, 51], Type = ["A", "B", "A", "B", "B"])
5×2 DataFrame
│ Row │ Value │ Type │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 23 │ A │
│ 2 │ 56 │ B │
│ 3 │ 10 │ A │
│ 4 │ 48 │ B │
│ 5 │ 51 │ B │
How to get the mean of the column Value based on the column Type?
If you want performance consider the following options
julia> using DataFrames
julia> using Statistics
julia> using BenchmarkTools
julia> data = DataFrame(Value = rand(1:10, 10^6),
Type = categorical(rand(["A", "B"], 10^6)));
Note that I generate :Type column as categorical, as this will be much faster to aggregate later.
First a timing from the answer above:
julia> @benchmark by($data, [:Type], df -> mean(df[:, :Value]))
BenchmarkTools.Trial:
memory estimate: 30.53 MiB
allocs estimate: 212
--------------
minimum time: 12.173 ms (0.00% GC)
median time: 13.305 ms (3.63% GC)
mean time: 14.229 ms (4.30% GC)
maximum time: 20.491 ms (2.98% GC)
--------------
samples: 352
evals/sample: 1
Here is a timing where I change df[:, :Value] to df.Value. The difference is that df.Value does not copy data unnecessarily. You can see that already you save over 10% of run time:
julia> @benchmark by($data, :Type, df -> mean(df.Value))
BenchmarkTools.Trial:
memory estimate: 22.90 MiB
allocs estimate: 203
--------------
minimum time: 10.926 ms (0.00% GC)
median time: 13.151 ms (1.92% GC)
mean time: 13.093 ms (3.53% GC)
maximum time: 16.933 ms (3.25% GC)
--------------
samples: 382
evals/sample: 1
And here is an efficient way to write it. This statement means that we pass column :Value to a function mean:
julia> @benchmark by($data, :Type, :Value => mean)
BenchmarkTools.Trial:
memory estimate: 15.27 MiB
allocs estimate: 190
--------------
minimum time: 8.326 ms (0.00% GC)
median time: 8.667 ms (0.00% GC)
mean time: 9.599 ms (2.74% GC)
maximum time: 17.364 ms (3.57% GC)
--------------
samples: 521
evals/sample: 1
To finalize let us just check the difference if :Value is a Vector{String} (the approach that is given in the other answer):
julia> data.Type = String.(data.Type);
julia> @benchmark by($data, [:Type], df -> mean(df[:, :Value]))
BenchmarkTools.Trial:
memory estimate: 46.16 MiB
allocs estimate: 197
--------------
minimum time: 26.664 ms (2.08% GC)
median time: 27.197 ms (2.11% GC)
mean time: 27.486 ms (2.11% GC)
maximum time: 35.740 ms (1.64% GC)
--------------
samples: 182
evals/sample: 1
And you can see that it is around three times slower than the recommended answer. Also note that:
julia> by(data, :Type, :Value => mean)
2×2 DataFrame
│ Row │ Type │ Value_mean │
│ │ String │ Float64 │
├─────┼────────┼────────────┤
│ 1 │ B │ 5.50175 │
│ 2 │ A │ 5.49524 │
produces a nicer default name for the generated column (as it knows the source column name and the transformation function name).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With