I'm wondering if there is an efficient way to do the following in Julia:
I have a DataFrame of the following form:
julia> df1 = DataFrame(var1=["a","a","a","b","b","b","c","c","c"],
var2=["p","q","r","p","p","r","q","p","p"],
var3=[1,2,3,2,5,4,6,7,8])
9×3 DataFrame
│ Row │ var1 │ var2 │ var3 │
│ │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1 │ a │ p │ 1 │
│ 2 │ a │ q │ 2 │
│ 3 │ a │ r │ 3 │
│ 4 │ b │ p │ 2 │
│ 5 │ b │ p │ 5 │
│ 6 │ b │ r │ 4 │
│ 7 │ c │ q │ 6 │
│ 8 │ c │ p │ 7 │
│ 9 │ c │ p │ 8 │
And I want to return a DataFrame that contains the same columns but only the rows where var3
has its minimum value within groups according to var1
.
I have tried using the split-apply-combine approach but couldn't seem to find a way to filter the rows while returning all columns.
Appreciate any help on this.
An alternative way to do it if you do not have duplicates in :var3
per group is:
julia> combine(sdf -> sdf[argmin(sdf.var3), :], groupby(df1, :var1))
3×3 DataFrame
Row │ var1 var2 var3
│ String String Int64
─────┼───────────────────────
1 │ a p 1
2 │ b p 2
3 │ c q 6
If you may have duplicates then use:
julia> combine(sdf -> filter(:var3 => ==(minimum(sdf.var3)), sdf), groupby(df1, :var1))
3×3 DataFrame
Row │ var1 var2 var3
│ String String Int64
─────┼───────────────────────
1 │ a p 1
2 │ b p 2
3 │ c q 6
instead.
Another example handling duplicates correctly is:
julia> combine(sdf -> first(groupby(sdf, :var3, sort=true)), groupby(df1, :var1))
3×3 DataFrame
Row │ var1 var2 var3
│ String String Int64
─────┼───────────────────────
1 │ a p 1
2 │ b p 2
3 │ c q 6
it is not very efficient in this case but shows you how you can work with groupby
in DataFrames.jl.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With