Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select rows of a DataFrame containing minimum of grouping variable in Julia

I'm wondering if there is an efficient way to do the following in Julia:

I have a DataFrame of the following form:

julia> df1 = DataFrame(var1=["a","a","a","b","b","b","c","c","c"],
                var2=["p","q","r","p","p","r","q","p","p"],
                var3=[1,2,3,2,5,4,6,7,8])
9×3 DataFrame
 Row  var1    var2    var3  
      String  String  Int64 
├─────┼────────┼────────┼───────┤
 1    a       p       1     
 2    a       q       2     
 3    a       r       3     
 4    b       p       2     
 5    b       p       5     
 6    b       r       4     
 7    c       q       6     
 8    c       p       7     
 9    c       p       8     

And I want to return a DataFrame that contains the same columns but only the rows where var3 has its minimum value within groups according to var1.

I have tried using the split-apply-combine approach but couldn't seem to find a way to filter the rows while returning all columns.

Appreciate any help on this.

like image 490
Kayvon Coffey Avatar asked Dec 30 '22 18:12

Kayvon Coffey


1 Answers

An alternative way to do it if you do not have duplicates in :var3 per group is:

julia> combine(sdf -> sdf[argmin(sdf.var3), :], groupby(df1, :var1))
3×3 DataFrame
 Row  var1    var2    var3
      String  String  Int64
─────┼───────────────────────
   1  a       p           1
   2  b       p           2
   3  c       q           6

If you may have duplicates then use:

julia> combine(sdf -> filter(:var3 => ==(minimum(sdf.var3)), sdf), groupby(df1, :var1))
3×3 DataFrame
 Row  var1    var2    var3
      String  String  Int64
─────┼───────────────────────
   1  a       p           1
   2  b       p           2
   3  c       q           6

instead.

Another example handling duplicates correctly is:

julia> combine(sdf -> first(groupby(sdf, :var3, sort=true)), groupby(df1, :var1))
3×3 DataFrame
 Row  var1    var2    var3
      String  String  Int64
─────┼───────────────────────
   1  a       p           1
   2  b       p           2
   3  c       q           6

it is not very efficient in this case but shows you how you can work with groupby in DataFrames.jl.

like image 130
Bogumił Kamiński Avatar answered Jan 14 '23 06:01

Bogumił Kamiński