I'm trying to apply a custom sorting algorithm to a bunch of subdataframes in order to make some plots. With the help of this question, I'm able to sort my dataframe with a custom order:
julia> using DataFrames
julia> df = DataFrame(x = rand(10), y = rand([:low, :med, :high], 10), z = rand([:a, :b], 10))
10×3 DataFrames.DataFrame
│ Row │ x │ y │ z │
├─────┼───────────┼──────┼───┤
│ 1 │ 0.436891 │ low │ b │
│ 2 │ 0.370725 │ high │ b │
│ 3 │ 0.521269 │ low │ b │
│ 4 │ 0.071102 │ high │ a │
│ 5 │ 0.969407 │ high │ a │
│ 6 │ 0.0416023 │ med │ b │
│ 7 │ 0.63486 │ med │ b │
│ 8 │ 0.4352 │ high │ b │
│ 9 │ 0.626739 │ low │ b │
│ 10 │ 0.151149 │ low │ a │
julia> o = [:low, :med, :high]
3-element Array{Symbol,1}:
:low
:med
:high
julia> custom_sort(x,y) = findfirst(o, x) < findfirst(o, y)
custom_sort (generic function with 1 method)
julia> sort!(df, cols=[:y], lt=custom_sort)
10×3 DataFrames.DataFrame
│ Row │ x │ y │ z │
├─────┼───────────┼──────┼───┤
│ 1 │ 0.436891 │ low │ b │
│ 2 │ 0.521269 │ low │ b │
│ 3 │ 0.626739 │ low │ b │
│ 4 │ 0.151149 │ low │ a │
│ 5 │ 0.0416023 │ med │ b │
│ 6 │ 0.63486 │ med │ b │
│ 7 │ 0.370725 │ high │ b │
│ 8 │ 0.071102 │ high │ a │
│ 9 │ 0.969407 │ high │ a │
│ 10 │ 0.4352 │ high │ b │
and it works great. The trouble is, when I then do a groupby(), the custom sorting gets lost:
julia> groupby(df, [:y, :z])
DataFrames.GroupedDataFrame 5 groups with keys: Symbol[:y, :z]
First Group:
2×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ x │ y │ z │
├─────┼──────────┼──────┼───┤
│ 1 │ 0.071102 │ high │ a │
│ 2 │ 0.969407 │ high │ a │
⋮
Last Group:
2×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ x │ y │ z │
├─────┼───────────┼─────┼───┤
│ 1 │ 0.0416023 │ med │ b │
│ 2 │ 0.63486 │ med │ b │
Is there a way I can sort the SubDataFrames so that, e.g., the first group has y == :low and z == :a?
groupby takes advantage of the PooledDataArray machinery to split the DataFrame into groups. When a PooledDataArray is created from a vector, the order of the values is not kept unless it is specified in the constructor. It is possible to trick groupby by converting the columns to PooledDataArrays with the desired order beforehand. In code:
julia> df[:y] = PooledDataArray(df[:y],[:low,:med,:high])
julia> df[:z] = PooledDataArray(df[:z],[:a,:b])
julia> groupby(df, [:y, :z])
DataFrames.GroupedDataFrame 6 groups with keys: Symbol[:y, :z]
First Group:
1×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ x │ y │ z │
├─────┼──────────┼─────┼───┤
│ 1 │ 0.833255 │ low │ a │
⋮
Last Group:
1×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ x │ y │ z │
├─────┼──────────┼──────┼───┤
│ 1 │ 0.604117 │ high │ b │
This can also be automated for more columns or columns with more values with the following loop:
for v in [:y,:z]
df[v] = PooledDataArray(df[v],unique(Vector(df[v])))
end
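which does the same as the explicit assignments earlier, provided each column's values first appear in the desired order, since unique keeps values in order of first appearance.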
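To see why the row order matters for the loop, here is a minimal Base-only sketch (no DataFrames calls) using the y values from the question. The helper name rank_of is hypothetical, and the predicate-first findfirst(isequal(v), o) form assumes Julia 1.0 or later, whereas the transcripts above use the older findfirst(o, x) argument order.

```julia
# The y column from the question, in its original (unsorted) row order.
y = [:low, :high, :low, :high, :high, :med, :med, :high, :low, :low]

# unique keeps values in order of first appearance, not in the desired order.
first_seen = unique(y)

# Desired order, as in the question.
o = [:low, :med, :high]

# Hypothetical helper: position of a value within the desired order.
rank_of(v) = findfirst(isequal(v), o)

# After sorting by the custom order, first appearance matches the desired order.
sorted_y = sort(y, by = rank_of)
levels_after_sort = unique(sorted_y)

println(first_seen)         # order of first appearance in the unsorted data
println(levels_after_sort)  # order of first appearance after the custom sort
```

So unique only yields the intended order for a column if that column's values already first appear in that order, for example after the custom sort! on :y.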