I am trying to split the dataset into train and test subsets in Julia. So far, I have tried using MLDataUtils.jl package for this operation, however, the results are not up to the expectations. Below are my findings and issues:
Code
# the inputs are
a = DataFrame(A = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
              B = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
              C = [1, 2, 3, 4,5, 6, 7, 8, 9, 10]
             )
b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
using MLDataUtils
(x1, y1), (x2, y2) = stratifiedobs((a,b), p=0.7)
#Output of this operation is: (which is not the expectation)
println("x1 is: $x1")
x1 is:
10×3 DataFrame
│ Row │ A     │ B     │ C     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 1     │
│ 2   │ 2     │ 2     │ 2     │
│ 3   │ 3     │ 3     │ 3     │
│ 4   │ 4     │ 4     │ 4     │
│ 5   │ 5     │ 5     │ 5     │
│ 6   │ 6     │ 6     │ 6     │
│ 7   │ 7     │ 7     │ 7     │
│ 8   │ 8     │ 8     │ 8     │
│ 9   │ 9     │ 9     │ 9     │
│ 10  │ 10    │ 10    │ 10    │
println("y1 is: $y1")
y1 is:
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
# but x2 is printed as 
(0×3 SubDataFrame, Float64[]) 
# while y2 as 
0-element view(::Array{Float64,1}, Int64[]) with eltype Float64)
However, I would like this dataset to be split in 2 parts with 70% data in train and 30% in test. Please suggest a better approach to perform this operation in julia. Thanks in advance.
Probably MLJ.jl developers can show you how to do it using the general ecosystem. Here is a solution using DataFrames.jl only:
julia> using DataFrames, Random
julia> a = DataFrame(A = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
                     B = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
                     C = [1, 2, 3, 4,5, 6, 7, 8, 9, 10]
                    )
10×3 DataFrame
 Row │ A      B      C     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      1
   2 │     2      2      2
   3 │     3      3      3
   4 │     4      4      4
   5 │     5      5      5
   6 │     6      6      6
   7 │     7      7      7
   8 │     8      8      8
   9 │     9      9      9
  10 │    10     10     10
julia> function splitdf(df, pct)
           @assert 0 <= pct <= 1
           ids = collect(axes(df, 1))
           shuffle!(ids)
           sel = ids .<= nrow(df) .* pct
           return view(df, sel, :), view(df, .!sel, :)
       end
splitdf (generic function with 1 method)
julia> splitdf(a, 0.7)
(7×3 SubDataFrame
 Row │ A      B      C     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     3      3      3
   2 │     4      4      4
   3 │     6      6      6
   4 │     7      7      7
   5 │     8      8      8
   6 │     9      9      9
   7 │    10     10     10, 3×3 SubDataFrame
 Row │ A      B      C     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      1
   2 │     2      2      2
   3 │     5      5      5)
I am using views to save memory, but alternatively you could just materialize train and test data frames if you prefer this.
This is how I did implement it for generic arrays in the Beta Machine Learning Toolkit:
"""
    partition(data,parts;shuffle=true)
Partition (by rows) one or more matrices according to the shares in `parts`.
# Parameters
* `data`: A matrix/vector or a vector of matrices/vectors
* `parts`: A vector of the required shares (must sum to 1)
* `shufle`: Wheter to randomly shuffle the matrices (preserving the relative order between matrices)
 """
function partition(data::AbstractArray{T,1},parts::AbstractArray{Float64,1};shuffle=true) where T <: AbstractArray
        n = size(data[1],1)
        if !all(size.(data,1) .== n)
            @error "All matrices passed to `partition` must have the same number of rows"
        end
        ridx = shuffle ? Random.shuffle(1:n) : collect(1:n)
        return partition.(data,Ref(parts);shuffle=shuffle, fixedRIdx = ridx)
end
function partition(data::AbstractArray{T,N} where N, parts::AbstractArray{Float64,1};shuffle=true,fixedRIdx=Int64[]) where T
    n = size(data,1)
    nParts = size(parts)
    toReturn = []
    if !(sum(parts) ≈ 1)
        @error "The sum of `parts` in `partition` should total to 1."
    end
    ridx = fixedRIdx
    if (isempty(ridx))
       ridx = shuffle ? Random.shuffle(1:n) : collect(1:n)
    end
    current = 1
    cumPart = 0.0
    for (i,p) in enumerate(parts)
        cumPart += parts[i]
        final = i == nParts ? n : Int64(round(cumPart*n))
        push!(toReturn,data[ridx[current:final],:])
        current = (final +=1)
    end
    return toReturn
end
Use it with:
julia> x = [1:10 11:20]
julia> y = collect(31:40)
julia> ((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.7,0.3])
Ore that you can partition also in three or more parts, and the number of arrays to partition also is variable.
By default they are also shuffled, but you can avoid it with the parameter shuffle...
using Pkg Pkg.add("Lathe") using Lathe.preprocess: TrainTestSplit train, test = TrainTestSplit(df)
There is also a positional argument, at in the second position that takes a percentage to split at.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With