
Splitting datasets into train and test in Julia

I am trying to split a dataset into train and test subsets in Julia. So far I have tried the MLDataUtils.jl package for this, but the results are not what I expected. Below are my findings and issues:

Code

using DataFrames, MLDataUtils

# the inputs are

a = DataFrame(A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              B = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              C = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
             )
b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

(x1, y1), (x2, y2) = stratifiedobs((a, b), p = 0.7)

# Output of this operation (which is not what I expected):
println("x1 is: $x1")
x1 is:
10×3 DataFrame
│ Row │ A     │ B     │ C     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 1     │
│ 2   │ 2     │ 2     │ 2     │
│ 3   │ 3     │ 3     │ 3     │
│ 4   │ 4     │ 4     │ 4     │
│ 5   │ 5     │ 5     │ 5     │
│ 6   │ 6     │ 6     │ 6     │
│ 7   │ 7     │ 7     │ 7     │
│ 8   │ 8     │ 8     │ 8     │
│ 9   │ 9     │ 9     │ 9     │
│ 10  │ 10    │ 10    │ 10    │

println("y1 is: $y1")
y1 is:
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

# but x2 is printed as 
(0×3 SubDataFrame, Float64[]) 

# while y2 as 
0-element view(::Array{Float64,1}, Int64[]) with eltype Float64)

However, I would like this dataset to be split into two parts, with 70% of the data in train and 30% in test. Please suggest a better approach to perform this operation in Julia. Thanks in advance.

Mohammad Saad asked Feb 05 '21 07:02




3 Answers

Probably MLJ.jl developers can show you how to do it using the general ecosystem. Here is a solution using DataFrames.jl only:

julia> using DataFrames, Random

julia> a = DataFrame(A = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
                     B = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
                     C = [1, 2, 3, 4,5, 6, 7, 8, 9, 10]
                    )
10×3 DataFrame
 Row │ A      B      C     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      1
   2 │     2      2      2
   3 │     3      3      3
   4 │     4      4      4
   5 │     5      5      5
   6 │     6      6      6
   7 │     7      7      7
   8 │     8      8      8
   9 │     9      9      9
  10 │    10     10     10

julia> function splitdf(df, pct)
           @assert 0 <= pct <= 1
           ids = collect(axes(df, 1))
           shuffle!(ids)
           sel = ids .<= nrow(df) .* pct
           return view(df, sel, :), view(df, .!sel, :)
       end
splitdf (generic function with 1 method)

julia> splitdf(a, 0.7)
(7×3 SubDataFrame
 Row │ A      B      C     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     3      3      3
   2 │     4      4      4
   3 │     6      6      6
   4 │     7      7      7
   5 │     8      8      8
   6 │     9      9      9
   7 │    10     10     10, 3×3 SubDataFrame
 Row │ A      B      C     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      1
   2 │     2      2      2
   3 │     5      5      5)

I am using views to save memory, but alternatively you could just materialize train and test data frames if you prefer this.
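To see why the mask in splitdf picks exactly the requested share, here is a minimal sketch of the same idea on a plain matrix, using only the Random standard library. The name splitrows and the rng keyword are my own additions for illustration and are not part of DataFrames.jl; this version also copies the rows instead of returning views.

```julia
using Random

# Hypothetical helper (not from any package): the shuffled-ids mask applied to a matrix
function splitrows(X::AbstractMatrix, pct; rng = Random.default_rng())
    @assert 0 <= pct <= 1
    n = size(X, 1)
    ids = shuffle!(rng, collect(1:n))   # random permutation of 1:n
    sel = ids .<= n * pct               # true wherever the shuffled id falls in the first share
    return X[sel, :], X[.!sel, :]       # Boolean indexing copies the selected rows
end

X = [1:10 11:20]
train, test = splitrows(X, 0.7; rng = MersenneTwister(42))
println(size(train), " ", size(test))
```

Since ids is a permutation of 1:n, exactly floor(n * pct) of its entries pass the comparison, so the train part always has that many rows. The split goes through a Boolean mask, so rows keep their original relative order within each part; only the membership is random.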

Bogumił Kamiński answered Oct 16 '22 13:10


This is how I implemented it for generic arrays in the Beta Machine Learning Toolkit:

using Random

"""
    partition(data, parts; shuffle=true)

Partition (by rows) one or more matrices according to the shares in `parts`.

# Parameters
* `data`: A matrix/vector or a vector of matrices/vectors
* `parts`: A vector of the required shares (must sum to 1)
* `shuffle`: Whether to randomly shuffle the matrices (preserving the relative order between matrices)
"""
function partition(data::AbstractArray{T,1}, parts::AbstractArray{Float64,1}; shuffle=true) where T <: AbstractArray
    n = size(data[1], 1)
    if !all(size.(data, 1) .== n)
        error("All matrices passed to `partition` must have the same number of rows")
    end
    # Draw one row permutation and reuse it for every array so that rows stay aligned
    ridx = shuffle ? Random.shuffle(1:n) : collect(1:n)
    return partition.(data, Ref(parts); shuffle=shuffle, fixedRIdx=ridx)
end

function partition(data::AbstractArray{T,N} where N, parts::AbstractArray{Float64,1}; shuffle=true, fixedRIdx=Int64[]) where T
    n = size(data, 1)
    nParts = length(parts)   # number of parts to produce
    toReturn = []
    if !(sum(parts) ≈ 1)
        error("The sum of `parts` in `partition` should total to 1.")
    end
    ridx = fixedRIdx
    if isempty(ridx)
        ridx = shuffle ? Random.shuffle(1:n) : collect(1:n)
    end
    current = 1
    cumPart = 0.0
    for (i, p) in enumerate(parts)
        cumPart += p
        # The last part always ends at row n, so rounding cannot drop rows
        final = i == nParts ? n : Int64(round(cumPart * n))
        push!(toReturn, data[ridx[current:final], :])
        current = final + 1
    end
    return toReturn
end

Use it with:

julia> x = [1:10 11:20]
julia> y = collect(31:40)
julia> ((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.7,0.3])

Note that you can also partition into three or more parts, and the number of arrays to partition is also variable.

By default the rows are also shuffled, but you can avoid this by passing shuffle=false.
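To make the three-or-more-parts case concrete, here is a small sketch of just the boundary arithmetic: the cumulative share is rounded to a row index, and the last part is pinned to the final row. The helper name part_ranges is hypothetical and is not part of the toolkit; it only reports which row positions each share would receive.

```julia
# Hypothetical helper: row ranges that cumulative rounding assigns to each share
function part_ranges(n::Integer, parts::AbstractVector{<:Real})
    @assert sum(parts) ≈ 1
    ranges = UnitRange{Int}[]
    current = 1
    cum = 0.0
    for (i, p) in enumerate(parts)
        cum += p
        # Pin the last boundary to n so floating-point drift cannot drop a row
        last_row = i == length(parts) ? n : round(Int, cum * n)
        push!(ranges, current:last_row)
        current = last_row + 1
    end
    return ranges
end

two_way = part_ranges(10, [0.7, 0.3])          # [1:7, 8:10]
three_way = part_ranges(10, [0.6, 0.2, 0.2])   # [1:6, 7:8, 9:10]
```

These ranges index into the (possibly shuffled) row permutation, which is why the same permutation has to be shared across all arrays being partitioned.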

Antonello answered Oct 16 '22 12:10


using Pkg
Pkg.add("Lathe")
using Lathe.preprocess: TrainTestSplit
train, test = TrainTestSplit(df)

There is also a positional argument, at, in the second position that takes a percentage to split at.

Emmett Boudreau answered Oct 16 '22 12:10