Julia DataFrames - How to do one-hot encoding?

I'm using Julia's DataFrames.jl package. I have a dataframe with a column in which each entry is a list of strings (e.g. ["Type A", "Type B", "Type D"]). How does one perform a one-hot encoding on such a column? I wasn't able to find a pre-built function in the DataFrames.jl package.

Here is an example of what I want to do:

Original Dataframe

col1 | col2  |
102  | [a]   |
103  | [a,b] |
102  | [c,b] |

After One-hot encoding

col1 | a | b | c |
102  | 1 | 0 | 0 |
103  | 1 | 1 | 0 |
102  | 0 | 1 | 1 |
Davi Barreira asked Oct 28 '20 01:10


3 Answers

There is no built-in function for this, but it is easy enough to do with the basic functions DataFrames.jl provides:

julia> df = DataFrame(x=rand([1:3;missing], 20))
20×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 1       │
│ 2   │ 2       │
│ 3   │ missing │
│ 4   │ 1       │
│ 5   │ 3       │
│ 6   │ missing │
│ 7   │ 3       │
│ 8   │ 3       │
│ 9   │ 3       │
│ 10  │ 3       │
│ 11  │ missing │
│ 12  │ 1       │
│ 13  │ 3       │
│ 14  │ 3       │
│ 15  │ 3       │
│ 16  │ 1       │
│ 17  │ missing │
│ 18  │ 1       │
│ 19  │ 1       │
│ 20  │ missing │

julia> ux = unique(df.x); transform(df, @. :x => ByRow(isequal(ux)) .=> Symbol(:x_, ux))
20×5 DataFrame
│ Row │ x       │ x_1  │ x_2  │ x_missing │ x_3  │
│     │ Int64?  │ Bool │ Bool │ Bool      │ Bool │
├─────┼─────────┼──────┼──────┼───────────┼──────┤
│ 1   │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 2   │ 2       │ 0    │ 1    │ 0         │ 0    │
│ 3   │ missing │ 0    │ 0    │ 1         │ 0    │
│ 4   │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 5   │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 6   │ missing │ 0    │ 0    │ 1         │ 0    │
│ 7   │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 8   │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 9   │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 10  │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 11  │ missing │ 0    │ 0    │ 1         │ 0    │
│ 12  │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 13  │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 14  │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 15  │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 16  │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 17  │ missing │ 0    │ 0    │ 1         │ 0    │
│ 18  │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 19  │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 20  │ missing │ 0    │ 0    │ 1         │ 0    │

EDIT:

Another example:

julia> df = DataFrame(col1=102:104, col2=[["a"], ["a","b"], ["c","b"]])
3×2 DataFrame
│ Row │ col1  │ col2       │
│     │ Int64 │ Array…     │
├─────┼───────┼────────────┤
│ 1   │ 102   │ ["a"]      │
│ 2   │ 103   │ ["a", "b"] │
│ 3   │ 104   │ ["c", "b"] │

julia> ux = unique(reduce(vcat, df.col2))
3-element Array{String,1}:
 "a"
 "b"
 "c"

julia> transform(df, :col2 .=> [ByRow(v -> x in v) for x in ux] .=> Symbol.(:col2_, ux))
3×5 DataFrame
│ Row │ col1  │ col2       │ col2_a │ col2_b │ col2_c │
│     │ Int64 │ Array…     │ Bool   │ Bool   │ Bool   │
├─────┼───────┼────────────┼────────┼────────┼────────┤
│ 1   │ 102   │ ["a"]      │ 1      │ 0      │ 0      │
│ 2   │ 103   │ ["a", "b"] │ 1      │ 1      │ 0      │
│ 3   │ 104   │ ["c", "b"] │ 0      │ 1      │ 1      │
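
The result above keeps col2 and produces Bool columns, while the question asks for col1 followed by 0/1 columns named after the labels. A minimal sketch of one way to get that shape, assuming a recent DataFrames.jl release; the names labels, indicator_pairs and encoded are made up for this illustration:

```julia
using DataFrames

# Same data as in the question (note the repeated 102 in col1).
df = DataFrame(col1 = [102, 103, 102], col2 = [["a"], ["a", "b"], ["c", "b"]])

# Distinct labels across all rows, in order of first appearance.
labels = unique(reduce(vcat, df.col2))

# One `source => function => target` pair per label; Int(...) turns the Bool into 0/1.
indicator_pairs = [:col2 => ByRow(v -> Int(label in v)) => Symbol(label) for label in labels]

# select keeps only what is listed: col1 plus the new indicator columns, so col2 is dropped.
encoded = select(df, :col1, indicator_pairs...)
```

Using select instead of transform is what drops the original list column; swap it back to transform if col2 should be kept.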
Bogumił Kamiński answered Oct 17 '22 06:10


I have included a one-hot encoding function, based on @Bogumil's code, in DataConvenience.jl:

https://github.com/xiaodaigh/DataConvenience.jl#one-hot-encoding

Just do

onehot(df, :col2)

Full MWE

using DataFrames, DataConvenience

a = DataFrame(
  player1 = ["a", "b", "c"],
  player2 = ["d", "c", "a"]
)

# does not modify a
onehot(a, :player1)

# modifies a
onehot!(a, :player1)
xiaodai answered Oct 17 '22 06:10


There is indeed no one-hot encoding function in DataFrames.jl. I would argue that this is sensible, as this is a particular machine learning transformation that should live in an ML package rather than in a basic DataFrames package.

You've got two options I think:

  1. Use an ML package that does this for you, e.g. MLJ.jl. In MLJ, the OneHotEncoder is a model that transforms any table with Finite features in it into a one-hot encoded version of itself; see the MLJ documentation.

  2. Use a regression package that automatically generates dummy columns for categorical variables using the StatsModels @formula API. If you fit a regression with e.g. GLM.jl and your formula is @formula(y ~ x), where x is a categorical variable, the model matrix will automatically be constructed by contrast coding x, i.e. having binary dummy columns for all but one level of x.

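For the second option, you ideally want your data to be categorical (although strings will work as well). When this answer was written, DataFrames.jl included the categorical! function for this; it has since been removed, and in current versions the conversion is done with categorical from CategoricalArrays.jl.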
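
The contrast coding described in the second option can be inspected without fitting a model. The following is only a sketch, assuming StatsModels.jl's ModelFrame interface and a made-up four-row dataset; the explicit 1 in the formula spells out the intercept:

```julia
using DataFrames, StatsModels

# Hypothetical data: y is a numeric response, x a string column with levels a, b, c.
df = DataFrame(y = [1.0, 2.0, 3.0, 4.0], x = ["a", "b", "c", "a"])

# ModelFrame applies the formula to the data; string columns are treated as categorical.
mf = ModelFrame(@formula(y ~ 1 + x), df)

# Intercept column plus one dummy column for each level of x except the base level "a".
X = StatsModels.modelmatrix(mf)
```

Here X has three columns rather than four: the row for level "a" is encoded by the intercept alone, which is the "all but one level" behaviour mentioned above, so this is dummy coding rather than a full one-hot encoding.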

EDIT 17/11/2021: There has since been a definitive thread on this on the Julia Discourse which contains an extensive list of suggestions for doing one-hot encoding: https://discourse.julialang.org/t/all-the-ways-to-do-one-hot-encoding/

Sharing my favourite from there:

julia> x = [1, 2, 1, 3, 2];

julia> unique(x) .== permutedims(x)
3×5 BitMatrix:
 1  0  1  0  0
 0  1  0  0  1
 0  0  0  1  0
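
To get from that matrix back to a table, one possibility (a sketch, assuming DataFrames.jl 1.x and the illustrative names m and onehot_df) is to flip the broadcast so that rows are observations and then wrap the result in a DataFrame:

```julia
using DataFrames

x = [1, 2, 1, 3, 2]
ux = unique(x)

# Rows are observations, columns are levels (the transpose of the matrix above).
m = x .== permutedims(ux)

# Wrap the matrix in a DataFrame with one named Bool column per level.
onehot_df = DataFrame(m, Symbol.(:x_, ux))
```

The column names x_1, x_2, x_3 follow the naming used in the first answer; any vector of unique names of the right length would do.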
Nils Gudat answered Oct 17 '22 06:10