
Julia DataFrames - How to do one-hot encoding?

Tags: dataframe, one-hot-encoding, julia

I'm using Julia's DataFrames.jl package. I have a data frame with a column in which each entry is a list of strings (e.g. ["Type A", "Type B", "Type D"]). How can I one-hot encode that column? I wasn't able to find a pre-built function for this in the DataFrames.jl package.

Here is an example of what I want to do:

Original Dataframe

col1 | col2 |
102  |[a]   |
103  |[a,b] | 
102  |[c,b] |

After One-hot encoding

col1 | a | b | c |
102  | 1 | 0 | 0 |
103  | 1 | 1 | 0 | 
102  | 0 | 1 | 1 |


asked Oct 28 '20 01:10

Davi Barreira

3 Answers

It is easy enough to do with the basic functions we provide, though:

julia> df = DataFrame(x=rand([1:3;missing], 20))
20×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 1       │
│ 2   │ 2       │
│ 3   │ missing │
│ 4   │ 1       │
│ 5   │ 3       │
│ 6   │ missing │
│ 7   │ 3       │
│ 8   │ 3       │
│ 9   │ 3       │
│ 10  │ 3       │
│ 11  │ missing │
│ 12  │ 1       │
│ 13  │ 3       │
│ 14  │ 3       │
│ 15  │ 3       │
│ 16  │ 1       │
│ 17  │ missing │
│ 18  │ 1       │
│ 19  │ 1       │
│ 20  │ missing │

julia> ux = unique(df.x); transform(df, @. :x => ByRow(isequal(ux)) .=> Symbol(:x_, ux))
20×5 DataFrame
│ Row │ x       │ x_1  │ x_2  │ x_missing │ x_3  │
│     │ Int64?  │ Bool │ Bool │ Bool      │ Bool │
├─────┼─────────┼──────┼──────┼───────────┼──────┤
│ 1   │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 2   │ 2       │ 0    │ 1    │ 0         │ 0    │
│ 3   │ missing │ 0    │ 0    │ 1         │ 0    │
│ 4   │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 5   │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 6   │ missing │ 0    │ 0    │ 1         │ 0    │
│ 7   │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 8   │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 9   │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 10  │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 11  │ missing │ 0    │ 0    │ 1         │ 0    │
│ 12  │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 13  │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 14  │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 15  │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 16  │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 17  │ missing │ 0    │ 0    │ 1         │ 0    │
│ 18  │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 19  │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 20  │ missing │ 0    │ 0    │ 1         │ 0    │
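For reference, the `@.` macro in the call above only adds broadcasting dots, so the same transformation can be spelled out step by step. The following is a minimal sketch on a smaller, hand-written column (the data and the names `preds`, `ops` and `encoded` are made up for illustration). `isequal` is used rather than `==` so that `missing` gets its own indicator column instead of propagating.

```julia
using DataFrames

# Small illustrative column; not the random data used above.
df = DataFrame(x = [1, 2, missing, 1, 3])

ux = unique(df.x)        # distinct values in order of first appearance
preds = isequal.(ux)     # one predicate per distinct value

# One `source => function => target` operation per distinct value.
ops = :x .=> ByRow.(preds) .=> Symbol.(:x_, ux)

encoded = transform(df, ops)
```

Each element of `ops` has the form `:x => ByRow(f) => :x_value`, which is the usual source, function, target form that `transform` accepts.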

EDIT:

Another example:

julia> df = DataFrame(col1=102:104, col2=[["a"], ["a","b"], ["c","b"]])
3×2 DataFrame
│ Row │ col1  │ col2       │
│     │ Int64 │ Array…     │
├─────┼───────┼────────────┤
│ 1   │ 102   │ ["a"]      │
│ 2   │ 103   │ ["a", "b"] │
│ 3   │ 104   │ ["c", "b"] │

julia> ux = unique(reduce(vcat, df.col2))
3-element Array{String,1}:
 "a"
 "b"
 "c"

julia> transform(df, :col2 .=> [ByRow(v -> x in v) for x in ux] .=> Symbol.(:col2_, ux))
3×5 DataFrame
│ Row │ col1  │ col2       │ col2_a │ col2_b │ col2_c │
│     │ Int64 │ Array…     │ Bool   │ Bool   │ Bool   │
├─────┼───────┼────────────┼────────┼────────┼────────┤
│ 1   │ 102   │ ["a"]      │ 1      │ 0      │ 0      │
│ 2   │ 103   │ ["a", "b"] │ 1      │ 1      │ 0      │
│ 3   │ 104   │ ["c", "b"] │ 0      │ 1      │ 1      │
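If the goal is the exact table from the question (0/1 integers, columns named after the labels, the list column dropped), the same idea could be wrapped in a small helper. This is only a sketch: `onehot_lists` is a hypothetical name, not a DataFrames.jl function, and it assumes the labels do not clash with existing column names.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): replaces a column of
# string vectors with one Int indicator column per distinct label.
function onehot_lists(df::AbstractDataFrame, col::Symbol)
    labels = sort!(unique(reduce(vcat, df[!, col])))
    indicators = [ByRow(v -> Int(l in v)) for l in labels]
    # Keep every column except `col`, then append the indicator columns.
    return select(df, Not(col), col .=> indicators .=> Symbol.(labels))
end

df = DataFrame(col1 = [102, 103, 102], col2 = [["a"], ["a", "b"], ["c", "b"]])
encoded = onehot_lists(df, :col2)
```

Sorting the labels is a design choice to make the column order predictable; dropping `sort!` keeps them in order of first appearance instead.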

answered Oct 17 '22 06:10

Bogumił Kamiński

I have included a one-hot encoding function, based on @Bogumil's code, in DataConvenience.jl:

https://github.com/xiaodaigh/DataConvenience.jl#one-hot-encoding

Just do

onehot(df, :col2)

Full MWE

using DataFrames, DataConvenience

a = DataFrame(
  player1 = ["a", "b", "c"],
  player2 = ["d", "c", "a"]
)

# does not modify a
onehot(a, :player1)

# modifies a
onehot!(a, :player1)

answered Oct 17 '22 06:10

xiaodai

There is indeed no one-hot encoding function in DataFrames.jl. I would argue that this is sensible, as one-hot encoding is a machine learning transformation that should live in an ML package rather than in a basic data frames package.

You've got two options, I think:

1. Use an ML package that does this for you, e.g. MLJ.jl. In MLJ, the OneHotEncoder is a model that transforms any table with Finite features into a one-hot encoded version of itself (see the MLJ docs).
2. Use a regression package that automatically generates dummy columns for categorical variables via the StatsModels @formula API. If you fit a regression with e.g. GLM.jl and your formula is @formula(y ~ x), where x is a categorical variable, the model matrix is constructed automatically by contrast coding x, i.e. with binary dummy columns for all but one level of x.

For the second option, you ideally want your data to be categorical (although strings will work as well), and for this DataFrames.jl includes the categorical! function (deprecated in later DataFrames.jl releases in favour of categorical from CategoricalArrays.jl).
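To illustrate the second option without fitting a model, the sketch below builds the model matrix directly with StatsModels.jl. The toy data are made up, and the sketch assumes the default dummy coding, in which the first level (here "a") is the reference level and gets no column of its own.

```julia
using DataFrames, StatsModels

# Toy data, made up for illustration: x is a string column with levels a, b, c.
df = DataFrame(y = [1.0, 2.0, 3.0, 4.0], x = ["a", "b", "c", "a"])

# String columns are treated as categorical when the formula is applied to the data.
mf = ModelFrame(@formula(y ~ x), df)

# Columns: intercept, indicator for x == "b", indicator for x == "c".
X = ModelMatrix(mf).m
```

Note that this is not a full one-hot encoding: the reference level is dropped so that the dummy columns are not collinear with the intercept.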

EDIT 17/11/2021: There has since been a definitive thread on this on the Julia Discourse which contains an extensive list of suggestions for doing one-hot encoding: https://discourse.julialang.org/t/all-the-ways-to-do-one-hot-encoding/

Sharing my favourite from there:

julia> x = [1, 2, 1, 3, 2];

julia> unique(x) .== permutedims(x)
3×5 BitMatrix:
 1  0  1  0  0
 0  1  0  0  1
 0  0  0  1  0
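The matrix above has one row per level and one column per observation. If one row per observation is preferred (as in a data frame), the two operands can be swapped. A minimal sketch in plain Julia, with `lvls` and `onehot` as illustrative names:

```julia
x = [1, 2, 1, 3, 2]

lvls = unique(x)                    # distinct values in order of first appearance

# Compare each observation (rows) against each level (columns).
onehot = x .== permutedims(lvls)    # 5×3 BitMatrix, one row per observation
```

Every row then contains exactly one `true`, in the column of the level that the observation takes.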

answered Oct 17 '22 06:10

Nils Gudat
