Julia: What is the perfect way to convert a categorical array to a numeric array?

Question

What is the perfect way to convert a categorical array to a simple numeric array? For example:

using CategoricalArrays
a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])
b = recode(a, "X"=>1, "Y"=>2, "Z"=>3)

As a result of the conversion, we still get a categorical array, even if we explicitly specify the type of assigned values:

b = recode(a, "X"=>1::Int64, "Y"=>2::Int64, "Z"=>3::Int64)

It looks like some other approach is needed here, but I don't know where to look.

Bogumił Kamiński · Accepted Answer

You have two natural options:

julia> recode(unwrap.(a), "X"=>1, "Y"=>2, "Z"=>3)
7-element Vector{Int64}:
 1
 1
 2
 3
 2
 2
 3

or

julia> mapping = Dict("X"=>1, "Y"=>2, "Z"=>3)
Dict{String, Int64} with 3 entries:
  "Y" => 2
  "Z" => 3
  "X" => 1

julia> [mapping[v] for v in a]
7-element Vector{Int64}:
 1
 1
 2
 3
 2
 2
 3

The Dict approach is slower, but it is more flexible if you have many levels to map.

The key function here is unwrap, which drops the "categorical" notion of a CategoricalValue (in the Dict approach, unwrap gets called automatically).
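To make the role of unwrap concrete, here is a small sketch that assumes the array a from the question. The names v, raw, and plain are only illustrative.

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

# Indexing a CategoricalArray returns a CategoricalValue wrapper, not a String.
v = a[1]
@show v isa CategoricalValue

# unwrap returns the underlying value that the wrapper refers to.
raw = unwrap(v)
@show raw isa String

# Broadcasting unwrap yields an ordinary (non-categorical) array of the raw values.
plain = unwrap.(a)
@show plain isa CategoricalArray
@show eltype(plain)
```

Because plain is no longer categorical, passing it to recode gives a regular numeric vector, as in the first option above.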

Also note that if you just want the level codes of the values stored in a CategoricalArray (something that R does by default), you can do:

julia> levelcode.(a)
7-element Vector{Int64}:
 1
 1
 2
 3
 2
 2
 3

Also note that levelcode maps missing to missing:

julia> x = CategoricalArray(["Y", "X", missing, "Z"])
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "Y"
 "X"
 missing
 "Z"

julia> levelcode.(x)
4-element Vector{Union{Missing, Int64}}:
 2
 1
  missing
 3
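One thing to keep in mind is that level codes follow the current order of the levels, so they only match a hand-written mapping if the levels are in the expected order. The sketch below assumes the array a from the question and uses levels! to reorder the levels. The variable names are illustrative.

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

# With the default (sorted) level order, "X" is level 1, "Y" is 2, and "Z" is 3.
codes_default = levelcode.(a)
@show codes_default

# Reordering the levels changes the codes, although the stored values are the same.
levels!(a, ["Z", "Y", "X"])
codes_reordered = levelcode.(a)
@show codes_reordered
```

If the numeric values carry meaning of their own, an explicit recode or Dict mapping is the safer choice.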

Milan Bouchet-Valat · Answer

In addition to Bogumił's answers, a possible approach that should be quite fast is:

julia> b = recode!(similar(a, Int), a, "X"=>1, "Y"=>2, "Z"=>3)
7-element Vector{Int64}:
 1
 1
 2
 3
 2
 2
 3

ahnlabb · Answer

Bogumił's answer covers most of the question but I think it could be useful to add one more solution:

unwrap.(recode(a, "X"=>1, "Y"=>2, "Z"=>3))

As the length of the CategoricalArray grows relative to the number of categories, this solution becomes more performant than any of the other solutions (as of this moment), and it seems like a very natural solution to me (it is almost identical to OP's attempt). More importantly, the fact that it is more performant in these cases illustrates how a CategoricalArray is stored and what actually happens when these functions are called.

By calling dump on a you can see the structure of this categorical array. Here is a simplified version:

CategoricalVector{String, UInt32, String, CategoricalValue{String, UInt32}, Union{}}
  refs: UInt32[0x00000001, 0x00000001, 0x00000002, 0x00000003, 0x00000002, 0x00000002, 0x00000003]
  pool: CategoricalPool{String, UInt32, CategoricalValue{String, UInt32}}
    levels: String["X","Y","Z"]
    invindex: Dict{String, UInt32}("Y" => 0x00000002, "Z" => 0x00000003, "X" => 0x00000001)

Each category is encoded as a UInt32. The encoded values are stored in the Vector refs. The CategoricalPool pool contains:

levels: a mapping from the level-code to the category (Vector{String}, the "key" is the index)
invindex: a mapping from the category to the level-code (Dict{String, UInt32})
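The relationship between the codes and the levels can be checked through the public API, without touching the refs or pool fields. This is a minimal sketch that assumes the array a from the question. The names lv, codes, and rebuilt are illustrative.

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

# Plain level values; the broadcasted unwrap is a defensive no-op
# if levels already returns raw values.
lv = unwrap.(levels(a))

# One integer code per element, pointing into the level list.
codes = levelcode.(a)

# Indexing the levels by the codes reproduces the original values.
rebuilt = lv[codes]
@show rebuilt == unwrap.(a)
```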

This structure can be recoded very efficiently. In many cases, we could create a categorical array with new categories without touching the refs at all, just by swapping out the part of the pool that describes the code:

mapping = Dict("X"=>1, "Y"=>2, "Z"=>3)
b = CategoricalArray{Int64,1,UInt32}(undef, 0)
b.refs = a.refs
levels!(b.pool, [mapping[l] for l in levels(a.pool)])

In the actual recode function, a new empty categorical array of the same length as a is created, and more edge cases are considered (perhaps most importantly the case where multiple categories are collapsed into one in the new code).

The broadcasted unwrap then consists of simple lookups in the pool that Julia is able to optimize very well.
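The collapsing case mentioned above can be tried directly. The sketch below assumes the array a from the question and maps both "X" and "Y" to 1. The names b and c are illustrative.

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

# Two source categories are merged into a single target value.
b = recode(a, "X" => 1, "Y" => 1, "Z" => 2)

# The recoded array is still categorical, now with two levels instead of three.
@show b isa CategoricalArray
@show length(levels(b))

# Broadcasting unwrap turns it into a plain integer vector.
c = unwrap.(b)
@show c
```

This is a case where simply relabelling the pool is not enough, because the number of levels changes.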

Benchmarks

Bogumił Kamiński's first solution:

@btime recode(unwrap.($a), "X"=>1, "Y"=>2, "Z"=>3)

length	btime result
100	1.268 μs (5 allocations: 1.84 KiB)
1000	14.872 μs (5 allocations: 15.97 KiB)
10000	151.881 μs (7 allocations: 156.50 KiB)

Bogumił Kamiński's second solution:

@btime [$mapping[v] for v in $a]

length	btime result
100	2.439 μs (101 allocations: 4.00 KiB)
1000	23.715 μs (1001 allocations: 39.19 KiB)
10000	240.292 μs (10002 allocations: 390.70 KiB)

Milan Bouchet-Valat's solution:

@btime recode!(similar($a, Int), $a, "X"=>1, "Y"=>2, "Z"=>3)

length	btime result
100	2.158 μs (104 allocations: 4.09 KiB)
1000	21.347 μs (1004 allocations: 39.28 KiB)
10000	208.035 μs (10005 allocations: 390.80 KiB)

This solution:

@btime unwrap.(recode($a, "X"=>1, "Y"=>2, "Z"=>3))

length	btime result
100	2.360 μs (45 allocations: 4.56 KiB)
1000	4.420 μs (45 allocations: 15.20 KiB)
10000	20.212 μs (47 allocations: 120.55 KiB)

Tags: arrays, julia, categorical-data

Anton Degterev
