Julia: What is the perfect way to convert a categorical array to a numeric array?

What is the perfect way to convert a categorical array to a simple numeric array? For example:

using CategoricalArrays
a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])
b = recode(a, "X"=>1, "Y"=>2, "Z"=>3)

As a result of the conversion, we still get a categorical array, even if we explicitly specify the type of assigned values:

b = recode(a, "X"=>1::Int64, "Y"=>2::Int64, "Z"=>3::Int64)

It looks like a different approach is needed here, but I don't know where to look.

Anton Degterev asked May 10 '21 11:05


3 Answers

You have two natural options:

julia> recode(unwrap.(a), "X"=>1, "Y"=>2, "Z"=>3)
7-element Vector{Int64}:
 1
 1
 2
 3
 2
 2
 3

or

julia> mapping = Dict("X"=>1, "Y"=>2, "Z"=>3)
Dict{String, Int64} with 3 entries:
  "Y" => 2
  "Z" => 3
  "X" => 1

julia> [mapping[v] for v in a]
7-element Vector{Int64}:
 1
 1
 2
 3
 2
 2
 3

The Dict approach is slower, but it is more flexible if you have many levels to map.
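With many levels, the mapping does not have to be typed out by hand. The following is only a sketch: the `labels` vector is a hypothetical stand-in for a longer list, and using each label's position as its code is an assumption made for illustration.

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

# Hypothetical list of labels; in practice this could be much longer
labels = ["X", "Y", "Z"]

# Assign each label its position in the list as the numeric code
mapping = Dict(label => i for (i, label) in enumerate(labels))

# Same lookup as in the answer, now with the generated mapping
b = [mapping[v] for v in a]
println(b)
```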

The key function here is unwrap, which drops the "categorical" wrapper from a CategoricalValue and returns the underlying value (in the Dict approach, unwrap gets called automatically).
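To see what unwrap does on its own, it can help to apply it to a single element first. This is a small sketch using the array from the question; the variable names `v` and `s` are just for illustration.

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

v = a[3]        # indexing yields a CategoricalValue wrapping "Y"
s = unwrap(v)   # the plain String stored behind it

println(v isa CategoricalValue)
println(s isa String)

# Broadcasting unwrap over the whole array gives an ordinary Vector
plain = unwrap.(a)
println(typeof(plain))
```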

Also note that if you just want the level codes of the values stored in a CategoricalArray (something that R does by default), you can simply do:

julia> levelcode.(a)
7-element Vector{Int64}:
 1
 1
 2
 3
 2
 2
 3

Also note that with levelcode missing is mapped to missing:

julia> x = CategoricalArray(["Y", "X", missing, "Z"])
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "Y"
 "X"
 missing
 "Z"

julia> levelcode.(x)
4-element Vector{Union{Missing, Int64}}:
 2
 1
  missing
 3

Bogumił Kamiński answered Nov 02 '22 08:11


In addition to Bogumił's answers, a possible approach that should be quite fast is:

julia> b = recode!(similar(a, Int), a, "X"=>1, "Y"=>2, "Z"=>3)
7-element Vector{Int64}:
 1
 1
 2
 3
 2
 2
 3

Milan Bouchet-Valat answered Nov 02 '22 09:11


Bogumił's answer covers most of the question but I think it could be useful to add one more solution:

unwrap.(recode(a, "X"=>1, "Y"=>2, "Z"=>3))

As the length of the CategoricalArray grows relative to the number of categories, this solution becomes faster than any of the other solutions posted so far, and it seems like a very natural solution to me (it is almost identical to the OP's attempt). More importantly, the reason it is faster in these cases illustrates how CategoricalArrays work and what actually happens when these functions are called.

By calling dump on a you can see the structure of this categorical array. Here is a simplified version:

CategoricalVector{String, UInt32, String, CategoricalValue{String, UInt32}, Union{}}
  refs: UInt32[0x00000001, 0x00000001, 0x00000002, 0x00000003, 0x00000002, 0x00000002, 0x00000003]
  pool: CategoricalPool{String, UInt32, CategoricalValue{String, UInt32}}
    levels: String["X","Y","Z"]
    invindex: Dict{String, UInt32}("Y" => 0x00000002, "Z" => 0x00000003, "X" => 0x00000001)

Each category is encoded as a UInt32. The encoded values are stored in the vector refs. The CategoricalPool pool contains:

  • levels: a mapping from the level-code to the category (Vector{String}, the "key" is the index)
  • invindex: a mapping from the category to the level-code (Dict{String, UInt32})
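The same two pieces can be inspected without reaching into the fields shown by dump. The sketch below uses the exported functions levels and levelcode instead of the internal refs and pool fields, and checks that indexing the levels by the codes gives back the original values.

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

lv = levels(a)          # the categories, in level order
codes = levelcode.(a)   # one integer code per element

println(lv)
println(codes)

# Looking up each code in the levels reconstructs the data
restored = lv[codes]
println(restored == unwrap.(a))
```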

This structure can be recoded very efficiently. In many cases, we could create a categorical array with new categories without touching the refs at all, just by swapping out the part of the pool that describes the codes:

mapping = Dict("X"=>1, "Y"=>2, "Z"=>3)
b = CategoricalArray{Int64,1,UInt32}(undef, 0)
b.refs = a.refs
levels!(b.pool, [mapping[l] for l in levels(a.pool)])

In the actual recode function, a new empty categorical array of the same length as a is created and more edge cases are considered (perhaps most importantly, the case where multiple categories are collapsed into one in the new code).
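The collapsing case mentioned above can be sketched as follows. The particular grouping (sending both "X" and "Y" to 1) is an arbitrary choice for illustration, not something taken from the question.

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

# A vector on the left-hand side maps several categories to one value
c = recode(a, ["X", "Y"] => 1, "Z" => 2)

# Three input categories have been collapsed into two levels
println(levels(c))

# unwrap turns the recoded categorical array into plain integers
println(unwrap.(c))
```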

The broadcasted unwrap then consists of simple lookups in the pool that Julia is able to optimize very well.

Benchmarks
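How the input arrays for the timings were built is not shown, so the setup below is an assumption: a random draw from the three categories at a hypothetical length `n`. The sketch only checks that the four approaches agree on such an input; the timings themselves would additionally need BenchmarkTools for @btime.

```julia
using CategoricalArrays

n = 1_000                                   # hypothetical length
a = categorical(rand(["X", "Y", "Z"], n))   # assumed way of building the input
mapping = Dict("X" => 1, "Y" => 2, "Z" => 3)

# The four approaches from the answers, in the order benchmarked
r1 = recode(unwrap.(a), "X" => 1, "Y" => 2, "Z" => 3)
r2 = [mapping[v] for v in a]
r3 = recode!(similar(a, Int), a, "X" => 1, "Y" => 2, "Z" => 3)
r4 = unwrap.(recode(a, "X" => 1, "Y" => 2, "Z" => 3))

# All of them should produce the same numeric vector
println(r1 == r2 == r3 == r4)
```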

Bogumił Kamiński's first solution:

@btime recode(unwrap.($a), "X"=>1, "Y"=>2, "Z"=>3)

length   btime result
100      1.268 μs (5 allocations: 1.84 KiB)
1000     14.872 μs (5 allocations: 15.97 KiB)
10000    151.881 μs (7 allocations: 156.50 KiB)

Bogumił Kamiński's second solution:

@btime [$mapping[v] for v in $a]

length   btime result
100      2.439 μs (101 allocations: 4.00 KiB)
1000     23.715 μs (1001 allocations: 39.19 KiB)
10000    240.292 μs (10002 allocations: 390.70 KiB)

Milan Bouchet-Valat's solution:

@btime recode!(similar($a, Int), $a, "X"=>1, "Y"=>2, "Z"=>3)

length   btime result
100      2.158 μs (104 allocations: 4.09 KiB)
1000     21.347 μs (1004 allocations: 39.28 KiB)
10000    208.035 μs (10005 allocations: 390.80 KiB)

This solution:

@btime unwrap.(recode($a, "X"=>1, "Y"=>2, "Z"=>3))

length   btime result
100      2.360 μs (45 allocations: 4.56 KiB)
1000     4.420 μs (45 allocations: 15.20 KiB)
10000    20.212 μs (47 allocations: 120.55 KiB)

ahnlabb answered Nov 02 '22 08:11