What is the best way to convert a categorical array to a plain numeric array? For example:
using CategoricalArrays
a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])
b = recode(a, "X"=>1, "Y"=>2, "Z"=>3)
As a result of the conversion, we still get a categorical array, even if we explicitly specify the type of assigned values:
b = recode(a, "X"=>1::Int64, "Y"=>2::Int64, "Z"=>3::Int64)
It looks like some other approach is needed here, but I can't think of where to look.
You have two natural options:
julia> recode(unwrap.(a), "X"=>1, "Y"=>2, "Z"=>3)
7-element Vector{Int64}:
1
1
2
3
2
2
3
or
julia> mapping = Dict("X"=>1, "Y"=>2, "Z"=>3)
Dict{String, Int64} with 3 entries:
"Y" => 2
"Z" => 3
"X" => 1
julia> [mapping[v] for v in a]
7-element Vector{Int64}:
1
1
2
3
2
2
3
The Dict approach is slower, but it is more flexible if you have many levels to map.
The key function here is unwrap, which drops the "categorical" wrapper of a CategoricalValue (in the Dict approach unwrap gets called automatically).
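To make the role of unwrap concrete, here is a minimal sketch using the array `a` from the question. The variable names `v`, `plain`, and `strings` are only for illustration.

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

v = a[1]              # indexing returns a CategoricalValue, not a String
plain = unwrap(v)     # the underlying String value

# Broadcasting unwrap over the whole array gives an ordinary Vector{String},
# which is why recode on it returns a plain Vector rather than a CategoricalArray.
strings = unwrap.(a)
```

Once the wrapper is gone, any function that works on strings (including recode) sees plain values and has no reason to return a categorical result.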
Also note that if you just want to get the level codes of the values stored in a CategoricalArray (something that R does by default), you can just do:
julia> levelcode.(a)
7-element Vector{Int64}:
1
1
2
3
2
2
3
With levelcode, missing is mapped to missing:
julia> x = CategoricalArray(["Y", "X", missing, "Z"])
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
"Y"
"X"
missing
"Z"
julia> levelcode.(x)
4-element Vector{Union{Missing, Int64}}:
2
1
missing
3
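If a plain `Vector{Int64}` is needed even when missing values are present, one possible follow-up is to replace `missing` with a sentinel. This is only a sketch: the choice of `0` as the sentinel is an assumption made here (it is safe in the sense that real level codes start at 1), not something the original answer prescribes.

```julia
using CategoricalArrays

x = CategoricalArray(["Y", "X", missing, "Z"])

codes = levelcode.(x)          # element type is Union{Missing, Int64}

# Replace missing with the (assumed) sentinel 0; the result no longer
# allows missing, so it is an ordinary integer vector.
filled = coalesce.(codes, 0)
```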
In addition to Bogumił's answers, a possible approach that should be quite fast is:
julia> b = recode!(similar(a, Int), a, "X"=>1, "Y"=>2, "Z"=>3)
7-element Vector{Int64}:
1
1
2
3
2
2
3
Bogumił's answer covers most of the question but I think it could be useful to add one more solution:
unwrap.(recode(a, "X"=>1, "Y"=>2, "Z"=>3))
As the length of the CategoricalArray grows relative to the number of categories, this solution becomes faster than any of the other solutions (as of this writing), and it seems like a very natural solution to me (it is almost identical to the OP's attempt). More importantly, the fact that it is faster in these cases illustrates how CategoricalArrays works and what actually happens when these functions are called.
By calling dump on a you can see the structure of this categorical array. Here is a simplified version:
CategoricalVector{String, UInt32, String, CategoricalValue{String, UInt32}, Union{}}
  refs: UInt32[0x00000001, 0x00000001, 0x00000002, 0x00000003, 0x00000002, 0x00000002, 0x00000003]
  pool: CategoricalPool{String, UInt32, CategoricalValue{String, UInt32}}
    levels: String["X","Y","Z"]
    invindex: Dict{String, UInt32}("Y" => 0x00000002, "Z" => 0x00000003, "X" => 0x00000001)
Each category is encoded as a UInt32. The encoded values are stored in the Vector refs. The CategoricalPool pool contains:
levels: a mapping from the level code to the category (Vector{String}, the "key" is the index)
invindex: a mapping from the category to the level code (Dict{String, UInt32})
This structure can very efficiently be recoded. In many cases, we could create a categorical array with new categories without touching the refs at all by just swapping out the pool part that describes the code:
mapping = Dict("X"=>1, "Y"=>2, "Z"=>3)
b = CategoricalArray{Int64,1,UInt32}(undef, 0)
b.refs = a.refs
levels!(b.pool, [mapping[l] for l in levels(a.pool)])
In the actual recode function a new empty categorical array of the same length as a is created and more edge cases are considered (perhaps most importantly the case when multiple categories are collapsed into one in the new code).
The broadcasted unwrap then consists of simple lookups in the pool that Julia is able to optimize very well.
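The same idea can be sketched with public functions only, without touching the `refs` and `pool` fields: translate each level once, then index into that small table with the level codes. This is an illustration of why the cost scales with the number of levels rather than the number of elements. It is not how recode is implemented internally, and the name `table` is introduced here just for the example.

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])
mapping = Dict("X"=>1, "Y"=>2, "Z"=>3)

# One Dict lookup per level (3 lookups here), regardless of length(a).
table = [mapping[l] for l in levels(a)]

# One cheap array index per element: the level code selects the new value.
b = table[levelcode.(a)]
```

Note that this sketch assumes `a` contains no missing values, since `levelcode` would then return `missing`, which cannot be used as an index.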
Bogumił Kamiński's first solution:
@btime recode(unwrap.($a), "X"=>1, "Y"=>2, "Z"=>3)
| length | btime result |
|---|---|
| 100 | 1.268 μs (5 allocations: 1.84 KiB) |
| 1000 | 14.872 μs (5 allocations: 15.97 KiB) |
| 10000 | 151.881 μs (7 allocations: 156.50 KiB) |
Bogumił Kamiński's second solution:
@btime [$mapping[v] for v in $a]
| length | btime result |
|---|---|
| 100 | 2.439 μs (101 allocations: 4.00 KiB) |
| 1000 | 23.715 μs (1001 allocations: 39.19 KiB) |
| 10000 | 240.292 μs (10002 allocations: 390.70 KiB) |
Milan Bouchet-Valat's solution:
@btime recode!(similar($a, Int), $a, "X"=>1, "Y"=>2, "Z"=>3)
| length | btime result |
|---|---|
| 100 | 2.158 μs (104 allocations: 4.09 KiB) |
| 1000 | 21.347 μs (1004 allocations: 39.28 KiB) |
| 10000 | 208.035 μs (10005 allocations: 390.80 KiB) |
This solution:
@btime unwrap.(recode($a, "X"=>1, "Y"=>2, "Z"=>3))
| length | btime result |
|---|---|
| 100 | 2.360 μs (45 allocations: 4.56 KiB) |
| 1000 | 4.420 μs (45 allocations: 15.20 KiB) |
| 10000 | 20.212 μs (47 allocations: 120.55 KiB) |