What is the best way to convert a categorical array to a plain numeric array? For example:
using CategoricalArrays
a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])
b = recode(a, "X"=>1, "Y"=>2, "Z"=>3)
As a result of the conversion, we still get a categorical array, even if we explicitly specify the type of assigned values:
b = recode(a, "X"=>1::Int64, "Y"=>2::Int64, "Z"=>3::Int64)
It looks like some other approach is needed here, but I can't think of where to look.
You have two natural options:
julia> recode(unwrap.(a), "X"=>1, "Y"=>2, "Z"=>3)
7-element Vector{Int64}:
1
1
2
3
2
2
3
or
julia> mapping = Dict("X"=>1, "Y"=>2, "Z"=>3)
Dict{String, Int64} with 3 entries:
"Y" => 2
"Z" => 3
"X" => 1
julia> [mapping[v] for v in a]
7-element Vector{Int64}:
1
1
2
3
2
2
3
The `Dict` approach is slower, but it is more flexible if you have many levels to map.
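To illustrate the flexibility mentioned above, here is a minimal sketch that builds the mapping programmatically instead of typing every pair. The `labels` vector and the fallback value `0` are assumptions made for this example, not part of the original answer.

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

# Hypothetical label list; in practice it might come from a file or a lookup table.
labels = ["X", "Y", "Z"]
mapping = Dict(label => i for (i, label) in enumerate(labels))

# `get` with a fallback (0 here, an arbitrary choice) avoids a KeyError
# if the array contains a level that is not in the mapping.
b = [get(mapping, unwrap(v), 0) for v in a]
```

The result is an ordinary `Vector{Int64}`, the same as with the hand-written `Dict`.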
The key function here is `unwrap`, which drops the "categorical" notion of `CategoricalValue` (in the `Dict` style, `unwrap` gets called automatically).
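A short sketch of what `unwrap` does on a single element and on the whole array, using the same example data as the question:

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

v = a[1]                    # elements are CategoricalValue wrappers, not plain Strings
is_wrapped = v isa CategoricalValue
is_plain = unwrap(v) isa String

plain = unwrap.(a)          # broadcasting unwraps every element into a regular array
```

Here `is_wrapped` and `is_plain` should both be `true`, and `plain` holds the same strings as `a` without the categorical wrapper.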
Also note that if you just want the `levelcode`s of the values stored in a `CategoricalArray` (something that R does by default), you can just do:
julia> levelcode.(a)
7-element Vector{Int64}:
1
1
2
3
2
2
3
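One thing to keep in mind is that level codes follow the order of the levels, not the order of appearance in the data. The sketch below reorders the levels with `levels!` to show the effect; the new order is an arbitrary choice for illustration.

```julia
using CategoricalArrays

c = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])
before = levelcode.(c)         # codes under the default (sorted) level order

levels!(c, ["Z", "Y", "X"])    # reorder the levels in place
after = levelcode.(c)          # same data, but "X" is now the third level
```

If the numeric codes need a specific meaning, an explicit mapping with `recode` or a `Dict` is therefore the safer choice.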
Also note that with `levelcode`, `missing` is mapped to `missing`:
julia> x = CategoricalArray(["Y", "X", missing, "Z"])
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
"Y"
"X"
missing
"Z"
julia> levelcode.(x)
4-element Vector{Union{Missing, Int64}}:
2
1
missing
3
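If a plain integer vector is needed despite the missing values, one possible approach (an addition to the answer, not part of it) is to replace `missing` with a sentinel using `coalesce` from Base. The sentinel `0` is an assumption; it is convenient because level codes start at 1.

```julia
using CategoricalArrays

x = CategoricalArray(["Y", "X", missing, "Z"])

codes = levelcode.(x)           # element type allows missing
filled = coalesce.(codes, 0)    # replace missing with 0, which is never a valid level code
```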
In addition to Bogumił's answers, a possible approach that should be quite fast is:
julia> b = recode!(similar(a, Int), a, "X"=>1, "Y"=>2, "Z"=>3)
7-element Vector{Int64}:
1
1
2
3
2
2
3
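Because `recode!` writes into a destination that you supply, the buffer can be allocated once and reused. A minimal sketch, assuming a plain `Vector{Int}` destination (which is what `similar(a, Int)` produces in the output above):

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

dest = Vector{Int}(undef, length(a))        # plain numeric buffer, allocated once
recode!(dest, a, "X"=>1, "Y"=>2, "Z"=>3)    # fills dest in place
first_pass = copy(dest)

# The same buffer can be refilled with a different coding.
recode!(dest, a, "X"=>10, "Y"=>20, "Z"=>30)
```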
Bogumił's answer covers most of the question but I think it could be useful to add one more solution:
unwrap.(recode(a, "X"=>1, "Y"=>2, "Z"=>3))
As the length of the `CategoricalArray` grows relative to the number of categories, this solution becomes faster than any of the other solutions (as of this moment) and seems like a very natural solution to me (it is almost identical to OP's attempt). More importantly, the fact that it is faster in these cases illustrates how `CategoricalArray`s work and what is actually happening when these functions are called.
By calling `dump` on `a` you can see the structure of this categorical array. Here is a simplified version:
CategoricalVector{String, UInt32, String, CategoricalValue{String, UInt32}, Union{}}
refs: UInt32[0x00000001, 0x00000001, 0x00000002, 0x00000003, 0x00000002, 0x00000002, 0x00000003]
pool: CategoricalPool{String, UInt32, CategoricalValue{String, UInt32}}
levels: String["X","Y","Z"]
invindex: Dict{String, UInt32}("Y" => 0x00000002, "Z" => 0x00000003, "X" => 0x00000001)
Each category is encoded as a `UInt32`. The encoded values are stored in the `Vector` `refs`. The `CategoricalPool` `pool` contains:
- `levels`: a mapping from the level code to the category (`Vector{String}`; the "key" is the index)
- `invindex`: a mapping from the category to the level code (`Dict{String, UInt32}`)

This structure can be recoded very efficiently. In many cases, we could create a categorical array with new categories without touching the `refs` at all, just by swapping out the `pool` part that describes the coding:
mapping = Dict("X"=>1, "Y"=>2, "Z"=>3)
b = CategoricalArray{Int64,1,UInt32}(undef, 0)
b.refs = a.refs
levels!(b.pool, [mapping[l] for l in levels(a.pool)])
In the actual `recode` function, a new empty categorical array of the same length as `a` is created and more edge cases are considered (perhaps most importantly, the case where multiple categories are collapsed into one in the new coding).
The broadcasted `unwrap` then consists of simple lookups in the pool, which Julia is able to optimize very well.
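To make the "simple lookups" concrete, here is a sketch that reproduces `unwrap.(b)` by hand, indexing the level table with the stored codes. It reads the internal `refs` field shown in the dump above, so treat it as an illustration of the idea rather than as supported API.

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])
b = recode(a, "X"=>1, "Y"=>2, "Z"=>3)   # still categorical, now with Int levels

# Level table: position i holds the value for code i.
# unwrap. is a no-op if the levels are already plain values.
lv = unwrap.(levels(b))

# One array lookup per element; this only works as-is when there are no missing values.
manual = [lv[r] for r in b.refs]

same = manual == unwrap.(b)
```

The expensive part, matching strings against the recoding pairs, happens once per level inside `recode`; the per-element work is just this integer indexing.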
Bogumił Kamiński's first solution:
@btime recode(unwrap.($a), "X"=>1, "Y"=>2, "Z"=>3)
length | btime result |
---|---|
100 | 1.268 μs (5 allocations: 1.84 KiB) |
1000 | 14.872 μs (5 allocations: 15.97 KiB) |
10000 | 151.881 μs (7 allocations: 156.50 KiB) |
Bogumił Kamiński's second solution:
@btime [$mapping[v] for v in $a]
length | btime result |
---|---|
100 | 2.439 μs (101 allocations: 4.00 KiB) |
1000 | 23.715 μs (1001 allocations: 39.19 KiB) |
10000 | 240.292 μs (10002 allocations: 390.70 KiB) |
Milan Bouchet-Valat's solution:
@btime recode!(similar($a, Int), $a, "X"=>1, "Y"=>2, "Z"=>3)
length | btime result |
---|---|
100 | 2.158 μs (104 allocations: 4.09 KiB) |
1000 | 21.347 μs (1004 allocations: 39.28 KiB) |
10000 | 208.035 μs (10005 allocations: 390.80 KiB) |
This solution:
@btime unwrap.(recode($a, "X"=>1, "Y"=>2, "Z"=>3))
length | btime result |
---|---|
100 | 2.360 μs (45 allocations: 4.56 KiB) |
1000 | 4.420 μs (45 allocations: 15.20 KiB) |
10000 | 20.212 μs (47 allocations: 120.55 KiB) |
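For anyone who wants to reproduce these measurements, a sketch of a possible benchmark script follows. The answer does not say how the input arrays were generated, so the random construction in `make_input` (a hypothetical helper) is an assumption, and the timings will differ by machine and package version.

```julia
using CategoricalArrays, BenchmarkTools

# Assumption: inputs are random draws from the three categories.
make_input(n) = CategoricalArray(rand(["X", "Y", "Z"], n))
mapping = Dict("X"=>1, "Y"=>2, "Z"=>3)

for n in (100, 1000, 10000)
    a = make_input(n)
    println("length = ", n)
    @btime recode(unwrap.($a), "X"=>1, "Y"=>2, "Z"=>3)
    @btime [$mapping[v] for v in $a]
    @btime recode!(similar($a, Int), $a, "X"=>1, "Y"=>2, "Z"=>3)
    @btime unwrap.(recode($a, "X"=>1, "Y"=>2, "Z"=>3))
end
```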