How to store categorical variables in Julia? For example, given an array of strings like ["Apple", "Orange", "Banana", "Orange", "Apple", "Banana", "Apple"], is there a suitable data structure for treating it as a categorical type? Similarly, when working with a large number of DNA sequences of varying lengths, what would be the most efficient way to represent and work with such data?
There are also the CategoricalArrays and IndirectArrays packages.
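As a minimal sketch of the CategoricalArrays route, assuming the third-party package is installed (it is not part of the standard library), the fruit vector from the question could be wrapped like this. The variable name `fruit` is just for illustration.

```julia
using CategoricalArrays  # third-party; install with: import Pkg; Pkg.add("CategoricalArrays")

# Wrap the raw strings; each distinct string is stored once as a level,
# and the elements are stored as small integer references to those levels.
fruit = categorical(["Apple", "Orange", "Banana", "Orange", "Apple", "Banana", "Apple"])

fruit_levels = levels(fruit)      # the distinct categories
fruit_codes = levelcode.(fruit)   # integer code of each element within the levels
```

The integer codes are what make comparisons and grouping cheap; the strings are only looked up when a value is displayed.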
There's a ton that depends on what exactly you want to do with this. Here are a few general tools though that may be useful:
One tool that I use a lot is the sparse matrix. If you're not familiar with them already, the basic gist is that they are an efficient way to store (memory-wise) and work with (speed-wise) matrices with large numbers of zeros. Most statistical analyses of categorical data use sparse matrices one way or another, even if a statistical program does it "under the hood". Specifically, each value of the categorical variable is represented as a separate column in a data matrix, with a 1 in the rows where that value occurs and 0 elsewhere. For a statistical analysis, you'll generally also drop one of the values of the categorical variable as a "base" state in order to avoid perfect collinearity.
In any case, below is a function that I wrote for myself that I use for this. It will convert a categorical vector into a sparse matrix. The function has two keyword options that you can tinker with:
drop: include all of the values in the matrix (drop = false) or drop the first one as the "base" state (drop = true).
header: also output a separate list of column names (header = true), which you can then use in creating, for instance, a larger total data matrix.
If you had multiple categorical variables, you'd just use this function on them multiple times and then splice together the final Array, DataFrame, or whatever.
It's a bit of a "do-it-yourself" solution. There may well be packages that do what you specifically need more easily, but maybe not. This approach has the advantage of giving you a very general data structure that can be plugged pretty easily into whatever math equations or algorithms specify your analysis from here.
using SparseArrays ## needed for sparse() on Julia 0.7 and later

function OneHot(x::Vector; header::Bool = false, drop::Bool = true)
    UniqueVals = unique(x) ## note: don't sort this - that will mess up order of vals to idx.
    Val_to_Idx = Dict(Val => Idx for (Idx, Val) in enumerate(UniqueVals)) ## create a dictionary that maps unique values in the input array to column positions in the new sparse matrix.
    ColIdx = [Val_to_Idx[Val] for Val in x] ## column position of each observation
    MySparse = sparse(collect(1:length(x)), ColIdx, ones(Int32, length(x)))
    StartIdx = drop ? 2 : 1 ## drop = true leaves out the first value as the "base" state
    if header
        return (MySparse[:, StartIdx:end], UniqueVals[StartIdx:end]) ## I.e. gives you back a tuple, second element is the header which you can then feed to something to name the columns or do whatever else with
    else
        return MySparse[:, StartIdx:end]
    end
end
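As a usage sketch of the same idea, the block below one-hot encodes the fruit vector from the question and concatenates the result with a continuous column. The compact helper `onehot_sketch` is a hypothetical restatement, not the function above, and the `price` values are made-up illustration data. It assumes Julia 1.x, where `sparse` lives in the SparseArrays standard library.

```julia
using SparseArrays  # standard library in Julia 1.x; provides `sparse`

# Hypothetical compact helper: one column per unique value, in order of first
# appearance. With drop = true the first value is left out as the base state.
function onehot_sketch(x::AbstractVector; drop::Bool = true)
    levels = unique(x)
    col_of = Dict(v => j for (j, v) in enumerate(levels))
    cols = [col_of[v] for v in x]                 # column index of each row
    n = length(x)
    M = sparse(collect(1:n), cols, ones(Float64, n), n, length(levels))
    first_col = drop ? 2 : 1
    return M[:, first_col:end], levels[first_col:end]
end

fruit = ["Apple", "Orange", "Banana", "Orange", "Apple", "Banana", "Apple"]
dummies, header = onehot_sketch(fruit)            # "Apple" becomes the base state

price = [1.2, 0.8, 0.5, 0.9, 1.1, 0.6, 1.3]       # made-up continuous variable
X = sparse([dummies price])                       # 7 rows: 2 dummy columns + price
```

Creating the ones as Float64 here means no element type conversion is needed when the dummy columns are joined with the Float64 column.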
Additional Comments:
If you want to combine the resulting sparse matrix A with other data B, such as continuous variables, you can concatenate them with sparse([A B]).
If your continuous data is Float32, Float64, or whatever, then you might as well change ones(Int32, length(x)) in the function to create the ones as whatever type your continuous data is, since the ones will just get converted to that anyway when you combine the sparse matrix with your continuous data.
Going to the other end of the spectrum in terms of "do-it-yourself-ness", there is a PooledDataArray type in the DataArrays package. It is more efficient memory-wise for storing data with categorical variables that have many repeated values. DataArrays has since been deprecated; the PooledArrays and CategoricalArrays packages now fill this role.
One other helpful tool here is run length encoding. If you have a vector in which a value appears many times in a row, then run length encoding can be a more efficient way to store and work with it, because each run is stored once as a value plus a count. There is an RLEVectors package for Julia, and I believe that its developers had DNA and genomics stuff as their original use case.
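To make the idea concrete, here is a minimal run length encoding sketch in plain Julia. The helper names `rle_encode` and `rle_decode` are hypothetical and are not the RLEVectors API; the short DNA string is made-up example data.

```julia
# Collapse consecutive repeats into (value, run length) pairs.
function rle_encode(x::AbstractVector{T}) where {T}
    vals = T[]      # one entry per run
    lens = Int[]    # length of each run
    for item in x
        if !isempty(vals) && vals[end] == item
            lens[end] += 1          # extend the current run
        else
            push!(vals, item)       # start a new run
            push!(lens, 1)
        end
    end
    return vals, lens
end

# Expand the runs back into the original vector.
rle_decode(vals, lens) = [v for (v, n) in zip(vals, lens) for _ in 1:n]

seq = collect("AAAACCGGGT")         # Vector{Char} of bases
vals, lens = rle_encode(seq)        # vals == ['A','C','G','T'], lens == [4, 2, 3, 1]
restored = rle_decode(vals, lens)
```

The saving depends entirely on how long the runs are; a sequence with few repeats can take more space encoded than raw.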
I would simply suggest you create your own type, that holds the data as an array of integer indices, and contains a "categories" field which contains the mapping between the indices and the category 'names' themselves e.g.
struct Categorical
    data::Array{Int64}
    categories::Dict{Int64, String}
end
(also perhaps optionally an ordinal::Bool field?)
Once you have that you can create / overload methods appropriately to present results visually the way you want them, and perform comparisons and other operations via their indices under the hood.
1) create a categorical object:
julia> A = Categorical([1 2;2 1;2 2;1 1], Dict(1=>"male", 2=>"female"))
2) create a default display style for categorical objects:
julia> import Base.display
julia> display(A::Categorical) = display([A.categories[i] for i in A.data])
julia> A
4×2 Array{String,2}:
"male" "female"
"female" "male"
"female" "female"
"male" "male"
3) create comparisons:
julia> Base.Broadcast.broadcasted(::typeof(==), A::Categorical, B::Categorical) = A.data .== B.data
julia> B = Categorical([1 2;2 1;1 1;2 2], Dict(1=>"male", 2=>"female"));
julia> A .== B
4×2 BitArray{2}:
true true
true true
false false
false false
etc
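Building on the same design, the sketch below shows one way such a type could be constructed directly from raw strings like the fruit vector in the question. The names `CategoricalSketch` and `decode` are hypothetical, and the codes are assigned in order of first appearance, which is an assumption rather than anything the approach requires.

```julia
# Integer codes plus a lookup table from code to category name.
struct CategoricalSketch
    data::Vector{Int}
    categories::Dict{Int,String}
end

# Convenience constructor: derive the codes and the lookup table from raw strings.
function CategoricalSketch(raw::AbstractVector{<:AbstractString})
    labels = unique(raw)                                   # order of first appearance
    code_of = Dict(label => i for (i, label) in enumerate(labels))
    categories = Dict(i => String(label) for (i, label) in enumerate(labels))
    return CategoricalSketch([code_of[v] for v in raw], categories)
end

# Map the integer codes back to their category names.
decode(c::CategoricalSketch) = [c.categories[i] for i in c.data]

fruit = CategoricalSketch(["Apple", "Orange", "Banana", "Orange", "Apple", "Banana", "Apple"])
codes = fruit.data          # [1, 2, 3, 2, 1, 3, 1]
names_back = decode(fruit)  # the original strings again
```

Comparisons and counting can then run on `codes` alone, with the strings consulted only for display.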