
How to work with categorical data in Julia?

Tags:

julia

How can I store categorical variables in Julia? For example, given an array of strings like ["Apple", "Orange", "Banana", "Orange", "Apple", "Banana", "Apple"], is there a suitable data structure for treating it as a categorical type? Similarly, when working with DNA sequences, where we need to handle a large number of sequences of varying lengths, what is the most efficient way to represent and work with such data?

Abhijith asked Sep 16 '16 10:09


3 Answers

There are also the CategoricalArrays and IndirectArrays packages.
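As a rough illustration of the first of these, here is a minimal sketch applied to the array from the question. It assumes the third-party CategoricalArrays package is installed; the variable name `fruit` is just for illustration.

```julia
using CategoricalArrays  # third-party package: import Pkg; Pkg.add("CategoricalArrays")

# Wrap the plain string vector in a categorical array.
fruit = categorical(["Apple", "Orange", "Banana", "Orange", "Apple", "Banana", "Apple"])

# The distinct categories; for strings these are sorted by default.
levels(fruit)

# Integer code of each element, i.e. its position within levels(fruit).
levelcode.(fruit)
```

Each element is stored as a small integer reference into the list of levels, so repeated strings are not stored over and over.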

tholy answered Oct 22 '22 11:10


A lot depends on exactly what you want to do with this. Here are a few general tools that may be useful, though:

Sparse Matrices

One tool that I use a lot is the sparse matrix. If you're not already familiar with them, the basic gist is that they are an efficient way to store (memory-wise) and work with (processing-speed-wise) matrices that are mostly zeros. Most statistical analyses of categorical data rely on this kind of representation one way or another, even if a statistical program does it "under the hood": each value of the categorical variable becomes a separate indicator column in the data matrix, with a 1 in the rows that take that value and a 0 elsewhere. For a statistical analysis, you'll generally also drop one of the values of the categorical variable as a "base" state in order to avoid perfect collinearity.

In any case, below is a function that I wrote for myself for this purpose. It converts a categorical vector into a sparse matrix. The function has two keyword options, drop and header, that let you:

  • Include all of the values in the matrix or drop one as the "base" state.

  • Output a separate list of column names, which you can then use in creating, for instance, a larger total data matrix.

If you had multiple categorical variables, you'd just use this function on them multiple times and then splice together the final Array, DataFrame, or whatever.

It's a bit of a "do-it-yourself" solution. There may well be packages that do what you specifically need more easily, but maybe not. The advantage is that you get a very general data structure that can be plugged straight into whatever equations or algorithms specify your analysis from here.

using SparseArrays  ## `sparse` lives in the SparseArrays standard library on Julia 0.7 and later

function OneHot(x::Vector; header::Bool = false, drop::Bool = true)
    UniqueVals = unique(x)  ## note: don't sort this - that will mess up order of vals to idx.
    Val_to_Idx = Dict(Val => Idx for (Idx, Val) in enumerate(UniqueVals))  ## dictionary that maps each unique value in the input array to its column position in the new sparse matrix
    ColIdx = [Val_to_Idx[Val] for Val in x]  ## column index for every row of the output
    MySparse = sparse(collect(1:length(x)), ColIdx, ones(Int32, length(x)))  ## a single 1 per row, placed in the column of that row's value
    StartIdx = drop ? 2 : 1  ## drop = true leaves out the first unique value as the "base" state
    if header
        return (MySparse[:,StartIdx:end], UniqueVals[StartIdx:end])  ## I.e. gives you back a tuple, second element is the header which you can then feed to something to name the columns or do whatever else with
    else
        return MySparse[:,StartIdx:end]
    end
end

Additional Comments:

  • If you have both categorical and continuous variables, you can then put them together into a single sparse matrix, e.g. sparse([A B])
  • If you are going to do this, and your continuous variables are stored as Float32, Float64 or similar, then you might as well change ones(Int32, length(x)) in the function to create the ones in whatever type your continuous data uses, since they will be converted to that type anyway when you combine the sparse matrix with your continuous data.
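To make the two comments above concrete, here is a small self-contained sketch, assuming Julia 0.7 or later. The `weight` values are made-up illustration data and the variable names are hypothetical; the indicator columns are built inline rather than through the function above so that the snippet runs on its own.

```julia
using SparseArrays

fruit  = ["Apple", "Orange", "Banana", "Orange", "Apple", "Banana", "Apple"]
weight = [150.0, 130.0, 120.0, 140.0, 155.0, 118.0, 149.0]  # hypothetical continuous variable

# Map each distinct value to a column, in order of first appearance.
vals   = unique(fruit)
col_of = Dict(v => i for (i, v) in enumerate(vals))

# Indicator columns created directly as Float64 so no conversion is needed later.
rows   = collect(1:length(fruit))
cols   = [col_of[v] for v in fruit]
onehot = sparse(rows, cols, ones(Float64, length(fruit)))

# Drop the first value ("Apple") as the base state and join with the continuous column.
X = sparse([weight onehot[:, 2:end]])
```

Here X has one row per observation, the continuous variable in the first column, and one indicator column for each non-base category.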

PooledDataArray

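Going to the other end of the spectrum in terms of "do-it-yourself-ness", there is a PooledDataArray type in the DataArrays package. It stores each distinct value once and keeps small integer references to it, which is more memory-efficient for categorical variables with many repeated values. (DataArrays has since been deprecated; on current Julia versions, the PooledArrays and CategoricalArrays packages fill this role.)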
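The idea behind pooled storage can be sketched in plain Julia. This is only an illustration of the principle, not the actual implementation of any of these packages, and the names `pool`, `index_of` and `refs` are made up for the example.

```julia
# 3000 strings, but only 3 distinct values.
x = repeat(["Apple", "Orange", "Banana"], 1000)

# Store each distinct value once...
pool = unique(x)

# ...and replace every element by a one-byte reference into the pool.
# UInt8 is enough here because there are far fewer than 256 distinct values.
index_of = Dict(v => UInt8(i) for (i, v) in enumerate(pool))
refs = [index_of[v] for v in x]

# The original data can be recovered by indexing the pool with the references.
decoded = pool[refs]
```

The references take one byte per element regardless of how long the strings are, which is where the memory saving comes from.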

Run Length Encoding

One other helpful tool here is run length encoding. If you have a vector in which a value appears many times in a row, then run length encoding can be a more efficient way to store and work with it. There is an RLEVectors package for Julia, and I believe its developers had DNA and genomics as their original use case.
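For a feel of what run length encoding does, here is a minimal plain-Julia sketch. It does not use the RLEVectors API; the function names `run_length_encode` and `run_length_decode` are hypothetical helpers written for this example.

```julia
# Collapse consecutive repeats into (values, run lengths).
function run_length_encode(x::AbstractVector{T}) where {T}
    vals = T[]
    lens = Int[]
    for v in x
        if !isempty(vals) && vals[end] == v
            lens[end] += 1      # same value as the previous element: extend the current run
        else
            push!(vals, v)      # new value: start a new run of length 1
            push!(lens, 1)
        end
    end
    return vals, lens
end

# Expand (values, run lengths) back into the full vector.
run_length_decode(vals, lens) = [v for (v, n) in zip(vals, lens) for _ in 1:n]

dna = collect("AAAACCGGGTTTTTT")
vals, lens = run_length_encode(dna)
# vals == ['A', 'C', 'G', 'T'] and lens == [4, 2, 3, 6]
```

Fifteen bases are stored as four runs here. The saving depends entirely on how long the runs are, so this pays off for data with long repeats and not for sequences that change at almost every position.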

Michael Ohlrogge answered Oct 22 '22 09:10


I would simply suggest you create your own type, that holds the data as an array of integer indices, and contains a "categories" field which contains the mapping between the indices and the category 'names' themselves e.g.

struct Categorical  # `type` was renamed in Julia 0.7; use `mutable struct` if the fields need to be reassigned
  data::Array{Int64}
  categories::Dict{Int64, String}
end

(also perhaps optionally an ordinal::Bool field?)

Once you have that you can create / overload methods appropriately to present results visually the way you want them, and perform comparisons and other operations via their indices under the hood.


Examples:
1) create a categorical array
julia> A = Categorical([1 2;2 1;2 2;1 1], Dict(1=>"male", 2=>"female"))

2) create a default display style for categorical objects:

julia> import Base.display
julia> display(A::Categorical) = display([A.categories[i] for i in A.data])
julia> A
4×2 Array{String,2}:
 "male"    "female"
 "female"  "male"  
 "female"  "female"
 "male"    "male"  

3) create comparisons:

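julia> B = Categorical([1 2;2 1;1 1;2 2], Dict(1=>"male", 2=>"female"));  # assumed: B is not defined in the original answer; these values reproduce the output below
julia> Base.Broadcast.broadcasted(::typeof(==), A::Categorical, B::Categorical) = A.data .== B.data  # since Julia 0.6 dotted operators cannot be overloaded directly, so hook into broadcasting instead
julia> A .== B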
4×2 BitArray{2}:
  true   true
  true   true
 false  false
 false  false

etc

Tasos Papastylianou answered Oct 22 '22 10:10