Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient handling of sparsely missing data in Haskell

I am trying to use Haskell for data analysis. Because my datasets are reasonably large (hundreds of thousands and potentially millions of observations), I would ideally like to use an unboxed data structure for efficiency, say Data.Vector.Unboxed.

The problem is that the data contain some missing values. I want to avoid coding these as "99" or similar because that's just an ugly hack and a potential source of bugs. From my Haskell newbie point of view, I can think of the following options:

  1. A boxed vector of unpacked Maybe values. Something like (please correct if wrong):
    data myMaybe a = Nothing | Just {-# UNPACK #-} !a
  2. An unboxed vector of (unboxable) tuples, whith a boolean element indicating missingness:
    newtype instance Data.Vector.Unboxed.Vector (MyDatum a) = MyDatum (Data.Vector.Unboxed.Vector (Bool,a))
    This may be the same approach as chosen by the OP of this question (modulo Int for Bool), but the only answer doesn't seem to explicitly address the issue of missing values/sparsity (instead focusing on how to represent an entire array unboxed, rather than as a boxed vector of unboxed vectors).
  3. A tuple of unboxed vectors, one with the values, the other with the indices at which missing values are to be injected, or the run lengths of non-missing values, or some equivalent information. This might be preferable to option 2. if missings are sparse?

I'm trying to stay within a vector representation rather than something like this, because it's the missing values that are sparse, not the data.

Any comments on the relative merits/feasibility/off-the-shelf-availability/likely performance of these options, or indeed pointers to entirely different alternatives, are welcome!

Edit:

  • It's been pointed out that the answer potentially depends on what kind of operations I intend to perform on the data. At the moment, it seems more convenient to store each observation in a single vector, rather than each variable. Since the entries in the vector will therefore refer to different variables, "fold"-like ops are unlikely.
  • I'm guessing 2. will internally store the "valid bit" vector à la 3. automatically if appropriate, so 3. could be dropped?
like image 589
Bilal Barakat Avatar asked Nov 13 '11 09:11

Bilal Barakat


1 Answers

I'd go with option 3, but you should not use a vector to store the missing-indizes: that gives you O(nMissing) lookup time, which is unreasonably slow unless the missing data is extremely sparse. Data.IntMap should do the job well, you can then easily use the member function to check if an index points to a missing observation. Hash tables are even better but probably not necessary.

like image 181
leftaroundabout Avatar answered Oct 27 '22 22:10

leftaroundabout