I am trying to use Haskell for data analysis. Because my datasets are reasonably large (hundreds of thousands and potentially millions of observations), I would ideally like to use an unboxed data structure for efficiency, say Data.Vector.Unboxed.
The problem is that the data contain some missing values. I want to avoid coding these as "99" or similar because that's just an ugly hack and a potential source of bugs. From my Haskell newbie point of view, I can think of the following options:
Maybe values. Something like (please correct if wrong): data MyMaybe a = Nothing | Just {-# UNPACK #-} !a
newtype instance Data.Vector.Unboxed.Vector (MyDatum a) = MyDatum (Data.Vector.Unboxed.Vector (Bool,a))
Int for Bool), but the only answer doesn't seem to explicitly address the issue of missing values/sparsity (instead focusing on how to represent an entire array unboxed, rather than as a boxed vector of unboxed vectors).
I'm trying to stay within a vector representation rather than something like this, because it's the missing values that are sparse, not the data.
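For the tuple-with-Bool idea above, here is a minimal sketch of what it could look like in practice. It relies on the fact that Data.Vector.Unboxed already has Unbox instances for Bool and for pairs of unboxable types, so a plain vector of (Bool, Double) may be enough without a hand-written newtype instance. The names Observations, fromMaybes, observation and meanPresent are made up for illustration, and Double is an assumed element type.

```haskell
import qualified Data.Vector.Unboxed as U

-- Hypothetical alias: each observation is an (isPresent, value) pair.
-- Unboxed vectors of pairs are stored as two parallel unboxed arrays.
type Observations = U.Vector (Bool, Double)

-- Build from a list of Maybe values. A missing entry stores a dummy 0
-- that is never read, because its flag is False.
fromMaybes :: [Maybe Double] -> Observations
fromMaybes = U.fromList . map encode
  where
    encode (Just x) = (True, x)
    encode Nothing  = (False, 0)

-- Recover the Maybe view of one observation.
-- Out-of-range indices also yield Nothing.
observation :: Observations -> Int -> Maybe Double
observation v i = case v U.!? i of
  Just (True, x) -> Just x
  _              -> Nothing

-- Mean over the present values only; Nothing if every value is missing.
meanPresent :: Observations -> Maybe Double
meanPresent v
  | n == 0    = Nothing
  | otherwise = Just (U.sum present / fromIntegral n)
  where
    present = U.map snd (U.filter fst v)
    n       = U.length present
```

The cost of this layout is one extra Bool slot per observation, even when almost nothing is missing.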
Any comments on the relative merits/feasibility/off-the-shelf-availability/likely performance of these options, or indeed pointers to entirely different alternatives, are welcome!
Edit:
I'd go with option 3, but you should not use a vector to store the missing indices: that gives you O(nMissing) lookup time, which is unreasonably slow unless the missing data is extremely sparse. Data.IntMap should do the job well; you can then easily use the member function to check whether an index points to a missing observation. Hash tables are even better but probably not necessary.
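As a rough sketch of that suggestion, the values can live in an unboxed vector while the missing indices are kept as keys of a Data.IntMap. This assumes the value vector keeps a placeholder slot at every missing index, so positions line up with observation numbers. The Sparse record and the functions toSparse and lookupObs are hypothetical names, and Double is an assumed element type.

```haskell
import qualified Data.IntMap.Strict as IM
import           Data.Maybe (fromMaybe)
import qualified Data.Vector.Unboxed as U

-- Hypothetical record: dense unboxed values plus the missing indices.
data Sparse = Sparse
  { values  :: !(U.Vector Double) -- one slot per observation; missing slots hold a placeholder 0
  , missing :: !(IM.IntMap ())    -- keys are the indices of missing observations
  }

-- Build from a list of Maybe values, recording where the gaps are.
toSparse :: [Maybe Double] -> Sparse
toSparse xs = Sparse
  { values  = U.fromList (map (fromMaybe 0) xs)
  , missing = IM.fromList [ (i, ()) | (i, Nothing) <- zip [0 ..] xs ]
  }

-- Check the index map first, then read the unboxed vector.
-- Out-of-range indices also yield Nothing.
lookupObs :: Sparse -> Int -> Maybe Double
lookupObs s i
  | i `IM.member` missing s = Nothing
  | otherwise               = values s U.!? i
```

Since the map carries no payload, Data.IntSet from the same containers package would serve equally well here; IntMap is used only to match the answer's wording.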