Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the best way to store and sort rectangular data in Haskell?

I have a handful of ASCII files containing around 17 million lines in total, and within each/most lines is a fixed 36-byte identifier. So my data is rectangular: I have a lot of rows of fixed width. Using Haskell, I want to read all the lines in, use a regex to extract the identifier (I'm fine up to there), then sort them and count the number of unique identifiers (very close to grep | sort | uniq). (I'm already parallelising by reading from each file in parallel.) Sounds like a simple problem , but...

I'm finding it hard to get decent performance out of this problem, even before the sorting stage. Here's as far as I've got. String is overkill for 36-bytes of ASCII, so I figured on using ByteString. But a (linked) list of size 17 million seems like a bad idea, so I tried IOVector ByteString. But this seems to be quite slow. I believe the garbage collection is suffering as I retain millions of small ByteStrings in the vector: the GC is taking at least 3 times as long as the code (according to +RTS -s) and I think it only gets worse as the program keeps running.

I was thinking that I should maybe use Repa or some sort of single giant ByteString/IOVector Char8/whatever (since I know the exact width of each row is 36) to store the data in one massive rectangular array for each thread, which should alleviate the GC problem. However, I do still need to sort the rows afterwards, and Repa seems to have no support for sorting, and I don't want to be writing sort algorithms myself. So I don't know how to have a giant rectangular array and yet still sort it.

Suggestions for libraries to use, GC flags to set, or anything else? The machine has 24 cores and 24GB of RAM, so I'm not constrained on hardware. I want to remain in Haskell because I have lots of related code (that is also parsing the same files and producing summary statistics) that is working fine, and I don't want to rewrite it.

like image 507
Neil Brown Avatar asked Nov 26 '25 00:11

Neil Brown


1 Answers

I believe the garbage collection is suffering as I retain millions of small ByteStrings in the vector

Suspicious. Retaining ByteStrings should not be collected. Maybe there is excessive data copying somewhere in your code?

  • ByteString is a header (8 bytes) + ForeignPtr Word8 ref (8 bytes) + Int offset (4 bytes) + Int length (4 bytes)

  • ForeignPtr is a header (8 bytes) + Addr# (8 bytes) + PlainPtr ref (8 bytes)

  • PlainPtr is a header (8 bytes) + MutableByteArray# ref (8 bytes)

(Revised according to https://stackoverflow.com/a/3256825/648955)

All in all, ByteString overhead is at least 64 bytes (correct me, of some fields are shared).

Write your own data management: big flat Word8 array and adhoc offset wrapper

newtype ByteId = ByteId { offset :: Word64 }

with Ord instance.

Overhead would be 8 bytes per identifier. Store offsets in unboxed Vector. Sort with something like this: http://hackage.haskell.org/packages/archive/vector-algorithms/0.5.4.2/doc/html/Data-Vector-Algorithms-Intro.html#v:sort

like image 83
leventov Avatar answered Nov 28 '25 16:11

leventov