I've got a 279MB file that contains ~10 million key/value pairs, with ~500,000 unique keys. It's grouped by key (each key only needs to be written once), so all the values for a given key are together.
What I want to do is transpose the association: create a file where the pairs are grouped by value, so all the keys for a given value are stored together.
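Conceptually, the transformation looks like this (keys and values made up for illustration):

[ ("k1", ["a", "b"]), ("k2", ["a"]) ]     -- input: grouped by key
[ ("a", ["k1", "k2"]), ("b", ["k1"]) ]    -- desired output: grouped by value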
Currently, I'm using Parsec to read in the pairs as a list of tuples (K,[V])
(using lazy IO so I can process it as a stream while Parsec is processing the input file), where:
newtype K = K Text deriving (Show, Eq, Ord, Hashable)
newtype V = V Text deriving (Show, Eq, Ord, Hashable)
tupleParser :: Parser (K,[V])
tupleParser = ...
data ErrList e a = Cons a (ErrList e a) | End | Err e

parseAllFromFile :: Parser a -> FilePath -> IO (ErrList ParseError a)
parseAllFromFile parser inputFile = do
  contents <- readFile inputFile
  let Right initialState = parse getParserState inputFile contents
  return $ loop initialState
  where
    -- Run the parser repeatedly, threading the parser state through,
    -- producing results lazily as an ErrList.
    loop state = case unconsume $ runParsecT parser' state of
      Error err              -> Err err
      Ok Nothing  _      _   -> End
      Ok (Just a) state' _   -> a `Cons` loop state'
    unconsume v = runIdentity $ case runIdentity v of
      Consumed ma -> ma
      Empty    ma -> ma
    -- Nothing signals that we've reached end of input.
    parser' = (Just <$> parser) <|> (const Nothing <$> eof)
I've tried to insert the tuples into a Data.HashMap.Map V [K]
to transpose the association:
transpose :: ErrList ParseError (K,[V]) -> Either ParseError [(V,[K])]
transpose = transpose' M.empty
  where
    transpose' _ (Err e)          = Left e
    transpose' m End              = Right $ assocs m
    transpose' m (Cons (k,vs) xs) = transpose' (L.foldl' (include k) m vs) xs
    -- For each value, cons the key onto that value's bucket.
    include k m v = M.insertWith (const (k:)) v [k] m
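(For context, a driver tying these pieces together would look roughly like the following; the input path and the output step are just placeholders.)

main :: IO ()
main = do
  pairs <- parseAllFromFile tupleParser "input.txt"   -- lazy stream of (K,[V]) chunks
  case transpose pairs of
    Left err  -> print err
    Right vks -> mapM_ print vks                      -- emit the value-grouped pairs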
But when I tried it, I got the error:
memory allocation failed (requested 2097152 bytes)
I can think of a couple of things I might be doing wrong. I'm leaning toward (1) for now, but I'm not sure by any means.
Is there a possibility that the data will grow? If so, I'd suggest not reading the whole file into memory, and processing the data in another way.
One simple possibility is to use a relational database for this. It would be fairly easy: just load your data in, create a proper index, and have it sorted the way you need. The database will do all the work for you. I'd definitely recommend this.
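For example, a rough sketch using sqlite-simple (the library choice, file names, schema, and plain-String pairs are all assumptions made here for illustration):

{-# LANGUAGE OverloadedStrings #-}
import Database.SQLite.Simple
import System.IO

-- Load the pairs into SQLite, index by value, then stream them back out
-- ordered by value so all keys for a value come out together.
transposeViaSqlite :: [(String, String)] -> IO ()
transposeViaSqlite pairs = do
  conn <- open "pairs.db"
  execute_ conn "CREATE TABLE IF NOT EXISTS pairs (k TEXT NOT NULL, v TEXT NOT NULL)"
  -- (Wrapping the bulk insert in withTransaction would make it much faster.)
  executeMany conn "INSERT INTO pairs (k, v) VALUES (?, ?)" pairs
  execute_ conn "CREATE INDEX IF NOT EXISTS pairs_by_value ON pairs (v)"
  withFile "transposed.txt" WriteMode $ \h ->
    -- fold_ streams the sorted rows one at a time, so nothing large is held in memory.
    fold_ conn "SELECT v, k FROM pairs ORDER BY v" () $ \() (v, k) ->
      hPutStrLn h (v ++ "\t" ++ k)
  close conn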
Another option would be to create your own file-based mechanism. For example:

- Pick some amount of data l that is reasonable to load into memory.
- Create n = d `div` l files, where d is the total amount of your data. (Hopefully this will not exceed your file descriptor limit. You could also close and reopen files after each operation, but this will make the process very slow.)
- Write each pair (k,v) into file number hash v `mod` n. This ensures that pairs with the same value v end up in the same file.
- Process each file separately in memory.

It is essentially a hash table with file buckets. This solution assumes that each value has roughly the same number of keys (otherwise some files could get exceptionally large).
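Here is a minimal sketch of that scheme, assuming the pairs arrive as a lazy list of plain String key/value pairs; the bucket count, file names, and tab-separated line format are made up for illustration:

import Data.Hashable (hash)
import Data.List (foldl')
import qualified Data.Map as Map
import System.IO

-- Phase 1: scatter each (key, value) pair into one of n bucket files,
-- chosen by hashing the value, so all keys for a value land in the same bucket.
scatter :: Int -> [(String, String)] -> IO ()
scatter n pairs = do
  handles <- mapM (\i -> openFile (bucketName i) WriteMode) [0 .. n - 1]
  mapM_ (\(k, v) -> hPutStrLn (handles !! (hash v `mod` n)) (v ++ "\t" ++ k)) pairs
  mapM_ hClose handles

-- Phase 2: each bucket is small enough to group in memory.
gatherBucket :: Int -> IO [(String, [String])]
gatherBucket i = do
  ls <- lines <$> readFile (bucketName i)
  let add m line = let (v, rest) = break (== '\t') line
                   in Map.insertWith (++) v [drop 1 rest] m
  return $ Map.toList (foldl' add Map.empty ls)

bucketName :: Int -> FilePath
bucketName i = "bucket-" ++ show i ++ ".txt"

After the scatter phase, each bucket file can be loaded and grouped independently, so peak memory is roughly one bucket rather than the whole data set.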
You could also implement an external sort which would allow you to sort basically any amount of data.
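A bare-bones sketch of an external merge sort over lines follows; the chunk size, temp-file names, and line layout are arbitrary, and in practice you would also delete the run files afterwards. For this problem each line could be "value<TAB>key", so sorting the lines groups keys by value:

import Data.List (sort)
import System.IO

-- Split the input into chunks that fit in memory, sort each chunk into a
-- temporary "run" file, then lazily merge all runs into the output.
externalSortLines :: Int -> FilePath -> FilePath -> IO ()
externalSortLines chunkSize input output = do
  ls <- lines <$> readFile input
  let chunks = chunksOf chunkSize ls
  runFiles <- mapM writeRun (zip [0 ..] chunks)
  runs <- mapM (fmap lines . readFile) runFiles   -- lazy reads, one line at a time
  writeFile output (unlines (foldr mergeRuns [] runs))
  where
    writeRun (i, chunk) = do
      let name = "run-" ++ show (i :: Int) ++ ".txt"
      writeFile name (unlines (sort chunk))
      return name

chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (a, b) = splitAt n xs in a : chunksOf n b

-- Merge two sorted lists; folding this over all runs gives one sorted stream.
mergeRuns :: Ord a => [a] -> [a] -> [a]
mergeRuns [] ys = ys
mergeRuns xs [] = xs
mergeRuns (x:xs) (y:ys)
  | x <= y    = x : mergeRuns xs (y:ys)
  | otherwise = y : mergeRuns (x:xs) ys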