I've got a 279MB file that contains ~10 million key/value pairs, with ~500,000 unique keys. It's grouped by key (each key only needs to be written once), so all the values for a given key are together.
What I want to do is transpose the association: create a file where the pairs are grouped by value, so all the keys for a given value are stored together.
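Conceptually, the transformation looks like this (keys and values made up for illustration):

[ ("k1", ["a", "b"]), ("k2", ["a"]) ]     -- input: grouped by key
[ ("a", ["k1", "k2"]), ("b", ["k1"]) ]    -- desired output: grouped by value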
Currently, I'm using Parsec to read in the pairs as a list of tuples (K,[V])
(using lazy IO so I can process it as a stream while Parsec is processing the input file), where:
newtype K = K Text deriving (Show, Eq, Ord, Hashable)
newtype V = V Text deriving (Show, Eq, Ord, Hashable)
tupleParser :: Parser (K,[V])
tupleParser = ...
data ErrList e a = Cons a (ErrList e a) | End | Err e

parseAllFromFile :: Parser a -> FilePath -> IO (ErrList ParseError a)
parseAllFromFile parser inputFile = do
  contents <- readFile inputFile
  let Right initialState = parse getParserState inputFile contents
  return $ loop initialState
  where
    -- Run the parser repeatedly, threading the parser state through,
    -- producing results lazily as an ErrList.
    loop state = case unconsume $ runParsecT parser' state of
      Error err              -> Err err
      Ok Nothing  _      _   -> End
      Ok (Just a) state' _   -> a `Cons` loop state'
    unconsume v = runIdentity $ case runIdentity v of
      Consumed ma -> ma
      Empty    ma -> ma
    -- Nothing signals that we've reached end of input.
    parser' = (Just <$> parser) <|> (const Nothing <$> eof)
I've tried to insert the tuples into a Data.HashMap.Map V [K]
to transpose the association:
transpose :: ErrList ParseError (K,[V]) -> Either ParseError [(V,[K])]
transpose = transpose' M.empty
  where
    transpose' _ (Err e)          = Left e
    transpose' m End              = Right $ assocs m
    transpose' m (Cons (k,vs) xs) = transpose' (L.foldl' (include k) m vs) xs
    -- For each value, cons the key onto that value's bucket.
    include k m v = M.insertWith (const (k:)) v [k] m
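(For context, a driver tying these pieces together would look roughly like the following; the input path and the output step are just placeholders.)

main :: IO ()
main = do
  pairs <- parseAllFromFile tupleParser "input.txt"   -- lazy stream of (K,[V]) chunks
  case transpose pairs of
    Left err  -> print err
    Right vks -> mapM_ print vks                      -- emit the value-grouped pairs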
But when I tried it, I got the error:
memory allocation failed (requested 2097152 bytes)
I can think of a couple of things I might be doing wrong. I'm leaning toward (1) for now, but I'm not sure by any means.
Is there a possibility that the data will grow? If so, I'd suggest not reading the whole file into memory, and processing the data in another way.
One simple possibility is to use a relational database for this. It would be fairly easy: just load your data in, create a proper index, and have it sorted the way you need. The database will do all the work for you. I'd definitely recommend this.
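For example, a rough sketch using sqlite-simple (the library choice, file names, schema, and plain-String pairs are all assumptions made here for illustration):

{-# LANGUAGE OverloadedStrings #-}
import Database.SQLite.Simple
import System.IO

-- Load the pairs into SQLite, index by value, then stream them back out
-- ordered by value so all keys for a value come out together.
transposeViaSqlite :: [(String, String)] -> IO ()
transposeViaSqlite pairs = do
  conn <- open "pairs.db"
  execute_ conn "CREATE TABLE IF NOT EXISTS pairs (k TEXT NOT NULL, v TEXT NOT NULL)"
  -- (Wrapping the bulk insert in withTransaction would make it much faster.)
  executeMany conn "INSERT INTO pairs (k, v) VALUES (?, ?)" pairs
  execute_ conn "CREATE INDEX IF NOT EXISTS pairs_by_value ON pairs (v)"
  withFile "transposed.txt" WriteMode $ \h ->
    -- fold_ streams the sorted rows one at a time, so nothing large is held in memory.
    fold_ conn "SELECT v, k FROM pairs ORDER BY v" () $ \() (v, k) ->
      hPutStrLn h (v ++ "\t" ++ k)
  close conn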
Another option would be to create your own file-based mechanism. For example:

- Pick some amount of data l that is reasonable to load into memory.
- Create n = d `div` l files, where d is the total amount of your data. (Hopefully this will not exceed your file descriptor limit. You could also close and reopen files after each operation, but this will make the process very slow.)
- Write each pair (k,v) into file number hash v `mod` n. This ensures that pairs with the same value v end up in the same file.
- Process each file separately in memory.

It is essentially a hash table with file buckets. This solution assumes that each value has roughly the same number of keys (otherwise some files could get exceptionally large).
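Here is a minimal sketch of that scheme, assuming the pairs arrive as a lazy list of plain String key/value pairs; the bucket count, file names, and tab-separated line format are made up for illustration:

import Data.Hashable (hash)
import Data.List (foldl')
import qualified Data.Map as Map
import System.IO

-- Phase 1: scatter each (key, value) pair into one of n bucket files,
-- chosen by hashing the value, so all keys for a value land in the same bucket.
scatter :: Int -> [(String, String)] -> IO ()
scatter n pairs = do
  handles <- mapM (\i -> openFile (bucketName i) WriteMode) [0 .. n - 1]
  mapM_ (\(k, v) -> hPutStrLn (handles !! (hash v `mod` n)) (v ++ "\t" ++ k)) pairs
  mapM_ hClose handles

-- Phase 2: each bucket is small enough to group in memory.
gatherBucket :: Int -> IO [(String, [String])]
gatherBucket i = do
  ls <- lines <$> readFile (bucketName i)
  let add m line = let (v, rest) = break (== '\t') line
                   in Map.insertWith (++) v [drop 1 rest] m
  return $ Map.toList (foldl' add Map.empty ls)

bucketName :: Int -> FilePath
bucketName i = "bucket-" ++ show i ++ ".txt"

After the scatter phase, each bucket file can be loaded and grouped independently, so peak memory is roughly one bucket rather than the whole data set.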
You could also implement an external sort which would allow you to sort basically any amount of data.
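A bare-bones sketch of an external merge sort over lines follows; the chunk size, temp-file names, and line layout are arbitrary, and in practice you would also delete the run files afterwards. For this problem each line could be "value<TAB>key", so sorting the lines groups keys by value:

import Data.List (sort)
import System.IO

-- Split the input into chunks that fit in memory, sort each chunk into a
-- temporary "run" file, then lazily merge all runs into the output.
externalSortLines :: Int -> FilePath -> FilePath -> IO ()
externalSortLines chunkSize input output = do
  ls <- lines <$> readFile input
  let chunks = chunksOf chunkSize ls
  runFiles <- mapM writeRun (zip [0 ..] chunks)
  runs <- mapM (fmap lines . readFile) runFiles   -- lazy reads, one line at a time
  writeFile output (unlines (foldr mergeRuns [] runs))
  where
    writeRun (i, chunk) = do
      let name = "run-" ++ show (i :: Int) ++ ".txt"
      writeFile name (unlines (sort chunk))
      return name

chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (a, b) = splitAt n xs in a : chunksOf n b

-- Merge two sorted lists; folding this over all runs gives one sorted stream.
mergeRuns :: Ord a => [a] -> [a] -> [a]
mergeRuns [] ys = ys
mergeRuns xs [] = xs
mergeRuns (x:xs) (y:ys)
  | x <= y    = x : mergeRuns xs (y:ys)
  | otherwise = y : mergeRuns (x:xs) ys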