
Memory failure when transposing [(K,[V])] to [(V,[K])]

I've got a 279MB file that contains ~10 million key/value pairs, with ~500,000 unique keys. It's grouped by key (each key only needs to be written once), so all the values for a given key are together.

What I want to do is transpose the association: create a file where the pairs are grouped by value, so that all the keys for a given value are stored together.

Currently, I'm using Parsec to read in the pairs as a list of tuples (K,[V]) (using lazy IO so I can process it as a stream while Parsec is processing the input file), where:

newtype K = K Text deriving (Show, Eq, Ord, Hashable)
newtype V = V Text deriving (Show, Eq, Ord, Hashable)

tupleParser :: Parser (K,[V])
tupleParser = ...

data ErrList e a = Cons a (ErrList e a) | End | Err e

-- Stream the parse results one at a time so the file can be processed lazily.
parseAllFromFile :: Parser a -> FilePath -> IO (ErrList ParseError a)
parseAllFromFile parser inputFile = do
  contents <- readFile inputFile
  let Right initialState = parse getParserState inputFile contents
  return $ loop initialState
  where loop state = case unconsume $ runParsecT parser' state of
                        Error err             -> Err err
                        Ok Nothing _ _        -> End
                        Ok (Just a) state' _  -> a `Cons` loop state'
        unconsume v = runIdentity $ case runIdentity v of
                                      Consumed ma -> ma
                                      Empty ma -> ma
        parser' = (Just <$> parser) <|> (const Nothing <$> eof)

I've tried to insert the tuples into a Data.HashMap.Map V [K] to transpose the association:

transpose :: ErrList ParseError (K,[V]) -> Either ParseError [(V,[K])]
transpose = transpose' M.empty
  where transpose' _ (Err e)          = Left e
        transpose' m End              = Right $ assocs m
        transpose' m (Cons (k,vs) xs) = transpose' (L.foldl' (include k) m vs) xs
        -- prepend k to whatever keys are already stored for v
        include k m v = M.insertWith (const (k:)) v [k] m

But when I tried it, I got the error:

memory allocation failed (requested 2097152 bytes)

I could think of a couple things I'm doing wrong:

  1. 2MB seems a bit low (considerably less than the 2GB RAM my machine has installed), so maybe I need to tell GHC it's ok to use more? (See the note after this list.)
  2. My problems could be algorithmic/data structure related. Maybe I'm using the wrong tools for the job?
  3. My attempt to use lazy IO could be coming back to bite me.

I'm leaning toward (1) for now, but I'm not sure by any means.
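
For reference, one way to test (1) is GHC's runtime options (the file and program names here are made up): compile with -rtsopts, then pass an explicit heap limit and ask for statistics:

ghc -O2 -rtsopts Transpose.hs
./Transpose +RTS -s -M1500m -RTS input.txt

-s prints a heap/GC summary on exit and -M sets the maximum heap size (by default GHC doesn't cap the heap at all).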

asked Dec 06 '12 by rampion

1 Answer

Is there a possibility that the data will grow? If so, I'd suggest not reading the whole file into memory and processing the data in another way instead.

One simple possibility is to use a relational database for that. This'd be fairly easy - just load your data in, create a proper index and get it sorted as you need. The database will do all the work for you. I'd definitely recommend this.
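
A minimal sketch of that route, using sqlite-simple (my choice of library; any relational database would do) and assuming the pairs have already been parsed into [(Text, Text)]:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Database.SQLite.Simple
import System.IO

-- Load the (key, value) pairs into SQLite, index by value, then stream the
-- rows back out ordered by value and write them to the output file.
transposeViaSqlite :: FilePath -> [(T.Text, T.Text)] -> IO ()
transposeViaSqlite outFile pairs = do
  conn <- open "pairs.db"
  execute_ conn "CREATE TABLE IF NOT EXISTS kv (k TEXT, v TEXT)"
  withTransaction conn $                       -- one transaction: much faster bulk load
    mapM_ (execute conn "INSERT INTO kv (k, v) VALUES (?, ?)") pairs
  execute_ conn "CREATE INDEX IF NOT EXISTS kv_by_v ON kv (v)"
  withFile outFile WriteMode $ \h ->
    fold_ conn "SELECT v, k FROM kv ORDER BY v" () $ \() (v, k) ->
      TIO.hPutStrLn h (v <> "\t" <> k)         -- rows arrive already grouped by value
  close conn

If the pairs come straight from the existing streaming parser, the Haskell side never has to hold the whole association in memory; the database does the grouping and sorting.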


Another option would be to create your own file-based mechanism. For example:

  1. Choose some limit l that is reasonable to load into memory.
  2. Create n = d `div` l files, where d is the total amount of your data. (Hopefully this will not exceed your file descriptor limit. You could also close and reopen files after each operation, but this will make the process very slow.)
  3. Process the input sequentially and place each pair (k,v) into file number hash v `mod` n. This ensures that the pairs with the same value v will end up in the same file. (See the sketch below.)
  4. Process each file separately.
  5. Merge them together.

It is essentially a hash table with file buckets. This solution assumes that each value has roughly the same number of keys (otherwise some files could get exceptionally large).
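
A minimal sketch of steps 1 to 3 in Haskell, assuming one tab-separated "key<TAB>value" pair per input line (the file layout and names are made up for illustration):

import Control.Monad (forM_)
import Data.Hashable (hash)
import System.IO

-- Scatter each "key<TAB>value" line into one of n bucket files, chosen by
-- hashing the value, so all keys for a given value land in the same bucket.
bucketPairs :: Int -> FilePath -> IO ()
bucketPairs n input = do
  handles <- mapM (\i -> openFile ("bucket-" ++ show i ++ ".txt") WriteMode)
                  [0 .. n - 1]
  ls <- lines <$> readFile input               -- lazy IO, consumed line by line
  forM_ ls $ \line -> do
    let (_k, v) = break (== '\t') line         -- v keeps its leading tab, which is harmless
        i       = hash v `mod` n               -- same value, same bucket
    hPutStrLn (handles !! i) line
  mapM_ hClose handles

Each bucket should then be small enough to transpose in memory with the HashMap code from the question, and the per-bucket results can simply be concatenated (step 5).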


You could also implement an external sort which would allow you to sort basically any amount of data.
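
A rough sketch of that idea, assuming the pairs have been written one per line with the value first (e.g. "value<TAB>key") so that plain lexicographic line order groups them by value; the run-file names and chunk handling are made up:

import Data.List (sort)

-- Phase 1: cut the input into chunks that fit in memory, sort each chunk,
-- and write it out as its own sorted "run" file.
writeSortedRuns :: Int -> FilePath -> IO [FilePath]
writeSortedRuns chunkSize input = do
  ls <- lines <$> readFile input
  mapM writeRun (zip [0 :: Int ..] (chunksOf chunkSize ls))
  where
    chunksOf _ [] = []
    chunksOf n xs = let (a, b) = splitAt n xs in a : chunksOf n b
    writeRun (i, chunk) = do
      let path = "run-" ++ show i ++ ".txt"
      writeFile path (unlines (sort chunk))
      return path

-- Phase 2: merge the sorted runs into the final file; for a modest number of
-- runs a fold of two-way merges is good enough.
mergeRuns :: [FilePath] -> FilePath -> IO ()
mergeRuns runs out = do
  contents <- mapM readFile runs               -- lazy IO: one open handle per run
  writeFile out (unlines (foldr merge [] (map lines contents)))
  where
    merge xs [] = xs
    merge [] ys = ys
    merge (x:xs) (y:ys)
      | x <= y    = x : merge xs (y:ys)
      | otherwise = y : merge (x:xs) ys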

answered Nov 16 '22 by Petr