Limiting memory usage when reading files

I'm a Haskell beginner and thought this would be a good exercise. I have an assignment where I need to read a file in thread A, handle the file's lines in threads B_i, and then output the results in thread C.

I have implemented this much already, but one of the requirements is that we cannot trust that the entire file fits into memory. I was hoping that lazy IO and the garbage collector would handle this for me, but the memory usage keeps rising and rising.

The reader thread (A) reads the file with readFile, zips the lines with line numbers, and wraps each in Just. These zipped lines are then written to a Control.Concurrent.Chan; each consumer thread B has its own channel.

Each consumer reads its own channel when it has data, and if the regex matches, the line is written to its respective output channel (an MVar holding a list), wrapped in a Maybe.

The printer checks the head of the output channel of each B thread. If none of those results is Nothing, the line is printed. Since at that point there should be no remaining reference to the older lines, I thought the garbage collector would be able to release them, but apparently I'm wrong about that.

The .lhs file is here: http://gitorious.org/hajautettujen-sovellusten-muodostamistekniikat/hajautettujen-sovellusten-muodostamistekniikat/blobs/master/mgrep.lhs

So the question is: how do I limit the memory usage, or allow the garbage collector to remove the lines?

Snippets as requested. Hopefully the indentation isn't too badly destroyed :)

data Global = Global {done :: MVar Bool, consumers :: Consumers}
type Done = Bool
type Linenum = Int
type Line = (Linenum, Maybe String)   -- a numbered line; the payload may be absent
type Output = MVar [Line]             -- per-consumer buffer of results
type Input = Chan Line                -- per-consumer input channel
type Consumers = MVar (M.Map ThreadId (Done, (Input, Output)))
type State a = ReaderT Global IO a


producer :: [Input] -> FilePath -> State ()
producer c p = do
  liftIO $ Main.log "Starting producer"
  d <- asks done
  f <- liftIO $ readFile p
  -- Broadcast every numbered line to each consumer's input channel.
  mapM_ (\l -> mapM_ (liftIO . flip writeChan l) c)
    $ zip [1..] $ map Just $ lines f
  liftIO $ modifyMVar_ d (return . not)   -- flip the shared "done" flag

printer :: State ()
printer = do
  liftIO $ Main.log "Starting printer"
  c <- fmap (map (snd . snd) . M.elems)
         (asks consumers >>= liftIO . readMVar)
  uniq' c
  where
    head' :: Output -> IO Line
    head' ch = fmap head (readMVar ch)

    -- Drop the head line from every output buffer.
    tail' = mapM_ (liftIO . flip modifyMVar_ (return . tail))

    cont ch = tail' ch >> uniq' ch

    -- Print the head line of the first output buffer.
    printMsg ch = readMVar (head ch) >>=
      liftIO . putStrLn . fromJust . snd . head

    -- True when any output buffer is empty.
    cempty :: [Output] -> IO Bool
    cempty ch = fmap (any id)
      (mapM (fmap ((==) 0 . length) . readMVar) ch)

    {- True when any of the head results is Nothing -}
    uniq :: [Output] -> IO Bool
    uniq ch = fmap (any id . map (isNothing . snd))
      (mapM (liftIO . head') ch)

    uniq' :: [Output] -> State ()
    uniq' ch = do
      d <- consumersDone
      e <- liftIO $ cempty ch
      if not e
        then do
          u <- liftIO $ uniq ch
          if u
            then cont ch
            else do
              liftIO $ printMsg ch
              cont ch
        else unless d $ uniq' ch
asked Sep 19 '10 by Masse



1 Answer

Concurrent programming offers no defined execution order unless you enforce one yourself with MVars and the like. So it's likely that the producer thread sticks all or most of the lines into the Chan before any consumer reads them off and passes them on. Another architecture that should fit the requirements is to have thread A call the lazy readFile and stick the result in an MVar. Then each consumer thread takes the MVar, reads a line, and replaces the MVar before proceeding to handle the line. Even then, if the output thread can't keep up, the number of matching lines stored in its channel can build up arbitrarily.
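
A minimal sketch of that shared-MVar hand-off (the file name, worker count, and the putStrLn stand-in for real work are made up for illustration): the MVar holds the not-yet-consumed remainder of the lazily read file, so each worker atomically takes one line and puts the rest back, and only the unconsumed suffix stays reachable.

import Control.Concurrent
import Control.Monad (replicateM_)

-- Take the remaining lines, keep one, put the rest back.
workerLoop :: MVar [String] -> (String -> IO ()) -> IO ()
workerLoop src handle = do
  rest <- takeMVar src
  case rest of
    []       -> putMVar src []   -- leave the end marker for the other workers
    (l : ls) -> do
      putMVar src ls             -- release the remainder immediately
      handle l
      workerLoop src handle

main :: IO ()
main = do
  src  <- newMVar . lines =<< readFile "input.txt"   -- lazy read
  done <- newEmptyMVar
  let nWorkers = 4 :: Int
  mapM_ (\i -> forkIO $ do
            workerLoop src (\l -> putStrLn (show i ++ ": " ++ l))
            putMVar done ())
        [1 .. nWorkers]
  replicateM_ nWorkers (takeMVar done)   -- wait for all workers to finish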

What you have is a push architecture. To really make it work in constant space, think in terms of demand-driven processing: find a mechanism by which the output thread signals the processing threads that they should do something, and the processing threads in turn signal the reader thread.
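
One cheap way to express that demand is a token MVar: the downstream deposits a () when it is ready for more, and the upstream blocks until the token appears, so at most one item is ever in flight. A two-stage sketch (the function name and input file are hypothetical):

import Control.Concurrent

demandPipeline :: [String] -> IO ()
demandPipeline xs = do
  demand <- newMVar ()              -- one token of initial demand
  slot   <- newEmptyMVar            -- the single in-flight item
  _ <- forkIO $ do
         mapM_ (\x -> takeMVar demand >> putMVar slot (Just x)) xs
         takeMVar demand
         putMVar slot Nothing       -- end-of-input marker
  let consume = do
        item <- takeMVar slot
        case item of
          Nothing -> return ()
          Just x  -> do
            putStrLn x              -- stand-in for real processing
            putMVar demand ()       -- request the next item
            consume
  consume

main :: IO ()
main = demandPipeline . lines =<< readFile "input.txt"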

Another way to do this is to use channels of bounded size instead, so that the reader thread blocks when the processor threads haven't caught up, and the processor threads block when the output thread hasn't caught up.
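
The stm package's TBQueue (added to stm well after this question was asked) gives exactly that behaviour: writeTBQueue blocks when the queue is full and readTBQueue blocks when it is empty, so back-pressure comes for free. A single-reader sketch, with a hypothetical input file:

import Control.Concurrent (forkIO)
import Control.Concurrent.STM

main :: IO ()
main = do
  q <- newTBQueueIO 64               -- writers block once 64 lines are buffered
  _ <- forkIO $ do
         ls <- lines <$> readFile "input.txt"
         mapM_ (atomically . writeTBQueue q . Just) ls
         atomically (writeTBQueue q Nothing)   -- end-of-input marker
  let drain = do
        m <- atomically (readTBQueue q)
        case m of
          Nothing -> return ()
          Just l  -> putStrLn l >> drain
  drain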

As a whole, the problem reminds me of Tim Bray's Wide Finder benchmark, although the requirements are somewhat different. In any case, it led to a widespread discussion of the best way to implement a multicore grep. The big punchline was that the problem is I/O-bound, and you want multiple reader threads over mmapped files.

See here for more than you'll ever want to know: http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder

answered Oct 03 '22 by sclv