
Operating on parsed data with attoparsec

Background

I've written a logfile parser using attoparsec. All my smaller parsers succeed, as does the composed final parser. I've confirmed this with tests. But I'm stumbling over performing operations with the parsed stream.

What I've tried

I started by trying to pass the successfully parsed input to a function. But all the function seems to get is Done (), which I presume means the logfile has already been consumed by that point.

prepareStats :: Result Log -> IO ()
prepareStats r =
  case r of
    Fail _ _ _ -> putStrLn "Parsing failed"
    Done _ parsedLog -> putStrLn "Success" -- This now has a [LogEntry] array. Do something with it.

main :: IO ()
main = do
  [f] <- getArgs
  logFile <- B.readFile (f :: FilePath)
  let results = parseOnly parseLog logFile
  putStrLn "TBC"

What I'm trying to do

I want to accumulate some stats from the logfile as I consume the input. For example, I'm parsing response codes and I'd like to count how many 2** responses there were and how many 4/5** ones. I'm parsing the number of bytes each response returned as Ints, and I'd like to efficiently sum these (sounds like a foldl'?). I've defined a data type like this:

data Stats = Stats {
    successfulRequestsPerMinute :: Int
  , failingRequestsPerMinute    :: Int
  , meanResponseTime            :: Int
  , megabytesPerMinute          :: Int
  } deriving Show

And I'd like to update that continuously as I parse the input. But performing operations as I consume the input is where I got stuck. So far print is the only function I've successfully passed output to, and it showed that parsing succeeds: the result is Done, followed by the parsed output.
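A minimal sketch of the kind of strict update step described here, for illustration only. Entry, Totals, step and summarise are hypothetical names, and Entry is a cut-down stand-in for LogEntry that carries only the status code and response size. The per-minute and mean figures in Stats would presumably be derived from totals like these afterwards.

```haskell
import Data.List (foldl')

-- Hypothetical, simplified view of a parsed entry: only the fields used here.
data Entry = Entry
  { entryStatus :: !Int  -- HTTP response code
  , entryBytes  :: !Int  -- response size in bytes
  }

-- Running totals. Strict fields keep the accumulator from building up thunks.
data Totals = Totals
  { okCount   :: !Int  -- 2xx responses
  , failCount :: !Int  -- 4xx and 5xx responses
  , byteTotal :: !Int  -- sum of all response sizes
  } deriving (Show, Eq)

-- Update the totals with one entry.
step :: Totals -> Entry -> Totals
step (Totals ok bad total) (Entry status size)
  | status >= 200 && status < 300 = Totals (ok + 1) bad total'
  | status >= 400 && status < 600 = Totals ok (bad + 1) total'
  | otherwise                     = Totals ok bad total'
  where
    total' = total + size

-- Strict left fold over already-parsed entries.
summarise :: [Entry] -> Totals
summarise = foldl' step (Totals 0 0 0)
```

A step function of this shape is what would need to be called once per parsed line, rather than once on the finished list.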

My main parser(s) look like this:

parseLogEntry :: Parser LogEntry
parseLogEntry = do
  ip <- logItem
  _ <- char ' '
  logName <- logItem
  _ <- char ' '
  user <- logItem
  _ <- char ' '
  time <- datetimeLogItem
  _ <- char ' '
  firstLogLine <- quotedLogItem
  _ <- char ' '
  finalRequestStatus <- intLogItem
  _ <- char ' '
  responseSizeB <- intLogItem
  _ <- char ' '
  timeToResponse <- intLogItem
  return $ LogEntry ip logName user time firstLogLine finalRequestStatus responseSizeB timeToResponse

type Log = [LogEntry]

parseLog :: Parser Log
parseLog = many $ parseLogEntry <* endOfLine

Desired outcome

I want to pass each parsed line to a function that will update the above data type. Ideally I want this to be very memory efficient because it'll be operating on large files.

asked Sep 26 '22 19:09 by Garry Cairns


1 Answer

You have to make your unit of parsing a single log entry rather than a list of log entries.

It's not pretty, but here is an example of how to interleave parsing and processing:

(Depends on bytestring, attoparsec and mtl)

{-# LANGUAGE NoMonomorphismRestriction, FlexibleContexts #-}

import qualified Data.ByteString.Char8 as BS
import qualified Data.Attoparsec.ByteString.Char8 as A
import Data.Attoparsec.ByteString.Char8 hiding (takeWhile)
import Data.Char
import Data.Maybe (fromMaybe)  -- used by loop2 in the Update section
import Control.Monad.State.Strict

aWord :: Parser BS.ByteString
aWord = skipSpace >> A.takeWhile isAlphaNum

getNext :: MonadState [a] m => m (Maybe a)
getNext = do
  xs <- get
  case xs of
    [] -> return Nothing
    (y:ys) -> put ys >> return (Just y)

loop iresult =
  case iresult of
    Fail _ _ msg  -> error $ "parse failed: " ++ msg
    Done x' aword -> do lift $ process aword; loop (parse aWord x')
    Partial _     -> do
      mx <- getNext
      case mx of
        Just y  -> loop (feed iresult y)
        Nothing -> case feed iresult BS.empty of
                     Fail _ _ msg  -> error $ "parse failed: " ++ msg
                     Done x' aword -> do lift $ process aword; return ()
                     Partial _     -> error $ "partial returned"  -- probably can't happen

process :: Show a => a -> IO ()
process w = putStrLn $ "got a word: " ++ show w

theWords = map BS.pack [ "this is a te", "st of the emergency ", "broadcasting sys", "tem"]


main = runStateT (loop (Partial (parse aWord))) theWords

Notes:

  • We parse one aWord at a time and call process after each word is recognized.
  • Use feed to feed the parser more input when it returns a Partial.
  • Feed the parser an empty string when there is no more input left.
  • When Done is returned, process the recognized word and continue with parse aWord.
  • getNext is just an example of a monadic function which gets the next unit of input. Replace it with your own version, e.g. something that reads the next line from a file.
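The steps in these notes can also be packaged as a pure fold over a list of chunks. This is a sketch under assumptions, not part of the original example: foldParsed, its local helpers and lineInt are hypothetical names, the chunk list stands in for reads from a file, and the parser is assumed to consume at least one byte whenever it succeeds (otherwise the loop would not terminate).

```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as BS
import Data.Attoparsec.ByteString.Char8
  (IResult (..), Parser, decimal, endOfLine, feed, parse)

-- Hypothetical example parser: one integer per line.
lineInt :: Parser Int
lineInt = decimal <* endOfLine

-- Fold a step function over every value the parser yields, pulling input
-- chunk by chunk. The accumulator is forced at each step.
foldParsed :: Parser a -> (b -> a -> b) -> b -> [BS.ByteString] -> Either String b
foldParsed p step = start
  where
    -- No parse in progress: begin a new one only if input remains.
    start !acc []       = Right acc
    start !acc (c : cs)
      | BS.null c = start acc cs
      | otherwise = go acc cs (parse p c)

    go !acc cs r = case r of
      Fail _ _ msg -> Left msg
      -- One value recognized: process it, then continue with the leftover.
      Done rest x  -> start (step acc x) (rest : cs)
      Partial _    -> case cs of
        (c : cs')
          | BS.null c -> go acc cs' r            -- skip empty chunks
          | otherwise -> go acc cs' (feed r c)   -- feed more input
        -- No chunks left: an empty string signals end of input.
        [] -> case feed r BS.empty of
          Fail _ _ msg -> Left msg
          Done rest x  -> start (step acc x) [rest]
          Partial _    -> Left "parser wanted more input after end of input"
```

For example, foldParsed lineInt (+) 0 chunks sums one integer per line, even when a number is split across two chunks.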

Update

Here is a solution using parseWith as @dfeuer suggested:

noMoreInput = fmap null get

loop2 x = do
  iresult <- parseWith (fmap (fromMaybe BS.empty) getNext) aWord x
  case iresult of
    Fail _ _ msg  -> error $ "parse failed: " ++ msg
    Done x' aword -> do lift $ process aword;
                        if BS.null x'
                           then do b <- noMoreInput
                                   if b then return ()
                                        else loop2 x'
                           else loop2 x'
    Partial _     -> error $ "huh???" -- this really can't happen

main2 = runStateT (loop2 BS.empty) theWords
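
For a real file, the refill action passed to parseWith can read straight from a Handle. A hedged sketch along those lines: foldHandle, foldFile and lineInt are hypothetical names, the 64 KiB chunk size is arbitrary, and lineInt stands in for something like the question's parseLogEntry <* endOfLine. It assumes the parser consumes input whenever it succeeds.

```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as BS
import Data.Attoparsec.ByteString.Char8
  (IResult (..), Parser, decimal, endOfLine, parseWith)
import System.IO (Handle, IOMode (ReadMode), hIsEOF, withFile)

-- Hypothetical example parser: one integer per line.
lineInt :: Parser Int
lineInt = decimal <* endOfLine

-- Parse one value at a time from a Handle, folding each into a strict
-- accumulator so only the current chunk is held in memory.
foldHandle :: Parser a -> (b -> a -> b) -> b -> Handle -> IO (Either String b)
foldHandle p step acc0 h = go acc0 BS.empty
  where
    -- hGetSome returns an empty string at end of file, which is how
    -- parseWith expects the end of input to be signalled.
    refill = BS.hGetSome h 65536

    go !acc leftover = do
      atEnd <- if BS.null leftover then hIsEOF h else return False
      if atEnd
        then return (Right acc)
        else do
          r <- parseWith refill p leftover
          case r of
            Done rest x  -> go (step acc x) rest
            Fail _ _ msg -> return (Left msg)
            Partial _    -> return (Left "unexpected Partial from parseWith")

-- Convenience wrapper that opens the file and runs the fold.
foldFile :: Parser a -> (b -> a -> b) -> b -> FilePath -> IO (Either String b)
foldFile p step acc0 path = withFile path ReadMode (foldHandle p step acc0)
```

Here the step argument plays the role of process, except that it returns an updated accumulator instead of printing.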

answered Sep 30 '22 07:09 by ErikR