
Parsing Printable Text File in Haskell

I'm trying to figure out the "right" way to parse a particular text file in Haskell.

In F#, I loop over each line, testing it against a regular expression to determine if it's a line I want to parse, and then if it is, I parse it using the regular expression. Otherwise, I ignore the line.

The file is a printable report, with headers on each page. Each record is one line, and each field is separated by two or more spaces. Here's an example:

                                                    MY COMPANY'S NAME
                                                     PROGRAM LISTING
                                             STATE:  OK     PRODUCT: ProductName
                                                 (DESCRIPTION OF REPORT)
                                                    DATE:   11/03/2013

  This is the first line of a two-line description of the contents of this report. The description, as noted,
  spans two lines. This is more text. I'm running out of things to write. Blah.

          DIVISION CODE: 3     XYZ CODE: FAA3   AGENT CODE: 0007                                       PAGE NO:  1

 AGENT    TARGET NAME                      ST   UD   TARGET#   XYZ#   X-DATE       YEAR    CO          ENCODING
 -----    ------------------------------   --   --   -------   ----   ----------   ----    ----------  ----------

 0007     SMITH, JOHN                      43   3    1234567   001    12/06/2013   2004    ABC         SIZE XL
 0007     SMITH, JANE                      43   3    2345678   001    12/07/2013   2005    ACME        YELLOW
 0007     DOE, JOHN                        43   3    3456789   004    12/09/2013   2008    MICROSOFT   GREEN
 0007     DOE, JANE                        43   3    4567890   002    12/09/2013   2007    MICROSOFT   BLUE
 0007     BORGES, JORGE LUIS               43   3    5678901   001    12/09/2013   2008    DUFEMSCHM   Y1500
 0007     DEWEY, JOHN &                    43   3    6789012   003    12/11/2013   2013    ERTZEVILI   X1500
 0007     NIETZSCHE, FRIEDRICH             43   3    7890123   004    12/11/2013   2006    NCORPORAT   X7

I first built the parser to test each line to see whether it was a record. If it was, I just cut the line up by character position with my home-grown substring function. This works just fine.
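
For reference, here is a stripped-down sketch of roughly what that looks like (illustrative only; this isn't my actual code, and the column offsets are just read off the sample above):

-- Illustrative stand-in for my home-grown substring: len characters from pos.
import Data.Char (isDigit)

substring :: Int -> Int -> String -> String
substring pos len = take len . drop pos

-- A record line has a digit in its second column; header lines don't.
isRecordLine :: String -> Bool
isRecordLine line = case drop 1 line of
                      (c:_) -> isDigit c
                      _     -> False

-- Slice a record line into its fields by character position, then trim.
sliceRecord :: String -> [String]
sliceRecord line = [ trim (substring pos len line) | (pos, len) <- columns ]
  where
    columns = [(1,8), (10,33), (43,4), (48,4), (53,9), (63,6), (70,12), (83,7), (91,12), (103,10)]
    trim    = dropWhile (== ' ') . reverse . dropWhile (== ' ') . reverse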

Then I discovered that I did, indeed, have a regular expression library in my Haskell installation, so I decided to try using regular expressions like I do in F#. That failed miserably, as the library rejects perfectly valid regular expressions.
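
For the record, this is roughly the sort of thing I was attempting, written against the Text.Regex.Posix mentioned in the edit below (illustrative only; that library speaks POSIX extended syntax, so it wants [0-9] and " {2,}" rather than Perl-style \d and \s, which may be part of what tripped me up):

-- Illustrative sketch of the per-line regex approach using Text.Regex.Posix.
import Text.Regex.Posix ((=~))

-- A record line starts with a space, a four-digit agent number, then spaces.
looksLikeRecord :: String -> Bool
looksLikeRecord line = line =~ "^ [0-9]{4} {2,}"

-- Pull a couple of fields out of a record line with capture groups.
agentAndName :: String -> [String]
agentAndName line = groups
  where
    (_, _, _, groups) =
        line =~ "^ ([0-9]{4}) +(.*[^ ]) {2,}[0-9]{2} "
          :: (String, String, String, [String])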

Then I thought, What about Parsec? But the learning curve for using that is getting steeper the higher I climb, and I find myself wondering if it is the right tool for such a simple task as parsing this report.

So I thought I'd ask some Haskell experts: how would you go about parsing this kind of report? I'm not asking for code, though if you've got some, I'd love to see it. I'm really asking for technique or technology.

Thanks!

P.S. The output is just a colon-separated file, with a line of field names at the top followed by the records themselves, which the end user can import into Excel.

Edit:

Thank you all so much for the great comments and answers!

Because I didn't make it clear originally: The first fourteen lines of the example repeat for every page of (print) output, with the number of records varying per page from zero to a full page (looks like 45 records). I apologize for not making that clear earlier, as it will probably affect some of the answers already offered.

My Haskell installation is currently limited to Parsec (it doesn't have attoparsec), plus Text.Regex.Base and Text.Regex.Posix. I'll have to see about installing attoparsec and/or additional regex libraries. But for the time being, you've convinced me to keep at learning Parsec. Thank you for the very helpful code examples!

asked by Jeff Maner

1 Answer

This is definitely a job worthy of a parsing library. My primary goal is normally (i.e., for anything I intend to use more than once or twice) to get the data into a non-textual form ASAP, something like

module ReportParser where

import Prelude hiding (takeWhile)
import Data.Text hiding (takeWhile)

import Control.Applicative
import Data.Attoparsec.Text

data ReportHeaderData = Company Text
                      | Program Text
                      | State Text
--                    ...
                      | FieldNames [Text]
                      deriving (Show)   -- Show instances let parseTest (below) print parsed values

data ReportData = ReportData Int Text Int Int Int Int Date Int Text Text
                  deriving (Show)

data Date = Date Int Int Int
            deriving (Show)

and we can say, for the sake of argument, that a report is

data Report = Report [ReportHeaderData] [ReportData]

Now, I generally create a parser which is a function of the same name as the data type

-- Ending condition for a field
doubleSpace :: Parser Char
doubleSpace = space >> space

-- Clears leading spaces
clearSpaces :: Parser Text
clearSpaces = takeWhile (== ' ') -- Naively assumes no tabs

-- Throws away everything up to and including a newline character (naively assumes unix line endings)
clearNewline :: Parser ()
clearNewline = (anyChar `manyTill` char '\n') *> pure ()

-- Parse a date
date :: Parser Date
date = Date <$> decimal <*> (char '/' *> decimal) <*> (char '/' *> decimal)

-- Parse a report
reportData :: Parser ReportData
reportData = let f1 = clearSpaces *> decimal <* clearSpaces  -- record lines are indented in the sample
                 f2 = (pack <$> manyTill anyChar doubleSpace) <* clearSpaces
                 f3 = decimal <* clearSpaces
                 f4 = decimal <* clearSpaces
                 f5 = decimal <* clearSpaces
                 f6 = decimal <* clearSpaces
                 f7 = date <* clearSpaces
                 f8 = decimal <* clearSpaces
                 f9 = (pack <$> manyTill anyChar doubleSpace) <* clearSpaces
                 f10 = pack <$> manyTill anyChar (endOfLine <|> endOfInput)  -- last field runs to the line's end
             in ReportData <$> f1 <*> f2 <*> f3 <*> f4 <*> f5 <*> f6 <*> f7 <*> f8 <*> f9 <*> f10

By running one of the parse functions with a combinator such as many (and possibly feed, if you end up with a Partial result), you should end up with a list of ReportData values. You can then convert them to the colon-separated output described in the question with a function of your own; a rough sketch follows.
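
For instance (untested, and the helper names here are my own rather than anything standard), running the record parser with parseOnly, which never hands back a Partial, and dumping the rows in the colon-separated shape the question asks for might look like:

-- Untested sketch, continuing the module above.  It assumes the input handed
-- to it contains only record lines (see the note about headers below).
toColonLine :: ReportData -> Text
toColonLine (ReportData agent name st ud tgt xyz (Date m d y) yr co enc) =
    intercalate (pack ":")
        [ pack (show agent), name            -- note: decimal drops leading zeroes (0007 comes out as 7)
        , pack (show st), pack (show ud), pack (show tgt), pack (show xyz)
        , pack (show m ++ "/" ++ show d ++ "/" ++ show y)   -- no zero padding
        , pack (show yr), co, enc ]

reportToColonFile :: Text -> Either String Text
reportToColonFile input =
    case parseOnly (many reportData) input of
      Left err   -> Left err
      Right rows -> Right (intercalate (pack "\n") (headerLine : fmap toColonLine rows))
  where
    headerLine = pack "AGENT:TARGET NAME:ST:UD:TARGET#:XYZ#:X-DATE:YEAR:CO:ENCODING"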

Note that I didn't deal with the header. It should be relatively trivial to write code to parse it, and build a Report with e.g.

-- Not tested
parseReport = Report <$> (many reportHeader) <*> (many reportData)
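
To give a flavour (untested), a header-line parser can reuse the same helpers; here is a sketch for the "STATE: ... PRODUCT: ..." line from the sample, which keeps the state and drops the rest of the line:

-- Untested sketch of a single header-line parser.
stateLine :: Parser ReportHeaderData
stateLine = State . pack
        <$> (clearSpaces *> string (pack "STATE:") *> clearSpaces
                         *> manyTill anyChar doubleSpace)
        <*  clearNewline

A reportHeader would then just be a choice, via <|>, over parsers like this one; or, if you never need the header data, you can simply skip those lines instead of keeping them.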

Note that I prefer the Applicative form, but it's also possible to use the monadic form if you prefer (I did in doubleSpace). The Alternative class (from Control.Applicative) is also useful, for reasons implied by the name.
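
For instance, it is <|> from Alternative that lets the last record field above accept either a line ending or the end of the input:

-- A small Alternative example: succeed on a newline (or CRLF) or at end of input.
lineEnd :: Parser ()
lineEnd = endOfLine <|> endOfInput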

For playing with this, I highly recommend GHCi and the parseTest function. GHCi is just overall handy and a good way to test individual parsers, while parseTest takes a parser and an input string and prints the outcome of the run: the parsed value and any remaining input that was not consumed. Very useful when you're not quite sure what's going on.
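
For example, with the module above loaded (and the types deriving Show), a quick session might look like this; the exact rendering of the result will vary a little between attoparsec versions:

ghci> :load ReportParser
ghci> parseTest date (pack "12/06/2013\n")
-- prints something like:  Done "\n" (Date 12 6 2013)
ghci> parseTest reportData (pack " 0007     SMITH, JOHN   43   3    1234567   001    12/06/2013   2004    ABC   SIZE XL\n")
-- prints the parsed ReportData on success, or a Fail (or Partial) telling you
-- where things went wrong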

answered by Elliot Robinson