I'm trying to figure out the "right" way to parse a particular text file in Haskell.
In F#, I loop over each line, testing it against a regular expression to determine if it's a line I want to parse, and then if it is, I parse it using the regular expression. Otherwise, I ignore the line.
The file is a printable report, with headers on each page. Each record is one line, and each field is separated by two or more spaces. Here's an example:
MY COMPANY'S NAME
PROGRAM LISTING
STATE: OK PRODUCT: ProductName
(DESCRIPTION OF REPORT)
DATE: 11/03/2013
This is the first line of a a two-line description of the contents of this report. The description, as noted,
spans two lines. This is more text. I'm running out of things to write. Blah.
DIVISION CODE: 3 XYZ CODE: FAA3 AGENT CODE: 0007 PAGE NO: 1
AGENT  TARGET NAME                     ST  UD  TARGET#  XYZ#  X-DATE      YEAR  CO          ENCODING
-----  ------------------------------  --  --  -------  ----  ----------  ----  ----------  ----------
0007   SMITH, JOHN                     43  3   1234567  001   12/06/2013  2004  ABC         SIZE XL
0007   SMITH, JANE                     43  3   2345678  001   12/07/2013  2005  ACME        YELLOW
0007   DOE, JOHN                       43  3   3456789  004   12/09/2013  2008  MICROSOFT   GREEN
0007   DOE, JANE                       43  3   4567890  002   12/09/2013  2007  MICROSOFT   BLUE
0007   BORGES, JORGE LUIS              43  3   5678901  001   12/09/2013  2008  DUFEMSCHM   Y1500
0007   DEWEY, JOHN &                   43  3   6789012  003   12/11/2013  2013  ERTZEVILI   X1500
0007   NIETZSCHE, FRIEDRICH            43  3   7890123  004   12/11/2013  2006  NCORPORAT   X7
I first built the parser to test each line to see whether it is a record. If it is, I cut the line up by character position with my home-grown substring function. This works just fine.
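For illustration, the fixed-position approach could look roughly like the sketch below. It is not the original code: the names `columnWidths`, `gapWidth`, `sliceFields` and `toColonLine` are made up for this example, the widths are read off the dashed ruler line of the sample, and the two-space gap between columns is an assumption.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd, intercalate)

-- Column widths taken from the dashed ruler line of the sample report.
columnWidths :: [Int]
columnWidths = [5, 30, 2, 2, 7, 4, 10, 4, 10, 10]

-- Assumed width of the gap between two columns.
gapWidth :: Int
gapWidth = 2

-- Remove leading and trailing whitespace from a field.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- Cut a line into trimmed fields by character position.
sliceFields :: [Int] -> String -> [String]
sliceFields [] _ = []
sliceFields (w : ws) line =
  let (field, rest) = splitAt w line
  in trim field : sliceFields ws (drop gapWidth rest)

-- Join the fields into one colon-separated output line.
toColonLine :: [String] -> String
toColonLine = intercalate ":"
```

Something like `toColonLine (sliceFields columnWidths line)` would then turn one record line into one output line. Deciding which lines are records is left out here.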
Then I discovered that I did, indeed, have a regular expression library in my Haskell installation, so I decided to try using regular expressions like I do in F#. That failed miserably, as the library rejects perfectly valid regular expressions.
Then I thought, What about Parsec? But the learning curve for using that is getting steeper the higher I climb, and I find myself wondering if it is the right tool for such a simple task as parsing this report.
So I thought I'd ask some Haskell experts: how would you go about parsing this kind of report? I'm not asking for code, though if you've got some, I'd love to see it. I'm really asking for technique or technology.
Thanks!
P.s. The output is just a colon-separated file with a line of field names at the top of the file, followed by just the records, that can be imported into Excel for the end-user.
Edit:
Thank you all so much for the great comments and answers!
Because I didn't make it clear originally: The first fourteen lines of the example repeat for every page of (print) output, with the number of records varying per page from zero to a full page (looks like 45 records). I apologize for not making that clear earlier, as it will probably affect some of the answers already offered.
My Haskell system currently is limited to Parsec (it doesn't have attoparsec) and Text.Regex.Base and Text.Regex.Posix. I'll have to see about installing attoparsec and/or additional Regex libraries. But for the time being, you've convinced me to keep at learning Parsec. Thank you for the very helpful code examples!
This is definitely a job worthy of a parsing library. My primary goal is normally (i.e., for anything I intend to use more than once or twice) to get the data into a non-textual form ASAP, something like
module ReportParser where
import Prelude hiding (takeWhile)
import Data.Text hiding (takeWhile)
import Control.Applicative
import Data.Attoparsec.Text
data ReportHeaderData = Company Text
                      | Program Text
                      | State Text
                      -- ...
                      | FieldNames [Text]
data ReportData = ReportData Int Text Int Int Int Int Date Int Text Text
data Date = Date Int Int Int
and we can say, for the sake of argument, that a report is
data Report = Report [ReportHeaderData] [ReportData]
Now, I generally create a parser which is a function of the same name as the data type
-- Ending condition for a field
doubleSpace :: Parser Char
doubleSpace = space >> space
-- Clears leading spaces
clearSpaces :: Parser Text
clearSpaces = takeWhile (== ' ') -- Naively assumes no tabs
-- Throws away everything up to and including a newline character (naively assumes unix line endings)
clearNewline :: Parser ()
clearNewline = (anyChar `manyTill` char '\n') *> pure ()
-- Parse a date
date :: Parser Date
date = Date <$> decimal <*> (char '/' *> decimal) <*> (char '/' *> decimal)
-- Parse a single record (one line of the report)
reportData :: Parser ReportData
reportData = let f1  = decimal <* clearSpaces
                 f2  = (pack <$> manyTill anyChar doubleSpace) <* clearSpaces
                 f3  = decimal <* clearSpaces
                 f4  = decimal <* clearSpaces
                 f5  = decimal <* clearSpaces
                 f6  = decimal <* clearSpaces
                 f7  = date <* clearSpaces
                 f8  = decimal <* clearSpaces
                 f9  = (pack <$> manyTill anyChar doubleSpace) <* clearSpaces
                 -- The last field ends at the newline, not at a double space
                 f10 = takeTill (== '\n') <* clearNewline
             in ReportData <$> f1 <*> f2 <*> f3 <*> f4 <*> f5 <*> f6 <*> f7 <*> f8 <*> f9 <*> f10
By running one of the parse functions with a suitable combinator such as many (and possibly feed, if you end up with a Partial result), you should end up with a list of ReportData values. You can then convert them to CSV with a function of your own.
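As a rough sketch of that last step, the block below runs a record parser with `many` through `parseOnly` and renders each record as a colon-separated line. It restates the record parser in compact form so that it stands alone. The helpers `spaces`, `textField`, `showT`, `pad2`, `toColonLine` and `recordsToColonLines` are names invented for this example, and it assumes the input contains record lines only.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many, (<|>))
import Data.Attoparsec.Text
import Data.Text (Text)
import qualified Data.Text as T

data Date = Date Int Int Int
data ReportData = ReportData Int Text Int Int Int Int Date Int Text Text

-- Skip the padding between two fields.
spaces :: Parser ()
spaces = skipWhile (== ' ')

-- A free-text field runs until two consecutive spaces.
textField :: Parser Text
textField = T.pack <$> manyTill anyChar (string "  ")

date :: Parser Date
date = Date <$> decimal <*> (char '/' *> decimal) <*> (char '/' *> decimal)

reportData :: Parser ReportData
reportData = ReportData
  <$> (decimal <* spaces) <*> (textField <* spaces)
  <*> (decimal <* spaces) <*> (decimal <* spaces)
  <*> (decimal <* spaces) <*> (decimal <* spaces)
  <*> (date <* spaces)    <*> (decimal <* spaces)
  <*> (textField <* spaces)
  <*> (T.strip <$> takeTill isEndOfLine) <* (endOfLine <|> endOfInput)

showT :: Int -> Text
showT = T.pack . show

-- Render month and day with two digits again.
pad2 :: Int -> Text
pad2 = T.justifyRight 2 '0' . showT

-- One record as one colon-separated output line.
toColonLine :: ReportData -> Text
toColonLine (ReportData agent name st ud target xyz (Date m d y) year co enc) =
  T.intercalate ":"
    [ showT agent, name, showT st, showT ud, showT target, showT xyz
    , T.intercalate "/" [pad2 m, pad2 d, showT y], showT year, co, enc ]

-- Parse as many records as possible and render them.
recordsToColonLines :: Text -> Either String [Text]
recordsToColonLines input = map toColonLine <$> parseOnly (many reportData) input
```

One thing to watch for: because the numeric fields are stored as Int, leading zeros (0007, 001) are lost in the output. If they matter, keeping those fields as Text is probably the simpler choice.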
Note that I didn't deal with the header. It should be relatively trivial to write code to parse it, and build a Report with e.g.
-- Not tested
parseReport = Report <$> (many reportHeader) <*> (many reportData)
Note that I prefer the Applicative form, but it's also possible to use the monadic form if you prefer (I did in doubleSpace). The Alternative class from Control.Applicative is also useful, for reasons implied by the name.
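One way `<|>` from Control.Applicative might help with the repeated page headers is to try the record parser on each line and fall back to discarding the line when it fails. The sketch below uses a deliberately simplified stand-in (`Row`, with only the agent code and the name) instead of the full record type. `Row`, `row`, `skipLine`, `line` and `rows` are names made up for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.Text
import Data.Maybe (catMaybes)
import Data.Text (Text)
import qualified Data.Text as T

-- Simplified stand-in for the full record: agent code and name only.
data Row = Row Int Text deriving (Eq, Show)

-- Consume the rest of the current line, including its newline if present.
skipLine :: Parser ()
skipLine = skipWhile (not . isEndOfLine) <* (endOfLine <|> endOfInput)

-- A record line: a number, padding, then a name ending at two spaces.
-- The name is not allowed to run past the end of the line.
row :: Parser Row
row = Row <$> (decimal <* skipWhile (== ' '))
          <*> (T.pack <$> manyTill (satisfy (not . isEndOfLine)) (string "  "))
          <*  skipLine

-- Try a record first; if that fails, throw the line away.
line :: Parser (Maybe Row)
line = (Just <$> row) <|> (Nothing <$ skipLine)

-- manyTill stops at end of input, so the fallback cannot loop forever there.
rows :: Parser [Row]
rows = catMaybes <$> manyTill line endOfInput
```

Running `parseOnly rows` over a whole report should then return only the record lines, with headers, rulers and blank lines skipped. A header line that happens to look like a record would still slip through, so the record parser needs to be strict enough for the real data.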
For playing with this, I highly recommend GHCi and the parseTest function. GHCi is just overall handy and a good way to test individual parsers, while parseTest takes a parser and an input string and prints the status of the run, the parsed value, and any remaining input not parsed. Very useful when you're not quite sure what's going on.
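A small sketch of how that could look for the date parser alone. The GHCi call is shown as a comment, and `checkDate` is a hypothetical name used here to show `parseOnly`, the non-interactive counterpart.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text

data Date = Date Int Int Int deriving (Eq, Show)

date :: Parser Date
date = Date <$> decimal <*> (char '/' *> decimal) <*> (char '/' *> decimal)

-- In GHCi, after :set -XOverloadedStrings, try the parser on a fragment:
--   parseTest date "12/06/2013  2004"
-- The output shows the parsed value and the input that was left over.

-- parseOnly gives the same information as an Either, without printing.
checkDate :: Either String Date
checkDate = parseOnly date "12/06/2013"
```

If parseTest reports a Partial result, the parser is waiting for more input. That can happen when the test string ends in the middle of something like a number, so ending the fragment with a space or newline tends to give clearer output.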