
Parsing a multi-line fixed format text file

I'm trying to parse some data in a fixed-format text file where each "record" is spread over a number of lines, like so:

 MAILBOX: 10013      Created: 01/20/09  4:39 pm
    MSGS: 0         UNPLAYED: 0           URGENT: 0          RECEIPT: 0
  LCOS: RBC Standard    : 20            FCOS: RBC Standard      : 20 
  GCOS: Default GCOS 1  : 1             NCOS: Default           : 1 
  TCOS: Default TCOS 1  : 1             RCOS:                   : 1 
BAD LOGS: 0         LAST LOG: NEVER                             MINS:      0.0
  PASSWD: Y            TUTOR: N              DAY: M            NIGHT: M       
    NAME:                                   CODE: 
   EXTEN: 10013                            INDEX: 0
ATTEN DN:                                  INDEX: 0         
DISTRIBUTION LISTS WITH CHANGE RIGHTS:
    all
DISTRIBUTION LISTS WITH REVIEW RIGHTS:
    all

I've used FileHelpers before for single-line records, and it's been very useful. Checking its documentation, it does have a MultiRecordEngine feature, but this is going to mean:

  • a class for each line ... not a problem
  • calculating the exact size of each fixed format field ... painful and open to error
  • logic to check each line

A further wrinkle is that the "fixed" format is not actually fixed: the set of lines varies with the target record, so some records have 21 lines, some 22, 23, 24, etc.

I have found a Java flat-file parsing library, FFP, but I'm a .NET, C#, PowerShell coder.

Are there better ways of handling this sort of parsing?

SteveC asked Jan 30 '12 08:01


2 Answers

What you need is a lexer. Your record is too big to parse with a single regex, so you have to write one regex for each line, and a state machine to validate that the lines follow in the right order.
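As a rough illustration of the "one regex per line plus a state machine" idea, here is a minimal sketch in Python (not the .NET code the question asks about, though the patterns should port to .NET's Regex class with little change). It covers only the first three line types of the sample record; the names `LINE_PATTERNS` and `parse_records` are made up for this example, and the patterns assume the spacing seen in the sample.

```python
import re

# One pattern per line type, listed in the order the lines must appear.
# Assumption: only the first three line types of the sample are handled.
LINE_PATTERNS = [
    ("mailbox", re.compile(
        r"^\s*MAILBOX:\s*(?P<mailbox>\d+)\s+Created:\s*(?P<created>.+?)\s*$")),
    ("counts", re.compile(
        r"^\s*MSGS:\s*(?P<msgs>\d+)\s+UNPLAYED:\s*(?P<unplayed>\d+)"
        r"\s+URGENT:\s*(?P<urgent>\d+)\s+RECEIPT:\s*(?P<receipt>\d+)\s*$")),
    ("lcos", re.compile(
        r"^\s*LCOS:\s*(?P<lcos>.*?)\s*:\s*(?P<lcos_id>\d+)"
        r"\s+FCOS:\s*(?P<fcos>.*?)\s*:\s*(?P<fcos_id>\d+)\s*$")),
]


def parse_records(lines):
    """Yield one dict per record; raise ValueError if a line is out of order."""
    state = 0      # index of the line type expected next
    record = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue   # blank lines between records are ignored
        name, pattern = LINE_PATTERNS[state]
        match = pattern.match(line)
        if match is None:
            raise ValueError(
                f"line {number}: expected a {name} line, got {line!r}")
        record.update(match.groupdict())
        state += 1
        if state == len(LINE_PATTERNS):   # last line type seen: record done
            yield record
            record, state = {}, 0
    if state != 0:
        raise ValueError("input ended in the middle of a record")


# The first three lines of the record from the question.
SAMPLE = [
    " MAILBOX: 10013      Created: 01/20/09  4:39 pm",
    "    MSGS: 0         UNPLAYED: 0           URGENT: 0          RECEIPT: 0",
    "  LCOS: RBC Standard    : 20            FCOS: RBC Standard      : 20 ",
]

records = list(parse_records(SAMPLE))
```

Because the patterns match on labels and whitespace rather than column offsets, there is no need to measure each field's width. Records with optional lines would need extra states or a "this line may be skipped" flag per pattern, which this sketch leaves out.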

Or you can use a general-purpose lexer/parser generator to produce the code for you. Wikipedia has a long list. The GOLD Parser looks like a good candidate.

I would not try to do the lexing/parsing in PowerShell. I would rather write the code as C# or F# and use the assembly from PowerShell.

Edit: I've just looked at the FileHelpers library. You could create a MultiRecordEngine with a .NET type that matches each line in your source record. All you have to do then is check the result array for valid order and create objects.

Huusom answered Oct 11 '22 14:10


I've done something similar in PowerShell, and found that a regex written in a here-string is much easier to work with:

http://mjolinor.wordpress.com/2012/01/05/powershell-multiline-regex-matching/
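For anyone who wants to see the general shape of the technique without following the link, here is a loose Python analogue (not the code from the linked post). A triple-quoted string stands in for the PowerShell here-string, and the pattern is laid out line by line to mirror the record. Only the first two lines of the sample are matched; the names `RECORD` and `TEXT` are made up for this sketch.

```python
import re

# The pattern is laid out like the record itself. re.VERBOSE ignores literal
# whitespace inside the pattern, so spacing is matched explicitly with \s;
# re.MULTILINE lets ^ match at the start of every line of the input.
RECORD = re.compile(r"""
    ^\s*MAILBOX:\s*(?P<mailbox>\d+)\s+Created:\s*(?P<created>[^\n]+?)\s*\n
    \s*MSGS:\s*(?P<msgs>\d+)\s+UNPLAYED:\s*(?P<unplayed>\d+)
    \s+URGENT:\s*(?P<urgent>\d+)\s+RECEIPT:\s*(?P<receipt>\d+)
""", re.VERBOSE | re.MULTILINE)

# First two lines of the record from the question, as one block of text.
TEXT = """\
 MAILBOX: 10013      Created: 01/20/09  4:39 pm
    MSGS: 0         UNPLAYED: 0           URGENT: 0          RECEIPT: 0
"""

# finditer walks the whole text, so a file holding many records
# yields one match (and one dict) per record.
records = [m.groupdict() for m in RECORD.finditer(TEXT)]
```

The trade-off against a per-line approach is that one large pattern fails as a whole: if a single line deviates, the record simply does not match, and working out which line was at fault takes extra effort.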

mjolinor answered Oct 11 '22 13:10