Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Strategies to handle a file with multiple fixed formats

Tags:

perl

This question is not Perl-specific, (although the unpack function will most probably figure into my implementation).

I have to deal with files where multiple formats exist to hierarchically break down the data into meaningful sections. What I'd like to be able to do is parse the file data into a suitable data structure.

Here's an example (commentary on RHS):

                                       # | Format | Level | Comment
                                       # +--------+-------+---------
**DEVICE 109523.69142                  #        1       1   file-specific
  .981    561A                         #        2       1
10/MAY/2010    24.15.30,13.45.03       #        3       2   group of records
05:03:01   AB23X  15.67   101325.72    #        4       3   part of single record
*           14  31.30474 13        0   #        5       3   part of single record
05:03:15   CR22X  16.72   101325.42    #        4       3   new record
*           14  29.16264 11        0   #        5       3
06:23:51   AW41X  15.67    101323.9    #        4       3
*           14  31.26493219        0   #        5       3
11/MAY/2010    24.07.13,13.44.63       #        3       2   group of new records
15:57:14   AB23X  15.67   101327.23    #        4       3   part of single record
*           14  31.30474 13        0   #        5       3   part of single record
15:59:59   CR22X  16.72   101331.88    #        4       3   new record
*           14  29.16264 11        0   #        5

The logic I have at the moment is fragile:

  • I know, for instance, that a Format 2 always comes after a Format 1, and that they only span 2 lines.
  • I also know that Formats 4 and 5 always come in pairs as they correspond to a single record. The number of records may be variable
  • I'm using regular expressions to infer the format of each line. However, this is risky and does not lend to flexibility in the future (when someone decides to change the format of the output).

The big question here is about what strategies I can employ to determine which format needs to be used for which line. I'd be interested to know if others have faced similar situations and what they've done to address it.

like image 502
Zaid Avatar asked Sep 03 '10 19:09

Zaid


People also ask

What is fixed format file?

Fixed-length format files use ordinal positions, which are offsets to identify where fields are within the record. There are no field delimiters. An end-of-record delimiter is required, even for the last record.

How does Informatica handle fixed width files?

To configure a fixed-width file format, you specify the number of columns and the width, name, and datatype for each column. You can also set advanced fixed-width format properties. For example, you can specify how to handle null characters or specify the default date format for each column.

How do you read a fixed width file?

Method 1: Using read.fwf function from utils package. We have to use column widths for reading. Syntax: read. fwf(file, widths, header = FALSE, sep = “\t”, skip = 0, row.


1 Answers

Toying with an answer to your question, I arrived at an interesting solution with a concise main loop:

while (<>) {
  given($_) {
    when (@{[ map $pattern{$_}, @expect]}) {}
    default {
      die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
    }
  }
}

As you'll see below, %pattern is a hash of named patterns for the different formats, and given/when against an array of Regex objects performs a short-circuiting search to find the first match.

From this, you can infer that @expect is a list of names of formats we expect to find on the current line.

For a while, I was stuck on the case of multiple possible expected formats and how to know format just matched, but then I remembered (?{ code }) in regular expressions:

This zero-width assertion evaluates any embedded Perl code. It always succeeds, and its code is not interpolated.

This allows something like a poor man's yacc grammar. For example, the pattern to match and process format 1 is

fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
             (?{ $device->{attr1} = $1;
                 @expect = qw< fmt2 >;
               })
          /x,

After processing the input from your question, $device contains

{
  'attr1' => '109523.69142',
  'attr2' => '.981',
  'attr3' => '561A',
  'groups' => [
    {
      'date' => '10/MAY/2010',
      'nnn' => [ '24.15.30', '13.45.03' ],
      'records' => [
        [ '05:03:01', 'AB23X', '15.67', '101325.72', '14', '31.30474',  '13', '0' ],
        [ '05:03:15', 'CR22X', '16.72', '101325.42', '14', '29.16264',  '11', '0' ],
        [ '06:23:51', 'AW41X', '15.67', '101323.9',  '14', '31.264932', '19', '0' ],
      ],
    },
    {
      'date' => '11/MAY/2010',
      'nnn' => [ '24.07.13', '13.44.63' ],
      'records' => [
        [ '15:57:14', 'AB23X', '15.67', '101327.23', '14', '31.30474', '13', '0' ],
        [ '15:59:59', 'CR22X', '16.72', '101331.88', '14', '29.16264', '11', '0' ],
      ],
    }
  ],
}

I'm amused with the result, but for some reason Larry's advice in perlstyle comes to mind:

Just because you CAN do something a particular way doesn't mean that you SHOULD do it that way.


For completeness, a working program demonstrating the result is below.

#! /usr/bin/perl

use warnings;
use strict;
use feature ':5.10';
use re 'eval';

*ARGV = *DATA;

my $device;
my $record;
my @expect = qw/ fmt1 /;
my %pattern;
%pattern = (
  fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
               (?{ $device->{attr1} = $1;
                   @expect = qw< fmt2 >;
                 })
            /x,

  fmt2 => qr/^ \s* (\S+) \s+ (\S+) \s*$
               (?{ @{$device}{qw< attr2 attr3 >} = ($1,$2);
                   @expect = qw< fmt3 >;
                 })
            /x,

  # e.g., 10/MAY/2010    24.15.30,13.45.03
  fmt3 => qr/^ (\d\d\/[A-Z]{3}\/\d{4}) \s+ (\S+) \s*$
               (?{ my($date,$nnns) = ($1,$2);
                   push @{ $device->{groups} } =>
                     { nnn  => [ split m|,| => $nnns ],
                       date => $date };
                   @expect = qw< fmt4 >;
                 })
            /x,

  # e.g., 05:03:01   AB23X  15.67   101325.72
  fmt4 => qr/^ (\d\d:\d\d:\d\d) \s+
               (\S+) \s+ (\S+) \s+ (\S+)
               \s*$
               (?{ push @{ $device->{groups}[-1]{records} } =>
                        [ $1, $2, $3, $4 ];
                   @expect = qw< fmt4 fmt5 >;
                 })
            /x,

  # e.g., *           14  31.30474 13        0
  fmt5 => qr/^\* \s+ (\d+) \s+
              # tricky: possibly no whitespace after 9-char float
              ((?=\d{1,7}\.\d+)[\d.]{1,9}) \s*
              (\d+) \s+ (\d+)
              \s*$
              (?{ push @{ $device->{groups}[-1]{records}[-1] } =>
                        $1, $2, $3, $4;
                  @expect = qw< fmt4 fmt3 fmt2 >;
                })
            /x,
);

while (<>) {
  given($_) {
    when (@{[ map $pattern{$_}, @expect]}) {}
    default {
      die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
    }
  }
}

use Data::Dumper;
$Data::Dumper::Terse = $Data::Dumper::Indent = 1;
print Dumper $device;

__DATA__
**DEVICE 109523.69142
  .981    561A
10/MAY/2010    24.15.30,13.45.03
05:03:01   AB23X  15.67   101325.72
*           14  31.30474 13        0
05:03:15   CR22X  16.72   101325.42
*           14  29.16264 11        0
06:23:51   AW41X  15.67    101323.9
*           14  31.26493219        0
11/MAY/2010    24.07.13,13.44.63
15:57:14   AB23X  15.67   101327.23
*           14  31.30474 13        0
15:59:59   CR22X  16.72   101331.88
*           14  29.16264 11        0
like image 115
Greg Bacon Avatar answered Nov 14 '22 23:11

Greg Bacon