How can I extract/parse tabular data from a text file in Perl?

I am looking for something like HTML::TableExtract, just not for HTML input, but for plain text input that contains "tables" formatted with indentation and spacing.

Data could look like this:

Here is some header text.

Column One       Column Two      Column Three
a                                           b
a                    b                      c


Some more text

Another Table     Another Column
abdbdbdb          aaaa
Thilo asked Oct 14 '10 03:10



2 Answers

I'm not aware of any packaged solution, but a not-very-flexible one is fairly simple to write, assuming you can make two passes over the file. (The following is a partly pseudocode, Perlish example.)

  • Assumption: data may contain spaces and is NOT quoted à la CSV when it does. If it is quoted, just use Text::CSV(_XS).
  • Assumption: no tabs are used for formatting.
  • The logic defines a "column separator" as any run of consecutive character offsets that contain only spaces in every line.
  • If by accident every row has a space that is part of the data at offset M, the logic will treat offset M as a column separator, since it can't know any better. The ONLY way it can know better is if you require column separation to be at least X spaces where X>1; see the second code fragment for that.

Sample code:

my $INFER_FROM_N_LINES = 10; # Infer columns from this # of lines
                             # 0 means from entire file
my $lines_scanned = 0;
my @non_spaces = ();
# First pass - find which character columns in the file have all spaces and which don't
# ($file is a placeholder for the path to your input file)
open(my $fh, '<', $file) or die "Cannot open $file: $!";
while (<$fh>) {
    last if $INFER_FROM_N_LINES && $lines_scanned++ == $INFER_FROM_N_LINES;
    chomp;
    my @chars = split(//, $_);
    for (my $i = 0; $i < @chars; $i++) { # Probably can be done prettier via map?
        $non_spaces[$i] = 1 if $chars[$i] ne " ";
    }
}
close $fh or die;

# Find columns, defined as consecutive "non-spaces" slices.
my (@starts, @ends); # Index at which columns start and end
my $state = " "; # Not inside a column
for (my $i = 0; $i < @non_spaces; $i++) {
    next if $state eq " " && !$non_spaces[$i];
    next if $state eq "c" && $non_spaces[$i];
    if ($state eq " ") { # && $non_spaces[$i] of course => start column
        $state = "c";
        push @starts, $i;
    } else { # meaning $state eq "c" && !$non_spaces[$i] => end column
        $state = " ";
        push @ends, $i-1;
    }
}
if ($state eq "c") { # Last char is NOT a space - produce the last column end
    push @ends, $#non_spaces;
}

# Now split lines (second pass over the same file)
open($fh, '<', $file) or die "Cannot open $file: $!";
my @rows = ();
while (<$fh>) {
    chomp;
    my $line = $_;
    my @columns = ();
    push @rows, \@columns;
    for (my $col_num = 0; $col_num < @starts; $col_num++) {
        # A line shorter than a column's start offset yields an empty field
        $columns[$col_num] = $starts[$col_num] < length($line)
            ? substr($line, $starts[$col_num], $ends[$col_num]-$starts[$col_num]+1)
            : "";
    }
}
close $fh or die;
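
Once the column start offsets are known, the slicing step could also be written with Perl's built-in unpack and a fixed-width template instead of substr. The following is a minimal sketch, not part of the original answer: the helper name build_template and the offsets (0, 17, 33) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Build an unpack template such as "A17 A16 A*" from column start offsets.
# Each field runs from the previous boundary up to the next column's start;
# the last field takes the rest of the line.
# (build_template is a hypothetical helper name.)
sub build_template {
    my @starts = @_;
    my ($prev, @fields) = (0);
    for my $next (@starts[1 .. $#starts]) {
        push @fields, 'A' . ($next - $prev);
        $prev = $next;
    }
    return join ' ', @fields, 'A*';
}

# Assumed offsets, e.g. as collected in @starts by the first pass.
my $template = build_template(0, 17, 33);

# An "A" field strips trailing spaces from each extracted value.
my $line    = sprintf('%-17s%-16s%s', 'a', '', 'b');
my @columns = unpack($template, $line);
print join('|', @columns), "\n";    # prints "a||b"
```

The template approach trims trailing padding for free, but it treats everything between two column starts as one field, so it ignores the @ends offsets.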

Now, if you require column separation to be at least X spaces where X>1, that is also doable, but the code that locates the columns needs to be a bit more complex:

# Find columns, defined as consecutive "non-spaces" slices separated by at least 3 spaces.
my $min_col_separator_is_X_spaces = 3;
my (@starts, @ends); # Index at which columns start and end
my $state = "S"; # inside a separator
NEXT_CHAR: for (my $i = 0; $i < @non_spaces; $i++) {
    if ($state eq "S") { # done with last column, inside a separator
        if ($non_spaces[$i]) { # start a new column
            $state = "c";
            push @starts, $i;
        }
        next;
    }
    # $state eq "c": processing a column
    next if $non_spaces[$i];
    # First space after non-space
    # Could be beginning of separator? check next X-1 chars!
    for (my $j = $i+1; $j < @non_spaces
                    && $j < $i+$min_col_separator_is_X_spaces; $j++) {
        if ($non_spaces[$j]) {
            $i = $j; # Gap too narrow, still inside the column; no need to re-scan
            next NEXT_CHAR; # OUTER loop
        }
    }
    # If we reach here, the X chars starting at $i are all spaces! Column ended!
    push @ends, $i-1;
    $state = "S";
    $i += $min_col_separator_is_X_spaces - 1; # the loop's $i++ skips the last separator char
}
if ($state eq "c") { # Last char is NOT a space - produce the last column end
    push @ends, $#non_spaces;
}
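
To see how the minimum-separator rule plays out, here is a self-contained sketch that runs both passes over an in-memory table laid out like the first one in the question. It is an illustration under assumptions, not the answer's own code: the sub names infer_columns and split_row are hypothetical, and the rows are built with sprintf so that the column offsets (0, 17, 33) are known exactly.

```perl
use strict;
use warnings;

# Return [start, end] offset pairs for columns separated by at least
# $min_gap offsets that hold a space in every line.
sub infer_columns {
    my ($min_gap, @lines) = @_;

    # Mark every offset that holds a non-space character in any line.
    my @non_spaces;
    for my $line (@lines) {
        while ($line =~ /[^ ]/g) {
            $non_spaces[ $-[0] ] = 1;
        }
    }

    my (@columns, $start);
    my $gap = 0;    # length of the current run of all-space offsets
    for my $i (0 .. $#non_spaces) {
        if ($non_spaces[$i]) {
            $start = $i unless defined $start;
            $gap   = 0;
        }
        elsif (defined $start && ++$gap == $min_gap) {
            push @columns, [ $start, $i - $min_gap ];
            undef $start;
        }
    }
    push @columns, [ $start, $#non_spaces ] if defined $start;
    return @columns;
}

# Cut one line into cells, trimming the trailing padding of each cell.
sub split_row {
    my ($line, @columns) = @_;
    return map {
        my ($start, $end) = @$_;
        my $cell = $start < length($line)
            ? substr($line, $start, $end - $start + 1)
            : '';
        $cell =~ s/ +$//;
        $cell;
    } @columns;
}

# Sample rows with columns starting at offsets 0, 17 and 33.
my @table = map { sprintf '%-17s%-16s%s', @$_ } (
    [ 'Column One', 'Column Two', 'Column Three' ],
    [ 'a',          '',           'b' ],
    [ 'a',          'b',          'c' ],
);

my @columns = infer_columns(3, @table);
print join('|', split_row($_, @columns)), "\n" for @table;
# Prints:
#   Column One|Column Two|Column Three
#   a||b
#   a|b|c
```

Calling infer_columns(1, @table) instead yields six columns, because the single space inside each two-word header is blank in every row. That is exactly the accidental-separator case described in the assumptions above.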
DVK answered Oct 29 '22 09:10



Here's a very quick solution, commented with an overview. (My apologies for the length.) Basically, if a "word" appears after the start of column header n, then it ends up in column n, unless most of its body trails into column n + 1, in which case it ends up there instead. Tidying this up, extending it to support multiple different tables, etc. are left as an exercise. You could also use something other than the left offset of the column header as the boundary mark, such as the centre, or some value determined by the column number.

#!/usr/bin/perl


use warnings;
use strict;


# Just plug your headers in here...
my @headers = ('Column One', 'Column Two', 'Column Three');

# ...and get your results as an array of arrays of strings.
my @result = ();


# quotemeta keeps any regex metacharacters in a header from being interpreted.
my $all_headers = '(' . (join ').*(', map { quotemeta } @headers) . ')';
my $found = 0;
my @header_positions;
my $line = '';
my $row = 0;
push @result, [] for (1 .. @headers);


# Get lines from file until a line matching the headers is found.

while (defined($line = <DATA>)) {

    # Get the positions of each header within that line.

    if ($line =~ /$all_headers/) {
        @header_positions = @-[1 .. @headers];
        $found = 1;
        last;
    }

}


$found or die "Table not found! :<\n";


# For each subsequent nonblank line:

while (defined($line = <DATA>)) {
    last if $line =~ /^$/;

    push @{$_}, "" for (@result);
    ++$row;

    # For each word in line:

    while ($line =~ /(\S+)/g) {

        my $word = $1;
        my $position = $-[1];
        my $length = $+[1] - $position;
        my $column = -1;

        # Get column in which word starts.

        while ($column < $#headers &&
            $position >= $header_positions[$column + 1]) {
            ++$column;
        }

        # If word is not fully within that column,
        # and more of it is in the next one, put it in the next one.

        if (!($column == $#headers ||
            $position + $length < $header_positions[$column + 1]) &&
            $header_positions[$column + 1] - $position <
            $position + $length - $header_positions[$column + 1]) {

            my $element = \$result[$column + 1]->[$row];
            $$element .= " $word";

        # Otherwise, put it in the one it started in.

        } else {

            my $element = \$result[$column]->[$row];
            $$element .= " $word";

        }

    }

}


# Output! Eight-column tabs work best for this demonstration. :P

foreach my $i (0 .. $#headers) {
    print $headers[$i] . ": ";
    foreach my $c (@{$result[$i]}) {
        print "$c\t";
    }
    print "\n";
}


__DATA__

This line ought to be ignored.

Column One       Column Two      Column Three
These lines are part of the tabular data to be processed.
The data are split based on how much words overlap columns.

This line ought to be ignored also.

Sample output:

Column One:      These lines are         The data are split
Column Two:      part of the tabular     based on how
Column Three:    data to be processed.   much words overlap columns.
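
The column-assignment rule described at the top of this answer can also be looked at in isolation. The sketch below restates it as a small function. The name column_for_word is hypothetical, and unlike the program above it clamps a word that starts before the first header into column 0, which is an assumption of this sketch.

```perl
use strict;
use warnings;

# Decide which column a word belongs to.
# $pos is the word's start offset, $len its length, and @header_pos the
# start offsets of the column headers, in ascending order.
sub column_for_word {
    my ($pos, $len, @header_pos) = @_;

    # Column in which the word starts: the last header at or before $pos.
    my $col = 0;
    $col++ while $col < $#header_pos && $pos >= $header_pos[$col + 1];

    # If more of the word lies past the next header than before it,
    # the word moves to the next column.
    if ($col < $#header_pos) {
        my $next   = $header_pos[$col + 1];
        my $before = $next - $pos;           # characters left of the boundary
        my $after  = $pos + $len - $next;    # characters at or past it
        $col++ if $after > $before;
    }
    return $col;
}

my @header_pos = (0, 17, 33);    # assumed header offsets
print column_for_word(10, 9, @header_pos), "\n";    # 7 before, 2 after: prints 0
print column_for_word(14, 8, @header_pos), "\n";    # 3 before, 5 after: prints 1
```

Swapping $header_pos[$col + 1] for another boundary, such as the centre of the header, only changes how $next is computed.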
Jon Purdy answered Oct 29 '22 11:10
