
Read perl file handle with $INPUT_RECORD_SEPARATOR as a regex

Tags:

perl

I'm looking for a way to read from a file handle line by line (and then execute a function on each line), with the following twist: what I want to treat as a "line" may be terminated by any of several characters, not just the single separator that I define as $/. I know that $INPUT_RECORD_SEPARATOR (or $/) supports neither regular expressions nor a list of characters to be treated as line terminators, and this is where my problem lies.

My file handle comes from stdout of a process. Thus, I cannot seek inside the file handle and the full content is not available immediately but is produced bit by bit as the process is executed. I want to be able to attach things like a timestamp to each "line" the process produces using a function that I called handler in my examples. Each line should be handled as soon as it gets produced by the program.

Unfortunately, I can only come up with two ways: one executes the handler function immediately but seems horribly inefficient; the other uses a buffer but leads to "grouped" calls of the handler function and thus, for example, produces wrong timestamps.

In fact, in my specific case, my regex would even be very simple and just read /\n|\r/. So for this particular problem I don't even need full regex support but just the possibility to treat more than one character as the line terminator. But $/ doesn't support this.
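
As a small illustration of that limitation, here is a sketch that assumes an in-memory file handle and made-up sample data. It assigns the text of the pattern to $/ and shows that Perl treats it as a literal three-character separator (newline, pipe, carriage return), not as an alternation:

```perl
use strict;
use warnings;

# In-memory file handle standing in for the process output
# (the sample data is purely illustrative).
my $data = "a\nb\rc\n|\rd";
open(my $fh, '<', \$data) or die "open: $!";

my @records;
{
    # $/ is matched as a literal string: newline, pipe, carriage return.
    local $/ = "\n|\r";
    @records = <$fh>;
}
close($fh);

# Only the literal sequence "\n|\r" ends a record, so the lone "\n"
# and "\r" earlier in the data do not split anything.
printf "%d records\n", scalar @records;
```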

Is there an efficient way to solve this problem in Perl?

Here is some quick pseudo-perl code to demonstrate my two approaches:

read the input file handle byte-by-byte

This would look like this:

my $acc = "";
while (read($fd, my $b, 1)) {
    $acc .= $b;
    if ($acc =~ /someregex$/) {
        handler($acc);
        $acc = "";
    }
}

The advantage here is that handler gets dispatched immediately once enough bytes have been read. The disadvantage is that we append to a string and run the regex for every single byte we read from $fd.

read the input file handle with blocks of X-byte at a time

This would look like this:

my $acc = "";
while (read($fd, my $b, $bufsize)) {
    if ($b =~ /someregex/) {
        # limit -1 keeps a trailing empty field, so the last part is
        # always the unterminated remainder (possibly empty)
        my @parts = split /someregex/, $b, -1;
        my $first = shift @parts;
        handler($acc . $first);
        my $last = pop @parts;
        foreach my $part (@parts) {
            handler($part);
        }
        $acc = $last;
    } else {
        # no terminator in this block yet: keep accumulating
        $acc .= $b;
    }
}

The advantage here is that we are more efficient, as we only run the regex once per block of up to $bufsize bytes. The disadvantage is that the execution of handler has to wait until $bufsize bytes have been read.

josch asked Oct 30 '22 19:10


1 Answer

Setting $INPUT_RECORD_SEPARATOR to a regex wouldn't help, because Perl's readline uses buffered IO, too. The trick is to use your second approach but with unbuffered sysread instead of read. If you sysread from a pipe, the call will return as soon as data is available, even if the whole buffer couldn't be filled (at least on Unix).
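
A minimal sketch of that idea, under some assumptions: read_records is a hypothetical helper name, the separator is the /\n|\r/ from the question, and the demo child process only stands in for the real program. With this regex a "\r\n" pair yields an empty record between the two characters.

```perl
use strict;
use warnings;

# Hypothetical helper: read from $fh with unbuffered sysread and call
# $handler on every record terminated by \n or \r as soon as it arrives.
sub read_records {
    my ($fh, $handler, $bufsize) = @_;
    $bufsize ||= 4096;
    my $acc = '';
    while (1) {
        my $n = sysread($fh, my $chunk, $bufsize);
        if (!defined $n) {
            next if $!{EINTR};      # interrupted by a signal: retry
            die "sysread: $!";
        }
        last if $n == 0;            # end of file
        $acc .= $chunk;
        # Limit -1 keeps a trailing empty field, so the last element is
        # always the unterminated remainder (possibly empty).
        my @parts = split /\n|\r/, $acc, -1;
        $acc = pop @parts;
        $handler->($_) for @parts;
    }
    $handler->($acc) if length $acc;    # flush a final unterminated record
    return;
}

# Demo: a child perl process (a stand-in for the real program) prints
# three pieces with short pauses in between.
open(my $fh, '-|', $^X, '-e',
    '$| = 1; for my $s ("one\n", "two\r", "three\n") { print $s; select(undef, undef, undef, 0.2) }')
    or die "cannot start child: $!";

my @seen;
read_records($fh, sub {
    my ($line) = @_;
    push @seen, $line;
    print scalar(localtime), " $line\n";    # timestamp each record
});
close($fh);
```

Because sysread bypasses Perl's IO buffering, it should not be mixed with read, readline or eof on the same file handle. The splitting step is also only safe as written for single-character separators; a multi-character separator could straddle two chunks and would need extra care.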

nwellnhof answered Nov 15 '22 06:11