Will it ever be possible for $/ to support regexes?

Tags:

perl

To quote perlvar:

... the value of $/ is a string, not a regex. awk has to be better for something. :-)

It is not difficult to think of situations where such a feature would be useful: parsing files with variable-length records is a classic use case that I run into often.

So far I have never had trouble loading the entire file into memory and splitting it:

my @records = split /my_regex/, do { local $/; <> };  # slurp all input, then split

but for obvious reasons this technique cannot be used when available memory is inadequate. In fact, there is often no need to hold all records in memory at the same time.

Which brings me back to $/.

I find it odd that the language has not provisioned regex support for $/. Was this done by design? Is it simply impossible to implement? What other workarounds exist that can be considered as best practices in the absence of what would be a nifty feature?

Zaid asked Oct 03 '13 12:10


3 Answers

It doesn't make much sense to even try. Far too often, you wouldn't be able to tell if you've reached the end of the line without reading past its end. That could be very bad in interactive situations.

For example, let's say you have the following program:

local $/ = qr/\n|\r\n?/;  # Handle Windows, Unix and old MacOS line endings.
while (1) {
   print "Please enter a command: ";
   my $cmd = <>;
   $cmd =~ s{$/\z}{};
   process($cmd);
}

Looks pretty straightforward, right? In fact, supporting qr/\n|\r\n?/ is probably the number one reason for this request. Well, even that simple code is severely flawed. Let's say I use old MacOS line endings (CR, ^M, \r):

 $ processor
 Please enter a command: foo^M
 [hangs]

The program hangs because it can't tell whether I gave it a MacOS line ending (CR, ^M, \r) or a Windows line ending (CRLF, ^M^J, \r\n) until another character is typed.

I'd have to enter a second command to process the first, a third command to process the second, etc. It just makes no sense.
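To see why the reader has no choice but to wait, here is a minimal sketch of the alternative: an eager reader that returns as soon as the separator matches. Since $/ does not accept a regex, the input stream is simulated with a plain string, and eager_records is a made-up helper for illustration only.

```perl
use strict;
use warnings;

my $sep = qr/\n|\r\n?/;

# Hypothetical eager reader: consume the stream one character at a time and
# emit a record as soon as the separator matches at the end of the buffer.
sub eager_records {
    my ($stream) = @_;
    my $buffer  = '';
    my @records;
    for my $char (split //, $stream) {
        $buffer .= $char;
        if ($buffer =~ s/$sep\z//) {
            push @records, $buffer;
            $buffer = '';
        }
    }
    return @records;
}

# Two commands sent with Windows line endings.
my @records = eager_records("foo\r\nbar\r\n");
print scalar(@records), " records: ", join(", ", map { "[$_]" } @records), "\n";
# prints: 4 records: [foo], [], [bar], []
```

The eager reader treats the CR as a complete line ending, so the LF that follows becomes a bogus empty command. Avoiding that means waiting for the character after the CR, which is exactly the hang shown above.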

ikegami answered Sep 25 '22 10:09


One of the biggest problems I can see is that supporting a regex record separator in general requires the entire contents of the file to be scanned.

Suppose, for instance, that, for whatever reason, you had specified a separator of /\n[^X]+\z/. The whole file would need to be read to check whether there were any X characters after each newline.
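A small sketch of that dependency, using two made-up inputs that differ only in their final character:

```perl
use strict;
use warnings;

my $sep = qr/\n[^X]+\z/;

# Whether the newline starts a separator depends on text that may lie
# arbitrarily far ahead, here the very last character of the input.
for my $input ("first\nsecond", "first\nsecondX") {
    my $verdict = $input =~ $sep ? "starts a separator" : "does not start a separator";
    (my $shown = $input) =~ s/\n/\\n/g;
    print "$shown: the newline $verdict\n";
}
```

The first input matches and the second does not, so a reader that has only seen "first\n" cannot yet say whether a record has ended.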

So there are three options that I can think of:

  • Buffering the whole file just to scan for record separators

  • Implementing regular expressions on a "paged" string so that the file can be read in parts

  • Implementing a subset of the standard regular expressions for use as record separators

None of these is a particularly attractive prospect from the implementation point of view, and I can see why one would avoid doing it if possible, especially as the first option is already available to the Perl coder through the use of split.
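For separators that only need a bounded amount of look-ahead, a middle ground can be hand-rolled: read the file in blocks and split the buffer as you go. The following is a rough sketch, not a core API. The name regex_record_reader and the block size are assumptions, the separator must never match the empty string, and patterns such as /\n[^X]+\z/ that depend on the end of the file are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an iterator that yields one record per call,
# reading $fh in blocks instead of slurping the whole file.
sub regex_record_reader {
    my ($fh, $sep, $block_size) = @_;
    $block_size ||= 8192;
    my $buffer = '';
    my $eof    = 0;

    return sub {
        while (1) {
            # Accept a separator match only if it ends before the end of the
            # buffer, or if no further input could extend it.
            if ($buffer =~ /$sep/ and ($eof or $+[0] < length($buffer))) {
                my $record = substr($buffer, 0, $-[0]);
                substr($buffer, 0, $+[0], '');
                return $record;
            }
            if ($eof) {
                return undef unless length($buffer);
                my $record = $buffer;    # final record without a separator
                $buffer = '';
                return $record;
            }
            my $got = read($fh, $buffer, $block_size, length($buffer));
            die "read failed: $!" unless defined $got;
            $eof = 1 if $got == 0;
        }
    };
}

# Usage with an in-memory filehandle and a deliberately tiny block size.
my $data = "alpha--beta---gamma-\ndelta";
open my $fh, '<', \$data or die "open failed: $!";
my $next = regex_record_reader($fh, qr/-+\n?/, 4);
while (defined(my $record = $next->())) {
    print "[$record]\n";
}
```

Deferring any match that touches the end of the buffer is what keeps a separator from being cut in half at a block boundary. It is also the same wait-for-more-input behaviour that makes this approach unsuitable for interactive input.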

Borodin answered Sep 21 '22 10:09


The (backtracking) implementation of the Perl regex engine is fundamentally incompatible with use as a line ending. Part of the problem is that you don't want to rerun the whole regex every time the next character is read. For example, take the regex

$/ = qr/ A \w*? B | XY /x;

And the data stream

f o o A 1 2 X Y B b a r

So when should the readline return? If we do incremental matching, we might get something like

f o o A 1 2 X Y B b a r
      A\w\w\w\w B

#=> fooA12XYB

If we re-run the whole regex at each position, we get

f o o A 1 2 X Y B b a r

      A *FAIL
      *FAIL

      A\w *FAIL
      *FAIL

      A\w\w *FAIL
      *FAIL

      A\w\w\w *FAIL
            X *FAIL

      A\w\w\w\w *FAIL
            X Y

#=> fooA12XY

In other words, alternations (with precedence) make this matching complicated. If the regex engine did not backtrack (and instead ran as a table-driven parser or state machine), there would be no difference between rerunning the whole regex and matching incrementally. However, regex engines where this is possible are less expressive than Perl regexes.
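The two outcomes can be reproduced with ordinary pattern matching on a string. This is only a simulation of the two strategies, since $/ itself cannot hold a regex, and the variable names are made up for the example.

```perl
use strict;
use warnings;

my $sep    = qr/ A \w*? B | XY /x;
my $stream = "fooA12XYBbar";

# Strategy 1: match once against everything available. The leftmost match
# starts at "A" and runs up to "B".
my $one_pass;
if ($stream =~ $sep) {
    $one_pass = substr($stream, 0, $+[0]);
}

# Strategy 2: re-run the regex after each newly read character and stop at
# the first prefix that contains a match.
my $rerun;
for my $len (1 .. length($stream)) {
    my $prefix = substr($stream, 0, $len);
    if ($prefix =~ $sep) {
        $rerun = $prefix;
        last;
    }
}

print "one pass: $one_pass\n";    # fooA12XYB
print "re-run:   $rerun\n";       # fooA12XY
```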

Another problem would be the line ending

$/ = qr/ .+ /xs;

Should reading such a “line” return just the next character (because the regex is already satisfied after one character), or the whole file (because .+ wants to match as much as possible)? Or should the rest of the internal buffer be returned, whatever it currently contains?

To use regexes for line endings, these ambiguities would have to be addressed, and additional limitations would have to be imposed (e.g. only regular languages allowed).

amon answered Sep 21 '22 10:09