To quote perlvar:
"... the value of $/ is a string, not a regex. awk has to be better for something. :-)"
It is not difficult to think of situations where such a feature would be useful - parsing files with variable-length records is a classic use case, and one I run into regularly.
So far I have never had trouble loading the entire file into memory and doing a
my @records = split /my_regex/, <>;
but for obvious reasons this technique cannot be used when the available memory is inadequate. In fact, much of the time there is no need for all records to be stored at the same time.
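Spelled out, the technique I mean is something like this (the file name and the dashed-line separator are placeholders, nothing more):

use strict;
use warnings;

# Slurp the whole input, then split it into records on a regex.
open my $fh, '<', 'records.txt' or die "Cannot open records.txt: $!";
my $content = do { local $/; <$fh> };     # undef $/ puts readline into slurp mode
my @records = split /\n-{4,}\n/, $content;  # e.g. records separated by lines of dashes
print scalar(@records), " records\n";

It works, but only because the entire file sits in memory before split ever sees it.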
Which brings me back to $/.
I find it odd that the language does not provide regex support for $/. Was this omitted by design? Is it simply impossible to implement? What workarounds are considered best practice in the absence of what would be a nifty feature?
It doesn't make much sense to even try. Far too often, you wouldn't be able to tell if you've reached the end of the line without reading past its end. That could be very bad in interactive situations.
For example, let's say you have the following program:
local $/ = qr/\n|\r\n?/; # Handle Windows, Unix and old MacOS line endings.
while (1) {
    print "Please enter a command: ";
    my $cmd = <>;
    $cmd =~ s{$/\z}{};
    process($cmd);
}
Looks pretty straightforward, right? In fact, supporting qr/\n|\r\n?/ is probably the number one reason for this request. Well, even that simple code is severely flawed. Let's say I use old MacOS line endings (CR, ^M, \r):
$ processor
Please enter a command: foo^M
[hangs]
The program hangs because it can't tell whether I gave it a MacOS line ending (CR, ^M, \r) or a Windows line ending (CRLF, ^M^J, \r\n) until another character is typed.
I'd have to enter a second command to process the first, a third command to process the second, etc. It just makes no sense.
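For what it's worth, the usual compromise is to keep the default $/ and strip whatever ending actually arrived. That copes with Unix and Windows endings, but for exactly the reason above it still cannot handle lone-CR input interactively. A sketch (process() is just a stand-in here):

use strict;
use warnings;

while (1) {
    print "Please enter a command: ";
    defined(my $cmd = <STDIN>) or last;   # default $/ = "\n", so a CRLF line still ends in "\n"
    $cmd =~ s/\R\z//;                     # \R (Perl 5.10+) strips "\n" or "\r\n" in one go
    process($cmd);
}

sub process { print "got: $_[0]\n" }      # dummy handler so the sketch runs as-is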
One of the biggest problems I can see is that supporting a regex record separator in general requires the entire contents of the file to be scanned.
Suppose, for instance, that for whatever reason you had specified a separator of /\n[^X]+\z/. The whole file would need to be read to check whether there were any X characters after each newline.
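To make that concrete: the only way to honour such a separator is to have the whole file in memory anyway, at which point the hypothetical feature buys nothing over a plain split (the file name here is a placeholder):

use strict;
use warnings;

open my $fh, '<', 'data.txt' or die "Cannot open data.txt: $!";
my $whole = do { local $/; <$fh> };        # no boundary can be confirmed before EOF
my @records = split /\n[^X]+\z/, $whole;   # the \z anchor forces a scan to the end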
So there are three options that I can think of:
1. Buffering the whole file just to scan for record separators
2. Implementing regular expressions on a "paged" string, so that the file can be read in parts
3. Implementing a subset of the standard regular expressions for use as record separators
None of these is a particularly attractive prospect from the implementation point of view, and I can see why it would be avoided if possible, especially as the first option is already available to the Perl coder through the use of split.
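As a practical middle ground, something close to the second option can be approximated in pure Perl today: read the file in fixed-size chunks and hand out records as separator matches complete. The sketch below is only that, a sketch; make_record_reader, the file name and the chunk size are invented for the example, and, as discussed further below, it only behaves predictably for separators whose match cannot grow or change once more input arrives:

use strict;
use warnings;

# Returns a closure that yields one record per call, or undef at EOF.
# Assumes a non-empty (positive-width) separator pattern.
sub make_record_reader {
    my ($fh, $sep, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;
    my $buf = '';
    my $eof = 0;

    return sub {
        while (1) {
            if ($buf =~ /$sep/) {
                # Only trust a match that cannot extend further: it either
                # stops short of the end of the buffer, or we are at EOF.
                if ($+[0] < length $buf or $eof) {
                    my $record = substr $buf, 0, $-[0];
                    $buf = substr $buf, $+[0];
                    return $record;
                }
            }
            if ($eof) {
                return undef if $buf eq '';
                my $record = $buf;        # trailing data with no final separator
                $buf = '';
                return $record;
            }
            $eof = 1 unless read $fh, $buf, $chunk_size, length $buf;
        }
    };
}

# Usage: iterate over records separated by lines of dashes.
open my $fh, '<', 'data.txt' or die "Cannot open data.txt: $!";
my $next = make_record_reader($fh, qr/\n-{4,}\n/);
while (defined(my $record = $next->())) {
    print "record: $record\n";
}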
The (backtracking) implementation of the Perl regex engine is fundamentally incompatible with using a regex as the line ending. Part of the problem is that you don't want to rerun the whole regex every time the next character is read. For example, take the regex
$/ = qr/ A \w*? B | XY /x;
and the data stream
f o o A 1 2 X Y B b a r
So when should the readline
return? If we do incremental matching, we might get something like
f o o A 1 2 X Y B b a r
A\w\w\w\w B
#=> fooA12XYB
If we re-run the whole regex at each position, we get
f o o A 1 2 X Y B b a r
A *FAIL
*FAIL
A\w *FAIL
*FAIL
A\w\w *FAIL
*FAIL
A\w\w\w *FAIL
X *FAIL
A\w\w\w\w *FAIL
X Y
#=> fooA12XY
In other words, alternations (with precedence) make this matching complicated. If the regex engine were not backtracking (but instead ran as a table-driven parser or state machine), there would be no difference between rerunning the whole regex and doing incremental matching. However, regex engines for which this is possible are less expressive than Perl regexes.
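For comparison, once the complete data is available, Perl's engine simply commits to the first alternative, which is what the incremental result above corresponds to (the /x flag is what makes the spaces in the pattern insignificant):

use strict;
use warnings;

my $sep = qr/ A \w*? B | XY /x;
"fooA12XYBbar" =~ $sep;
print "$&\n";   # prints "A12XYB": with the whole string visible, the A...B alternative wins

A character-at-a-time re-run, by contrast, would have settled on "XY" as soon as the Y arrived.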
Another problem would be the line ending
$/ = qr/ .+ /xs;
Should reading such a “line” return just the next character (because the regex is already satisfied after one character), or the whole file (because .+ wants to match as much as possible)? Or should the rest of the internal buffer be returned, whatever it currently contains?
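The greedy reading, at least, is easy to confirm once the data is all in memory; whether a readline built on this pattern should behave the same way is exactly the open question:

use strict;
use warnings;

my $text = "line one\nline two\nline three\n";
$text =~ /.+/s;                     # /s lets . match the newlines too
print length($&), " of ", length($text), " characters matched\n";   # all of them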
To use regexes for line endings, these ambiguities would have to be addressed, and additional limitations would have to be imposed (e.g. only regular languages allowed).