Why does the Perl version of these simple implementations of the Unix "wc" utility produce the "wrong" answer?
Perl:
#!/usr/bin/perl -w
use strict;
my $lines=0;
my $words=0;
my $bytes=0;
while (<>) {
    $lines++;
    $bytes += length;
    #chomp; # chomp does not make a difference
    $words += split /\s+/; # split on white spaces
}
print "$lines $words $bytes\n";
Python:
#!/usr/bin/python3
import fileinput
lines = 0
words = 0
bytes = 0
for line in fileinput.input():
    lines += 1
    bytes += len(line)
    words += len(line.strip().split()) # Python split on white spaces by default
print(f"{lines} {words} {bytes}")
When I feed both scripts a C source file of roughly 20k lines, I get the following results:
OUTPUT:
cat source.c | ./wc.pl
19681 62506 660235
cat source.c | ./pywc.py
19681 46643 660235
cat source.c | wc
19681 46643 660235
So the "wc" utility agrees with python.
My suspicion is that Perl's idea of whitespace is different from everyone else's? (Or, more likely, I am missing something about Perl's split command.)
perl -v
This is perl 5, version 30, subversion 0 (v5.30.0) built for x86_64-linux-gnu-thread-multi
(with 60 registered patches, see perl -V for more detail)
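TPL's comment basically solved the problem.
When I change split /\s+/ to a bare split, the answer agrees with the other tools.
As TPL also pointed out, split with no arguments discards leading whitespace before splitting, whereas split /\s+/ does not: on a line that starts with whitespace it returns an empty leading field, which then gets counted as a word. Indented lines are common in C source, which explains the inflated word count.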
The perldoc states:
As another special case, "split" emulates the default behavior of the command line tool awk when the PATTERN is either omitted or a string composed of a single space character (such as ' ' or "\x20", but not e.g. "/ /"). In this case, any leading whitespace in EXPR is removed before splitting occurs, and the PATTERN is instead treated as if it were "/\s+/"; in particular, this means that any contiguous whitespace (not just a single space character) is used as a separator.
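To make the difference concrete, here is a rough Python emulation of the two Perl behaviours. This is only a sketch: the helper names perl_split_pattern and perl_split_default are made up for illustration, and the emulation assumes plain ASCII whitespace. It keeps the empty leading field and drops trailing empty fields, which is what Perl's split /\s+/ does by default.

```python
import re

def perl_split_pattern(line):
    # Hypothetical helper: approximates Perl's `split /\s+/, $line`.
    # A leading empty field is kept; trailing empty fields are dropped.
    fields = re.split(r"\s+", line)
    while fields and fields[-1] == "":
        fields.pop()
    return fields

def perl_split_default(line):
    # Hypothetical helper: approximates Perl's bare `split`,
    # which skips leading whitespace before splitting.
    return line.split()

line = "    return 0;\n"
print(perl_split_pattern(line))  # ['', 'return', '0;']
print(perl_split_default(line))  # ['return', '0;']
```

The indented line yields three fields in the first case and two in the second, so every indented, non-blank line adds one phantom "word" to the total.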
The full solution is therefore:
#!/usr/bin/perl -w
use strict;
my $lines=0;
my $words=0;
my $bytes=0;
while (<>) {
    $lines++;
    $bytes+=length;
    $words+=split; # split /\s+/ does not remove leading WS
}
print "$lines $words $bytes\n";
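As a sanity check on this explanation, the gap between the two word counts (62506 - 46643 = 15863) should equal the number of lines that start with whitespace but are not blank, since each of those contributes exactly one empty leading field. I have not run this against source.c, so treat it as a sketch; count_extra_fields is a name made up for this example.

```python
def count_extra_fields(lines):
    # Hypothetical helper: count lines that begin with whitespace
    # and still contain at least one non-whitespace character.
    # Each such line produces one empty leading field under split /\s+/.
    return sum(1 for line in lines if line[:1].isspace() and line.strip())

# Tiny inline sample standing in for a real C file.
sample = ["int main(void)\n", "{\n", "    return 0;\n", "\n", "}\n"]
print(count_extra_fields(sample))  # 1
```

To try it on a real file, pass an open file object instead of the sample list, e.g. count_extra_fields(open("source.c")).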