Why does Perl "split /\s+/" yield the wrong word count?

Tags: perl

Why does the Perl version of these two simple implementations of the Unix "wc" utility produce the "wrong" word count?

Perl:

#!/usr/bin/perl -w

use strict;

my $lines=0;
my $words=0;
my $bytes=0;

while (<>) {
        $lines++;
        $bytes += length;
        #chomp;              # chomp does not make a difference
        $words += split /\s+/; # split on white spaces
}
print "$lines $words $bytes\n";

Python:

#!/usr/bin/python3

import fileinput

lines = 0
words = 0
bytes = 0

for line in fileinput.input():
    lines += 1
    bytes += len(line)
    words += len(line.strip().split()) # Python split on white spaces by default

print(f"{lines} {words} {bytes}")

When I feed each script a C source file of roughly 20,000 lines, I get the following results:

OUTPUT:
cat source.c | ./wc.pl
19681 62506 660235

cat source.c | ./pywc.py
19681 46643 660235

cat source.c | wc
19681 46643 660235

So the "wc" utility agrees with Python, and only the Perl word count differs.

My suspicion is that Perl's idea of whitespace differs from that of the other tools. (Or, more likely, I am missing something about Perl's split.)

perl -v

This is perl 5, version 30, subversion 0 (v5.30.0) built for x86_64-linux-gnu-thread-multi
(with 60 registered patches, see perl -V for more detail)
Paul Schutte asked Dec 28 '25 21:12


1 Answer

A comment from TPL essentially solved the problem.

Changing split /\s+/ to a bare split makes the Perl word count agree with the other tools.

As TPL also pointed out, split with no arguments removes leading whitespace before splitting, whereas split with an explicit pattern such as /\s+/ does not. With /\s+/, a line that begins with whitespace therefore produces an empty leading field, and that empty field is counted as a word.

The perldoc states:

As another special case, "split" emulates the default behavior of the command line tool awk when the PATTERN is either omitted or a string composed of a single space character (such as ' ' or "\x20", but not e.g. "/ /"). In this case, any leading whitespace in EXPR is removed before splitting occurs, and the PATTERN is instead treated as if it were "/\s+/"; in particular, this means that any contiguous whitespace (not just a single space character) is used as a separator.
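To make the difference concrete, here is a rough sketch in Python (the language of the comparison script in the question) that emulates both forms of split on a single line. The helper names perl_pattern_split and perl_bare_split are invented for this illustration, and the emulation is only an approximation of Perl: Python's re.split keeps trailing empty strings, so they are stripped by hand to mimic Perl dropping trailing empty fields.

```python
import re


def perl_pattern_split(line):
    # Approximates Perl's `split /\s+/, $line` (hypothetical helper).
    fields = re.split(r"\s+", line)
    # Perl removes trailing empty fields but keeps a leading empty one.
    while fields and fields[-1] == "":
        fields.pop()
    return fields


def perl_bare_split(line):
    # Approximates Perl's bare `split`: leading whitespace is removed
    # first, then the line is split on runs of whitespace.
    return line.split()


indented = "    return 0;\n"
print(perl_pattern_split(indented))  # ['', 'return', '0;']
print(perl_bare_split(indented))     # ['return', '0;']
print(perl_pattern_split("\n"))      # []
```

Under these assumptions, an indented line yields one extra (empty) field with the pattern form, while a blank line yields no fields either way. That would be consistent with the Perl count being too high on indented C source.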

The full solution is therefore:

#!/usr/bin/perl -w
use strict;

my $lines=0;
my $words=0;
my $bytes=0;

while (<>) {
        $lines++;
        $bytes+=length;
        $words+=split;   # bare split strips leading whitespace first; split /\s+/ does not
}
print "$lines $words $bytes\n";
Paul Schutte answered Dec 31 '25 13:12