How do I count the "real" words in a text with Perl?

Question

I've run into a text processing problem. I have an article, and I'd like to find out how many "real" words it contains.

Here is what I mean by "real". Articles usually contain punctuation marks such as dashes, commas, and periods. I'd like to count the words while skipping standalone punctuation, such as a dash "-" or a comma "," surrounded by spaces.

I tried doing this:

my @words = split ' ', $article;
print scalar @words, "\n";

But that counts punctuation marks surrounded by spaces as words.

So I'm thinking of using this:

my @words = grep { /[a-z0-9]/i } split ' ', $article;
print scalar @words, "\n";

This keeps only the tokens that contain at least one letter or digit. What do you think: is this a good enough way to count the words in an article?

Does anyone know of a CPAN module that does this?
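To make the difference concrete, here is a small sketch that runs both counts on the same text. The sample sentence and the sub names `count_tokens` and `count_real_words` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text, made up for illustration
my $article = 'Perl - the glue language , or so they say ... is 30+ years old !';

# Naive count: every whitespace-separated token is treated as a word
sub count_tokens {
    my ($text) = @_;
    my @tokens = split ' ', $text;
    return scalar @tokens;
}

# Filtered count: keep only tokens containing at least one letter or digit
sub count_real_words {
    my ($text) = @_;
    my @words = grep { /[a-z0-9]/i } split ' ', $text;
    return scalar @words;
}

print count_tokens($article), "\n";        # 16
print count_real_words($article), "\n";    # 12
```

The four tokens that get filtered out here are "-", ",", "..." and "!".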

Pavel Vlasov · Accepted Answer

Try splitting on \W (any non-word character), and treat _ as a separator as well.

Solution

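use strict;
use warnings;

my $article = 'abdc,  dd_ff,  11i-11,  ff44';

# Remove apostrophes first, so that "David's", "I'm" and "There's"
# each count as one word instead of being split in two
$article =~ s/'//g;

# Split on runs of non-word characters and underscores; grep { length }
# drops the empty leading field produced when the text starts with a separator
my $number_words = grep { length } split /[\W_]+/, $article;

print "$number_words\n";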
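To see what this split actually counts, the following sketch lists the pieces it produces rather than only their number. The sub name `split_words` and the second sample string are assumptions for illustration; the `grep { length }` guard drops the empty leading field that `split` returns when the text starts with a separator.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the list of words instead of only the count
sub split_words {
    my ($text) = @_;
    $text =~ s/'//g;                                 # join contractions: I'm -> Im
    return grep { length } split /[\W_]+/, $text;    # skip empty leading field
}

my @words = split_words('abdc,  dd_ff,  11i-11,  ff44');
print join('|', @words), "\n";    # abdc|dd|ff|11i|11|ff44

my @more = split_words("- I'm sure there's more");
print scalar(@more), "\n";        # 4
```

Note that this approach counts "dd_ff" and "11i-11" as two words each, whereas a whitespace-based split would count each of them once.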

Borodin · Answer

I think your solution is about as good as you're going to get without resorting to something elaborate.

You could also write it as

my @words = $article =~ /\S*\w\S*/g;

or count the words in a file by writing

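use feature 'say';    # needed for say

my $n = 0;
while (<>) {
  # Every run of non-space characters that contains at least one word character
  my @words = /\S*\w\S*/g;
  $n += @words;
}

say "$n words found";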

Try a few sample blocks of text and look at the list of "words" that it finds. If you are happy with that list, then your code is good enough.
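A minimal sketch of that kind of check might look like the following. The sample strings and the sub name `find_words` are made up for illustration; substitute text from your own article.

```perl
use strict;
use warnings;

# Hypothetical sample blocks, made up for illustration
my @samples = (
    'Well - I think , therefore I am .',
    q{It's a "real" word-count test ...},
);

# Return every run of non-space characters that contains a word character.
# Call this in list context; in scalar context the match returns a boolean.
sub find_words {
    my ($text) = @_;
    return $text =~ /\S*\w\S*/g;
}

for my $sample (@samples) {
    my @words = find_words($sample);
    printf "%d words: %s\n", scalar @words, join(' | ', @words);
}
```

Printing the matches next to the count makes it easy to spot tokens you would not consider words.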

Tags:

text

perl

text-processing

bodacydo
