
How do I count the "real" words in a text with Perl?

I've run into a text processing problem. I have an article, and I'd like to find out how many "real" words it contains.

Here is what I mean by "real". Articles usually contain punctuation marks such as dashes, commas, and dots. I'd like to count the words while skipping standalone punctuation, such as a "-" or a "," that has spaces around it.

I tried doing this:

my @words = split ' ', $article;
print scalar @words, "\n";

But that counts punctuation marks surrounded by spaces as words.

So I'm thinking of using this:

my @words = grep { /[a-z0-9]/i } split ' ', $article;
print scalar @words, "\n";

This keeps only the tokens that contain at least one letter or digit. Would this be a good enough way to count the words in an article?

Does anyone know of a module on CPAN that does this?
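To make the difference between the two attempts concrete, here is a minimal sketch that runs both on the same text. The sample sentence is made up for illustration, and note that `[a-z0-9]` only covers ASCII letters and digits.

```perl
use strict;
use warnings;

# Hypothetical sample text; the lone "-", "," and "." are the tokens to skip.
my $article = "Perl - the language , not the pearl - is fun .";

# Plain whitespace split: every space-separated token counts as a word.
my @all = split ' ', $article;

# Keep only the tokens that contain at least one ASCII letter or digit.
my @real = grep { /[a-z0-9]/i } @all;

printf "%d tokens, %d real words\n", scalar @all, scalar @real;
# prints "12 tokens, 8 real words"
```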

bodacydo asked Oct 28 '25 07:10


2 Answers

Try splitting on \W (any non-word character), and also treat the underscore as a separator, since \w matches it.

Solution

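use strict;
use warnings;

my $article = 'abdc,  dd_ff,  11i-11,  ff44';

# Remove apostrophes so that "David's", "I'm" and "There's" each count as one word
$article =~ s/'//g;

# Split on runs of non-word characters and underscores; grep drops the empty
# leading field that split returns when the text starts with a separator
my $number_words = grep { length } split /[\W_]+/, $article;

print "$number_words\n";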
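An alternative to splitting is to match the words directly, which sidesteps the empty leading field that split returns when the text begins with punctuation. The sketch below is one possible way to do it; `count_words` is a hypothetical helper name, not something from a module.

```perl
use strict;
use warnings;

# Count runs of letters and digits; [^\W_] means "\w minus the underscore".
sub count_words {
    my ($text) = @_;
    $text =~ s/'//g;    # drop apostrophes so "I'm" stays one word

    # Assigning the match list to () in scalar context yields the match count.
    my $count = () = $text =~ /[^\W_]+/g;
    return $count;
}

print count_words('abdc,  dd_ff,  11i-11,  ff44'), "\n";   # 6
print count_words('"Quoted" text'), "\n";                  # 2
```

With a plain `scalar(split /[\W_]+/, ...)` the second string would count as 3, because the leading quote produces an empty first field.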
Pavel Vlasov answered Oct 29 '25 23:10


I think your solution is about as good as you're going to get without resorting to something elaborate.

You could also write it as

my @words = $article =~ /\S*\w\S*/g;

or count the words in a file by writing

use feature 'say';  # say needs Perl 5.10 or later

my $n = 0;
while (<>) {
  my @words = /\S*\w\S*/g;
  $n += @words;
}

say "$n words found";

Try a few sample blocks of text and look at the list of "words" that it finds. If you are happy with that then your code works.
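One way to do that check is to print every match in brackets so stray punctuation is easy to spot. This is only a sketch, and the sample string is invented; swap in text from your own article.

```perl
use strict;
use warnings;

# Hypothetical sample; replace with a block of text from the real article.
my $sample = q{Well - "real" words, e.g. don't, 3.14 and foo_bar ... count?};

# Each match is a whitespace-delimited token containing at least one \w character.
my @words = $sample =~ /\S*\w\S*/g;

print scalar(@words), " words:\n";
print "  [$_]\n" for @words;
```

Here the lone "-" and "..." are skipped, while tokens such as `words,` and `count?` keep their attached punctuation, which does not affect the count.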

Borodin answered Oct 30 '25 00:10