I've run into a text processing problem. I've an article, and I'd like to find out how many "real" words there are.
Here is what I mean by "real". Articles usually contain various punctuation marks such as dashes, and commas, dots, etc. What I'd like to find out is how many words there are, skipping like "-" dashes and "," commas with spaces, etc.
I tried doing this:
my @words = split ' ', $article;
print scalar @words, "\n";
But that includes various punctuations that have spaces in them as words.
So I'm thinking of using this:
my @words = grep { /[a-z0-9]/i } split ' ', $article;
print scalar @words, "\n";
This would match all words that have characters or numbers in them. What do you think, would this be good enough way to count words in an article?
Does anyone know maybe of a module on CPAN that does this?
Try to use: \W - any non-word character, and also drop _
Solution
use strict;
my $article = 'abdc, dd_ff, 11i-11, ff44';
# case David's, but it didn't work with I'm or There's
$article =~ s/\'//g;
my $number_words = scalar (split /[\W_]+/, $article);
print $number_words;
I think your solution is about as good as you're going to get without resorting to something elaborate.
You could also write it as
my @words = $article =~ /\S*\w\S*/
or count the words in a file by writing
my $n = 0;
while (<>) {
my @words = /\S*\w\S*/g;
$n += @words;
}
say "$n words found";
Try a few sample blocks of text and look at the list of "words" that it finds. If you are happy with that then your code works.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With