Aim: I want to separate words to count their frequency in a document and then do some calculations on those frequencies.
The words can begin/contain/end with any of the following:
The words can contain (but not begin or end with)
The words can be separated by any symbol or whitespace like $, #, &, tab character
Problem:
What I've tried:
Approach: First I replace everything except \w (alphanumeric plus "_"), ' and - with a space. Then I remove ', _ and - if they are found at the beginning or end of a word. Finally I replace multiple spaces with a single space and split the words.
Code: I am using a series of regex replacements as follows:
$str =~ s/[^\w'-]/ /g;
#Also tried using $str =~ s/[^:alpha:0-9_'-]/ /g; but doesn't work
$str =~ s/- / /;
$str =~ s/' / /;
$str =~ s/_ / /;
$str =~ s/ -/ /;
$str =~ s/ '/ /;
$str =~ s/ _/ /;
$str =~ s/ +/ /;
foreach $word (split(' ', lc $str)) {
#do something
}
Constraints: I have to do it in Perl (since this is part of a larger program I've written in Perl), but I can use options other than regex.
You can use the \p{L} character class, which matches any letter, and \P{L}, which matches anything that is not a letter.
To allow quotes, underscores, and hyphens inside a word, you can use:
\p{L}[\p{L}'_-]*
or, to avoid non-letters at the word boundaries:
\p{L}+(?:['_-]\p{L}+)*
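As a rough sketch of how the second pattern could feed a frequency count, the snippet below matches words in a loop instead of replacing and splitting. The helper name word_frequencies and the sample sentence are made up for illustration. Note that \p{L} matches letters only, so digits are not treated as part of a word here; if they should be, adding \p{N} to the classes would be one way to extend it.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): returns a hash ref that maps
# each lowercased word to the number of times it occurs in $text.
sub word_frequencies {
    my ($text) = @_;
    my %count;
    # A word is a run of letters, optionally followed by groups made of a
    # single ', _ or - and another run of letters. It therefore never
    # starts or ends with one of those symbols.
    while ($text =~ /(\p{L}+(?:['_-]\p{L}+)*)/g) {
        $count{ lc $1 }++;
    }
    return \%count;
}

my $freq = word_frequencies("Don't stop -- it's a well-known trick; don't!");
printf "%s => %d\n", $_, $freq->{$_} for sort keys %$freq;
```

With this input, "don't" is counted twice and the standalone "--" is ignored, since it contains no letter to start a match.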
Note: some accented characters can be represented by several code points. For instance, even though a single code point exists for à (a with grave accent), it can also be written as two code points: the ASCII letter a followed by the combining grave accent (U+0300). \p{L}\p{Mn}* matches a letter together with any combining marks that follow it, so the pattern becomes:
(?>\p{L}\p{Mn}*)+(?:['_-](?>\p{L}\p{Mn}*)+)*
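To see the difference the \p{Mn}* part makes, here is a small comparison on a string written with combining marks. The sample string and variable names are assumptions chosen for illustration; only the two patterns come from the answer.

```perl
use strict;
use warnings;

# "deja vu" with accents written as combining marks:
# e + U+0301 (combining acute), a + U+0300 (combining grave).
my $decomposed = "de\x{301}ja\x{300} vu";

# Letters only: a combining mark is not \p{L}, so it ends the current word.
my $plain  = qr/\p{L}+(?:['_-]\p{L}+)*/;
# Each letter may be followed by any number of nonspacing marks.
my $marked = qr/(?>\p{L}\p{Mn}*)+(?:['_-](?>\p{L}\p{Mn}*)+)*/;

my @plain_words  = $decomposed =~ /$plain/g;
my @marked_words = $decomposed =~ /$marked/g;

print scalar(@plain_words),  "\n";   # 3: "de", "ja", "vu"
print scalar(@marked_words), "\n";   # 2: the accented word stays whole, then "vu"
```

The letters-only pattern cuts the accented word in two at the first combining mark, while the mark-aware pattern keeps it in one piece.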
Using split is trickier and more error-prone IMO, in particular if you want to deal with combining characters. But basically, to match the separators you can use:
[^\p{L}\p{Mn}'_-]+
Or to be more explicit:
[^\p{L}\p{Mn}'_-]+|(?<![\p{L}\p{Mn}])['_-]+|[-_']+(?!\p{L})
which also splits on hyphens, underscores, and quotes that are not surrounded by letters.
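For completeness, a minimal sketch of the split variant with the more explicit separator pattern. The sample text is invented for illustration. Adjacent separators (for example a space followed by an opening quote) match as two separate delimiters and leave an empty field between them, so the sketch filters empty strings out with grep.

```perl
use strict;
use warnings;

# Separator pattern from the answer: non-word characters, plus quotes,
# underscores and hyphens that are not surrounded by letters.
my $sep = qr/[^\p{L}\p{Mn}'_-]+|(?<![\p{L}\p{Mn}])['_-]+|[-_']+(?!\p{L})/;

my $text = "it's a 'quoted' well-known -- trick";

# split can return empty fields between back-to-back separators,
# so keep only the non-empty pieces.
my @words = grep { length } split $sep, lc $text;

print join('|', @words), "\n";   # it's|a|quoted|well-known|trick
```

The extra filtering step is one reason the match-based approach tends to be simpler than splitting.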