
Regex for matching accent characters

Tags: string, regex, perl

Aim: I want to separate words to count their frequency in a document and then do some calculations on those frequencies.

The words can begin/contain/end with any of the following:

  • numbers
  • letters (including é, ú, ó etc., but not symbols like $, #, & etc.)

The words can contain (but not begin or end with):

  • underscore (eg: rishi_dua)
  • single quote (eg: can't)
  • hyphen (eg: 123-)

The words can be separated by any symbol or whitespace, such as $, #, & or a tab character.

Problem:

  1. I'm not able to find out how to match é, ú, ó etc without matching other special characters.
  2. What would be a more efficient way to do this? (optional)
  3. Splitting by space is working for me at the moment as there is no other

What I've tried:

Approach:

  1. First I replace everything except \w (alphanumeric plus "_"), ' and - with a space.
  2. Then I remove ', _ and - if they are found at the beginning or end of a word.
  3. Finally I replace multiple spaces with a single space and split the words.

Code: I am using a series of regex substitutions as follows:

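# Replace everything except word characters, ' and - with a space
$str =~ s/[^\w'-]/ /g;
#Also tried using $str =~ s/[^:alpha:0-9_'-]/ /g; but doesn't work
# Strip -, ' and _ at the end of a word (/g so every occurrence is handled)
$str =~ s/- / /g;
$str =~ s/' / /g;
$str =~ s/_ / /g;
# Strip -, ' and _ at the beginning of a word
$str =~ s/ -/ /g;
$str =~ s/ '/ /g;
$str =~ s/ _/ /g;

# Collapse runs of spaces into a single space
$str =~ s/ +/ /g;
foreach my $word (split(' ', lc $str)) {
    #do something
}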

Constraints: I have to do it in Perl (since this is part of a larger program I've written in Perl), but I can use options other than regex.

Rishi Dua asked Jul 05 '13 02:07


1 Answer

You can use the \p{L} character class, which matches any letter, and \P{L}, which matches anything that is not a letter.
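As a quick check of what \p{L} does and does not match, here is a minimal sketch. The sample characters are my own choice for illustration, and they are written as \x{...} escapes so the snippet does not depend on the encoding of the source file.

```perl
use strict;
use warnings;

# Hypothetical sample characters (not from the question), written as
# escapes: é, ú, a, $, #, 5.
my @samples = ("\x{e9}", "\x{fa}", "a", "\$", "#", "5");

my %is_letter;
for my $ch (@samples) {
    # \p{L} matches any Unicode letter; \P{L} matches everything else.
    $is_letter{$ch} = $ch =~ /\p{L}/ ? 1 : 0;
    printf "U+%04X %s\n", ord($ch), $is_letter{$ch} ? "letter" : "not a letter";
}
```

Note that the digit 5 is not a letter either, so if digits must count as word characters, something like \p{N} would have to be added next to \p{L}.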

To allow the quote, underscore and hyphen inside a word, you can use:

\p{L}[\p{L}'_-]*, or \p{L}+(?:['_-]\p{L}+)* to keep non-letters off both ends of the word.
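For the word-frequency goal in the question, the second pattern can be used to match words directly instead of splitting. This is only a sketch: the sample text and the %freq hash are hypothetical, and it assumes $str already holds decoded characters (for instance because the file was read through an :encoding(UTF-8) layer) rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical sample text; \x{e9} is é, written as an escape so the
# snippet does not depend on the source file's encoding.
my $str = "Caf\x{e9} can't stop: caf\x{e9}, rishi_dua & -dash- \$5";

my %freq;
# Match words instead of splitting: runs of letters, optionally joined
# by a single ', _ or - between two letter runs.
while ($str =~ /(\p{L}+(?:['_-]\p{L}+)*)/g) {
    $freq{ lc $1 }++;
}

# Print the counts; the output handle needs an encoding layer because
# some keys contain non-ASCII characters.
binmode STDOUT, ':encoding(UTF-8)';
printf "%s: %d\n", $_, $freq{$_} for sort keys %freq;
```

With this sample, "-dash-" is counted as "dash", and "$5" produces no word at all because \p{L} does not match digits.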

Note: some accented characters can be represented by several code points. For instance, even though a single code point exists for à (a with grave, U+00E0), the same character can also be written as two code points: the ASCII letter a followed by the combining grave accent (U+0300). \p{L}\p{Mn}* matches these kinds of glyphs:

(?>\p{L}\p{Mn}*)+(?:['_-](?>\p{L}\p{Mn}*)+)*
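To see the difference the \p{Mn}* part makes, here is a small sketch comparing the two patterns on a decomposed string. The sample text and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: "résumé" written in decomposed form, i.e. a plain
# "e" followed by U+0301 (combining acute accent), then "café" with the
# precomposed é (U+00E9).
my $str = "re\x{301}sume\x{301} caf\x{e9}";

# Pattern that keeps combining marks attached to their base letter.
my $word = qr/(?>\p{L}\p{Mn}*)+(?:['_-](?>\p{L}\p{Mn}*)+)*/;

my @with_marks   = $str =~ /($word)/g;
my @letters_only = $str =~ /(\p{L}+(?:['_-]\p{L}+)*)/g;

printf "with marks: %d words, letters only: %d words\n",
    scalar @with_marks, scalar @letters_only;
```

The letters-only pattern breaks the decomposed word at each combining mark ("re", "sume"), while the version with \p{Mn}* keeps it in one piece.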

Using a split is more error-prone and more difficult IMO, in particular if you want to deal with combining characters. But basically, to match the separators you can use:

[^\p{L}\p{Mn}'_-]+

Or, to be more explicit:

[^\p{L}\p{Mn}'_-]+|(?<![\p{L}\p{Mn}])['_-]+|[-_']+(?!\p{L})

which also splits on hyphens, underscores and quotes that are not surrounded by letters.
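If you do go the split route, a minimal sketch with the longer separator pattern could look like this. The sample string is hypothetical. Because two separators can match back to back (for example a space and then a leading hyphen), split returns empty fields, which I filter out with grep.

```perl
use strict;
use warnings;

# Hypothetical sample text; \x{e9} is é written as an escape.
my $str = "can't -stop- rishi_dua \$ caf\x{e9}'s' #tag";

# Separator: anything that is not a letter, mark, ', _ or -, plus any
# run of ', _ or - that is not preceded or not followed by a letter.
my $sep = qr/[^\p{L}\p{Mn}'_-]+|(?<![\p{L}\p{Mn}])['_-]+|[-_']+(?!\p{L})/;

# Adjacent separators produce empty fields, so drop those.
my @words = grep { length } split /$sep/, $str;

printf "%d words\n", scalar @words;
```

Here "-stop-" comes out as "stop" and the trailing quote in "café's'" is dropped, while the quote and underscore inside "can't", "café's" and "rishi_dua" are kept.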

Casimir et Hippolyte answered Oct 16 '22 08:10