I'm trying to find a solution for capitalising names in a Perl web app (using Perl v5.10.1). I originally planned to use Lingua::EN::NameCase, but I am seeing problems with accented characters.
I need to be able to handle accented characters from a variety of European languages (Irish, French, German).
I have seen some indications online that Lingua::EN::NameCase should work for my use case, for example this page on PerlMonks: http://www.perlmonks.org/?node_id=889135
Here is my test code, based on the link above:
#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::NameCase;
use locale;
use POSIX qw(locale_h);
my $locale = 'en_FR.utf8';
setlocale( LC_CTYPE, $locale );
binmode DATA, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
while (my $original_name = <DATA>) {
    chomp $original_name;
    my $normalized_name = nc($original_name);
    printf "%30s L::EN::NC %30s UCFIRST %30s\n", $original_name, $normalized_name, xlc($original_name);
}
sub xlc {
    my $str = shift;
    $_ = lc( $str );
    return join q{} => ( map { ucfirst(lc($_)) } ( $str =~ m/(\W+|\w+)/g ) );
};
__DATA__
ÉTIENNE DE LA BOÉTIE
ÉMILIE DU CHÂTELET
HÉLÈNE CIXOUS
Seán Ó Hannracháín
Máire Ó hÓgartaigh
This produces the output below. Both L::EN::NC and the custom ucfirst(lc()) solution give incorrect results (note the capital letter following each accented character). This seems to be because the Perl regex engine is matching a "word boundary" before and after each accented character. I would have expected a word boundary to match only between a space character and a non-space character.
Can anybody suggest a solution?
Thanks,
Brian.
ÉTIENNE DE LA BOÉTIE L::EN::NC éTienne de la BoéTie UCFIRST ÉTienne De La BoÉTie
ÉMILIE DU CHÂTELET L::EN::NC éMilie du ChâTelet UCFIRST ÉMilie Du ChÂTelet
HÉLÈNE CIXOUS L::EN::NC HéLèNe Cixous UCFIRST HÉLÈNe Cixous
Seán Ó Hannracháín L::EN::NC SeáN ó HannracháíN UCFIRST SeÁN ó HannrachÁíN
Máire Ó hÓgartaigh L::EN::NC MáIre ó HóGartaigh UCFIRST MÁIre ó HÓGartaigh
Perl 5.10 is old; you should upgrade if you can.
Below is a version I use for similar situations (tested with Perl 5.14.2):
#!/usr/bin/perl
use strict;
use warnings;
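use utf8::all;    # CPAN module: turns on UTF-8 handling for the source and the standard handles
while (<DATA>) {
    chomp;
    printf "%30s ==> %30s\n", $_, xlc($_);
}
sub xlc {
    my $str = shift;
    # Capitalise the first letter of every word and lowercase the rest.
    $str =~ s/(\w+)/ucfirst(lc($1))/ge;
    # Put common name particles (Le, La, Les, Von, De, Du, ...) back in lowercase.
    $str =~ s/( L[ea]s?
              | Von
              | D[aeou]s?
              )\b
             /lc($1)/xge;
    return $str;
};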
__DATA__
ÉTIENNE DE LA BOÉTIE
ÉMILIE DU CHÂTELET
HÉLÈNE CIXOUS
Seán Ó Hannracháín
Máire Ó hÓgartaigh
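If installing utf8::all from CPAN is not an option, something similar can probably be done with core pragmas only. The following is a minimal sketch, not a tested drop-in replacement: it assumes the script file is saved as UTF-8, and name_case is a hypothetical helper name, not part of any module. It deliberately leaves out "use locale", on the assumption that locale-based character classes are what made \w reject the accented letters in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # the string literals below are UTF-8
binmode STDOUT, ':encoding(UTF-8)';    # encode output as UTF-8

# Hypothetical helper: title-case each word, then lowercase a few particles.
sub name_case {
    my $str = shift;
    # With decoded (character) strings and no "use locale", \w matches
    # accented letters, so each whole word is handled as one unit.
    $str =~ s/(\w+)/ucfirst(lc($1))/ge;
    # Lowercase common particles: Le, La, Les, Von, De, Du, Das, ...
    $str =~ s/\b(L[ea]s?|Von|D[aeou]s?)\b/lc($1)/ge;
    return $str;
}

my @names = (
    'ÉTIENNE DE LA BOÉTIE',
    'ÉMILIE DU CHÂTELET',
    'HÉLÈNE CIXOUS',
    'Seán Ó Hannracháín',
    'Máire Ó hÓgartaigh',
);
printf "%30s ==> %30s\n", $_, name_case($_) for @names;
```

Note that, like the version above, this turns "hÓgartaigh" into "Hógartaigh", so Irish names with a lowercase h prefix would still need a special case.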