Capitalizing strings which contain accented characters

Question

I'm trying to find a solution for capitalising names in a Perl web app (using Perl v5.10.1). I originally thought to use Lingua::EN::NameCase, but I am seeing some problems with accented characters.

I need to be able to deal with accented characters from a variety of European languages (Irish, French, German).

I have seen some indications online that Lingua::EN::NameCase should work for my use case. For example, this page on PerlMonks: http://www.perlmonks.org/?node_id=889135

Here is my test code, based on the above link:

#!/usr/bin/perl

use strict;
use warnings;
use Lingua::EN::NameCase;
use locale;
use POSIX qw(locale_h);

my $locale = 'en_FR.utf8';

setlocale( LC_CTYPE, $locale );

binmode DATA,   ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

while (my $original_name = <DATA>) {
    chomp $original_name;
    my $normalized_name = nc($original_name);
    printf "%30s L::EN::NC %30s UCFIRST %30s\n", $original_name, $normalized_name, xlc($original_name);
}

sub xlc {
    my $str = shift;
    $_ = lc( $str );
    return join q{} => ( map { ucfirst(lc($_)) } ( $str =~ m/(\W+|\w+)/g ) );
};

__DATA__
ÉTIENNE DE LA BOÉTIE
ÉMILIE DU CHÂTELET
HÉLÈNE CIXOUS
Seán Ó Hannracháín
Máire Ó hÓgartaigh

This produces the output below. Both L::EN::NC and the custom ucfirst(lc()) solution produce incorrect results (note the capital letter following each accented character). This seems to be because the Perl regex engine is matching a "word boundary" before and after each accented character. I would have expected a word boundary to match only between a space character and a non-space character.

Can anybody suggest a solution?

Thanks,

Brian.

  ÉTIENNE DE LA BOÉTIE L::EN::NC           éTienne de la BoéTie UCFIRST           ÉTienne De La BoÉTie
    ÉMILIE DU CHÂTELET L::EN::NC             éMilie du ChâTelet UCFIRST             ÉMilie Du ChÂTelet
         HÉLÈNE CIXOUS L::EN::NC                  HéLèNe Cixous UCFIRST                  HÉLÈNe Cixous
    Seán Ó Hannracháín L::EN::NC             SeáN ó HannracháíN UCFIRST             SeÁN ó HannrachÁíN
    Máire Ó hÓgartaigh L::EN::NC             MáIre ó HóGartaigh UCFIRST             MÁIre ó HÓGartaigh
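
The word-boundary hypothesis above can be illustrated in isolation. The following is a minimal sketch, not a diagnosis of the script above (which also involves `use locale` and a locale setting): it only shows that `\w` treats an accented letter as a non-word character when the string is still raw UTF-8 bytes, and as a word character once the string has been decoded. The helper name `word_runs` is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return (as an array ref) the runs of word characters that \w+ finds.
sub word_runs {
    my ($str) = @_;
    return [ $str =~ /(\w+)/g ];
}

# "ÉTIENNE" as raw UTF-8 bytes: É is the two bytes 0xC3 0x89.
my $bytes     = "\xC3\x89TIENNE";
my $byte_runs = word_runs($bytes);

# The same name decoded into characters: É is the single character U+00C9.
my $chars     = decode('UTF-8', $bytes);
my $char_runs = word_runs($chars);

# Only lengths are printed, so the output is plain ASCII.
printf "bytes: first word run has %d characters\n", length $byte_runs->[0];
printf "chars: first word run has %d characters\n", length $char_runs->[0];
```

On the byte string the first run is `TIENNE` (6 characters), because the two bytes of the É are not word characters; on the decoded string the run is the whole 7-character name.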

JJoao · Accepted Answer

Perl 5.10 is old; you should upgrade if you can.

Below is a version I use in similar situations (tested with Perl 5.14.2).

#!/usr/bin/perl

use strict;
use warnings;
use utf8::all;

while (<DATA>) { chomp;
    printf "%30s ==> %30s\n", $_, xlc($_);
}

sub xlc { my $str = shift;
    $str =~ s/(\w+)/ucfirst(lc($1))/ge;
    $str =~ s/( L[ea]s?
               | Von
               | D[aeou]s?
               )\b
              /lc($1)/xge;
    return $str;
};

__DATA__
ÉTIENNE DE LA BOÉTIE
ÉMILIE DU CHÂTELET
HÉLÈNE CIXOUS
Seán Ó Hannracháín
Máire Ó hÓgartaigh
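
Since utf8::all is a CPAN module rather than part of core Perl, here is a sketch of the same idea using only core features, in case installing it is not an option. The sub name `name_case` and the inline list of names are hypothetical choices for this example; the particle list is copied from the code above. The `utf8::upgrade` call is an extra precaution, assumed here so that the sub behaves the same whether or not the caller's string is already stored as characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the string literals in this file are UTF-8

# name_case is a hypothetical helper name; the particle list is copied
# from the answer above.
sub name_case {
    my ($str) = @_;
    # Force character (Unicode) semantics even if Perl happens to store
    # this string internally as Latin-1 bytes.
    utf8::upgrade($str);
    # Title-case each run of word characters. On a character string,
    # \w includes accented letters, so "ÉTIENNE" is a single run.
    $str =~ s/(\w+)/\u\L$1/g;
    # Put the listed particles back into lower case.
    $str =~ s/\b(L[ea]s?|Von|D[aeou]s?)\b/\l$1/g;
    return $str;
}

binmode STDOUT, ':encoding(UTF-8)';    # encode characters on output
for my $name ('ÉTIENNE DE LA BOÉTIE', 'ÉMILIE DU CHÂTELET', 'HÉLÈNE CIXOUS') {
    printf "%30s ==> %30s\n", $name, name_case($name);
}
```

With this sketch, `ÉTIENNE DE LA BOÉTIE` becomes `Étienne de la Boétie`. Like the version above, it does not preserve the Irish lower-case prefix: `hÓgartaigh` comes out as `Hógartaigh`, so such names would need a separate rule.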

Tags: regex, unicode, capitalization, perl

Asked by Brian Foley
