Text::SpellChecker module and Unicode

#!/usr/local/bin/perl
use strict;
use warnings;

use Text::SpellChecker;

my $text = "coördinator";
my $checker = Text::SpellChecker->new( text => $text );

while ( my $word = $checker->next_word ) {
    print "Bad word is $word\n";
}

Output: Bad word is rdinator

Desired: Bad word is coördinator

The module breaks when $text contains Unicode characters. Any idea how this can be solved?

I have Aspell 0.50.5 installed, which this module uses. I think it might be the culprit.

Edit: As Text::SpellChecker requires either Text::Aspell or Text::Hunspell, I removed Text::Aspell, installed Hunspell and Text::Hunspell, and then ran:

$ hunspell -d en_US -l < badword.txt
coördinator

This shows the correct result, which means something is wrong either with my code or with Text::SpellChecker.


Taking Miller's suggestion into consideration, I tried the following:

#!/usr/local/bin/perl
use strict;
use warnings;
use Text::SpellChecker;
use utf8;
binmode STDOUT, ":encoding(utf8)";
my $text =  "coördinator";
my $flag = utf8::is_utf8($text);
print "Flag is $flag\n";
print "Text is $text\n";
my $checker = Text::SpellChecker->new(text => $text);
while (my $word = $checker->next_word) {
    print "Bad word is $word\n";
}

OUTPUT:

Flag is 1
Text is coördinator
Bad word is rdinator

Does this mean the module cannot handle UTF-8 characters properly?

Chankey Pathak asked Nov 03 '14 04:11


2 Answers

It is a Text::SpellChecker bug: the current version (0.11) assumes ASCII-only words.

http://cpansearch.perl.org/src/BDUGGAN/Text-SpellChecker-0.11/lib/Text/SpellChecker.pm

#
# next_word
# 
# Get the next misspelled word. 
# Returns false if there are no more.
#
sub next_word {
    ...
    while ($self->{text} =~ m/([a-zA-Z]+(?:'[a-zA-Z]+)?)/g) {

IMHO the best fix would be to use a per-language/locale word-splitting regular expression, or to leave word splitting to the underlying library. aspell list reports coördinator as a single word.
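To make the effect of that pattern concrete, here is a minimal sketch comparing the ASCII-only regex quoted above with a Unicode-aware variant. The \p{L}/\p{M} pattern and the helper names ascii_words and unicode_words are my own assumptions for illustration; this is not necessarily how the module itself was (or should be) fixed.

```perl
use strict;
use warnings;
use utf8;                               # string literals in this file are UTF-8
binmode STDOUT, ':encoding(UTF-8)';     # encode characters on output

# The ASCII-only pattern quoted above from version 0.11.
sub ascii_words {
    my ($text) = @_;
    return $text =~ m/([a-zA-Z]+(?:'[a-zA-Z]+)?)/g;
}

# Assumed alternative: accept any Unicode letter (\p{L}) or combining mark (\p{M}).
sub unicode_words {
    my ($text) = @_;
    return $text =~ m/([\p{L}\p{M}]+(?:'[\p{L}\p{M}]+)?)/g;
}

my $text = "The coördinator's naïve plan";
print "ASCII:   ", join('|', ascii_words($text)),   "\n";
print "Unicode: ", join('|', unicode_words($text)), "\n";
```

Saved as a UTF-8 file, this should print The|co|rdinator's|na|ve|plan for the ASCII pattern and The|coördinator's|naïve|plan for the Unicode one. Note that the Unicode pattern only helps if the text is a decoded character string (via use utf8 for literals, or decoding input explicitly); on raw UTF-8 bytes it would still split inside the word.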

AnFi answered Sep 25 '22 19:09


I've incorporated Chankey's solution and released version 0.12 to CPAN; give it a try.
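Before retesting, it may be worth confirming which version is actually installed. A small sketch, using Perl's built-in VERSION method; the helper name module_at_least is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $module loads and its version is at least $min.
sub module_at_least {
    my ($module, $min) = @_;
    return 0 unless eval "require $module; 1";            # module not installed
    return eval { $module->VERSION($min); 1 } ? 1 : 0;    # VERSION($min) dies if too old
}

print module_at_least('Text::SpellChecker', '0.12')
    ? "Text::SpellChecker 0.12 or later is installed\n"
    : "Text::SpellChecker is missing or older than 0.12\n";
```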

The validity of the diaeresis in words like coördinator is an interesting question. The default aspell and hunspell dictionaries seem to mark it as incorrect, though some publications may disagree.

best, Brian

Brian answered Sep 25 '22 19:09