
PHP NumberFormatter Slovenian spellout wrong

Tags: php, intl

I am trying to spell out an integer amount in Slovenian words (for postal declarations) using the NumberFormatter class from the intl extension, but the result is completely wrong and makes no sense.

$fmt = new NumberFormatter('sl', NumberFormatter::SPELLOUT);
echo $fmt->format(561);

Results in "petsto šestdeset ena" while it should be "petsto enainšestdeset". Looks like baby talk instead.

In Croatian, which is quite similar, the result seems OK ("petsto šezdeset i jedan").

Is this a poorly done translation in PHP or is this based on my system locale? I'm on PHP 5.3.10 / Ubuntu 12.04.

EDIT:

intl is version 1.1.0, the current is 3.0.0, so maybe it has been fixed?

Omer Sabic asked Nov 12 '13 15:11


1 Answer

Disclaimer - I don't speak Slovenian or Croatian.

It looks like there are some gaps in the patterns which the PHP extension uses for the numbers in these languages. To see what I mean, you can show the pattern used by running:

$fmt = new NumberFormatter('sl', NumberFormatter::SPELLOUT);
echo $fmt->getPattern();

If you look at the output of this, you might spot that one section, "%spellout-cardinal-masculine:", seems to jump from 31 straight to 100.

...
    21: dvaset >%spellout-cardinal-masculine>;
    30: <%spellout-cardinal-masculine<deset;
    31: <%spellout-cardinal-masculine<deset >%spellout-cardinal-masculine>;
    100: sto;
    101: sto >%spellout-cardinal-masculine>;
    200: dvjesto;
...

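This means there are no dedicated rules for the numbers between 31 and 100. As far as I understand the ICU rule syntax, a rule applies from its base value up to the next rule's base value, so the rule at 31 handles everything up to 99 and always puts the tens before the units. The '61' part of the number you are outputting falls into this gap.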
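Since the full pattern is long, it can help to pull out just the one rule set before reading it. The following is a minimal sketch, not part of intl: `extractRuleSet` is a hypothetical helper name, and it assumes each rule set header (such as "%spellout-cardinal-masculine:") sits on its own line in the output of getPattern(), as it appears to in the excerpt above. It needs PHP 7.1 or later for the nullable return type.

```php
<?php
// Hypothetical helper (not part of intl): returns the lines of one rule set
// from an RBNF pattern string, or null if the rule set is not found.
// Assumption: each rule set header ("%name:") is on its own line.
function extractRuleSet(string $pattern, string $name): ?string
{
    $lines = preg_split('/\R/', $pattern);
    $collected = [];
    $inside = false;

    foreach ($lines as $line) {
        $trimmed = trim($line);

        if ($trimmed === "%{$name}:") {
            // Found the header of the rule set we want.
            $inside = true;
            $collected[] = $line;
            continue;
        }

        // Any other line starting with "%" opens the next rule set.
        if ($inside && $trimmed !== '' && $trimmed[0] === '%') {
            break;
        }

        if ($inside) {
            $collected[] = $line;
        }
    }

    return $inside ? implode("\n", $collected) : null;
}

// Usage: only runs where the intl extension is installed.
if (extension_loaded('intl')) {
    $fmt = new NumberFormatter('sl', NumberFormatter::SPELLOUT);
    echo extractRuleSet($fmt->getPattern(), 'spellout-cardinal-masculine'), "\n";
}
```

If the helper returns null, the rule set name was not found, which may simply mean your ICU version lays out the pattern differently.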

You can write your own pattern to fix this. I pasted in the rules from the en-US formatter and fiddled with them a bit, so it looks like this:

...
    21: dvaset >%spellout-cardinal-masculine>;
    30: <%spellout-cardinal-masculine<deset;
    31: <%spellout-cardinal-masculine<deset >%spellout-cardinal-masculine>;
    40: forty;
    41: forty->%spellout-cardinal-masculine>;
    50: fifty;
    51: fifty->%spellout-cardinal-masculine>;
    60: sixty;
    61: sixty->%spellout-cardinal-masculine>;
    70: seventy;
    71: seventy->%spellout-cardinal-masculine>;
    80: eighty;
    81: eighty->%spellout-cardinal-masculine>;
    90: ninety;
    91: ninety->%spellout-cardinal-masculine>;
    100: sto;
    101: sto >%spellout-cardinal-masculine>;
    200: dvjesto;
...

Now if I save this in a new file called sl.txt with UTF-8 encoding, I can load it into the NumberFormatter:

$pattern = file_get_contents('sl.txt');
$fmt = new NumberFormatter('sl', NumberFormatter::PATTERN_RULEBASED, $pattern);
echo($fmt->format(561));

This gives me the following output:

petsto sixty-ena

This is wrong, of course: it is a mixture of Slovenian and English. But I think you can fix it by editing the rule to be something like this:

...
    61: >%spellout-cardinal-masculine>inšestdeset;
...

As I said, I don't speak Slovenian, so you probably want to check it. But this will give you the following output:

petsto enainšestdeset

You will need to add this rule for each of the missing number blocks from 31-100. You might also want to check the ICU docs for rule-based formatting to make sure you get it correct.
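Rather than typing each block by hand, the rule lines could be generated. This is only a sketch: `buildTensRules` is a hypothetical helper, and the Slovenian words for the tens in `$tens` are my own assumption, not taken from the CLDR data, so a native speaker should confirm them. It follows the same shape as the 61 rule above (units first, then "in", then the tens).

```php
<?php
// Slovenian words for the tens 40-90. These are an assumption of this sketch,
// not taken from the CLDR data; have a native speaker confirm them.
$tens = [
    40 => 'štirideset',
    50 => 'petdeset',
    60 => 'šestdeset',
    70 => 'sedemdeset',
    80 => 'osemdeset',
    90 => 'devetdeset',
];

// Hypothetical helper: builds two rule lines for each block of ten.
function buildTensRules(array $tens, string $ruleSet = '%spellout-cardinal-masculine'): string
{
    $lines = [];
    foreach ($tens as $base => $word) {
        // Round number, e.g. "60: šestdeset;"
        $lines[] = sprintf('    %d: %s;', $base, $word);
        // Units first, then "in" plus the tens word, matching the 61 rule above.
        $lines[] = sprintf('    %d: >%s>in%s;', $base + 1, $ruleSet, $word);
    }
    return implode("\n", $lines) . "\n";
}

// Print the lines so they can be pasted into sl.txt between the 31 and 100 rules.
echo buildTensRules($tens);
```

The 21 and 31 rules in the excerpt also put the tens first, so they may need the same treatment.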

This is a bug, but not in PHP - if you would like to fix it then the issue is within Unicode's Common Locale Data Repository in this file. PHP's intl uses ICU which uses the CLDR data.

madebydavid answered Nov 05 '22 19:11