
How to convert unicode numbers to ints?

"Arabic and Chinese have their own glyphs for digits. int works correctly with all the different ways to write numbers."

I was not able to reproduce this behaviour (Python 3.5.0):

>>> from unicodedata import name
>>> name('𐹤')
'RUMI DIGIT FIVE'
>>> int('𐹤')
ValueError: invalid literal for int() with base 10: '𐹤'
>>> int('五')  # chinese/japanese number five
ValueError: invalid literal for int() with base 10: '五'

Am I doing something wrong, or is the claim simply incorrect (source)?

wim asked Sep 26 '16 18:09


2 Answers

Here's a way to convert to numerical values (casting to int does not work in all cases, unless there's a secret setting somewhere):

from unicodedata import numeric
print(numeric('五'))

result: 5.0

Someone noted (rightly) that Arabic digits and some other characters work fine with int, so a routine with a fallback mechanism could be written:

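from unicodedata import numeric

def to_integer(s):
    try:
        # int() accepts decimal digits from any script, e.g. '۵۵'
        r = int(s)
    except ValueError:
        # numeric() takes a single character; int() drops any fractional part
        r = int(numeric(s))
    return r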

EDIT: as zvone noted, some characters have fractional values, so numeric returns a non-integer float for them: for example, numeric('\u00be') is 0.75 (the ¾ character). int() truncates that to 0, so converting to int is not always safe.
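
One way to guard against this is to refuse any value that is not a whole number instead of silently converting it. The following is only a sketch: to_integer_strict is a hypothetical name chosen for illustration, not something provided by the standard library.

```python
from unicodedata import numeric

def to_integer_strict(s):
    """Convert s to int, refusing values that are not whole numbers.

    Hypothetical helper for illustration; not part of the standard library.
    """
    try:
        # int() handles decimal digits from any script, e.g. '۵۵'
        return int(s)
    except ValueError:
        pass
    # numeric() takes exactly one character: it raises TypeError for longer
    # strings and ValueError for characters that have no numeric value
    value = numeric(s)
    if not value.is_integer():
        raise ValueError("%r is numeric but not a whole number: %r" % (s, value))
    return int(value)

print(to_integer_strict('۵۵'))
print(to_integer_strict('五'))
try:
    to_integer_strict('¾')
except ValueError as exc:
    print(exc)
```

The first two calls return 55 and 5; the third raises ValueError because 0.75 has a fractional part.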

EDIT2: the numeric function only accepts a single character, so a "conversion to numeric" that handles most cases without the risk of losing a fractional part would be:

from unicodedata import numeric

def to_float(s):
    try:
        r = float(s)
    except ValueError:
        r = numeric(s)
    return r

print(to_float('۵۵'))
print(to_float('五'))
print(to_float('¾'))

result:

55.0
5.0
0.75
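
Because numeric only takes a single character, to_float still rejects multi-character strings that float cannot parse. The quick check below repeats the function so that it runs on its own; the sample strings are just illustrative inputs. Note that the failure surfaces as a TypeError from numeric rather than a ValueError.

```python
from unicodedata import numeric

def to_float(s):
    try:
        return float(s)
    except ValueError:
        # numeric() raises TypeError if s is not exactly one character
        return numeric(s)

for text in ('五五', '1¾'):
    try:
        print(to_float(text))
    except (TypeError, ValueError) as exc:
        print(repr(text), 'rejected:', type(exc).__name__)
```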

(I don't want to steal user2357112's excellent explanation, but I still wanted to provide a solution that tries to cover all cases.)

Jean-François Fabre answered Sep 21 '22 02:09


int does not accept all ways to write numbers. It understands digit characters used for positional numeral systems, but neither Rumi nor Chinese numerals are positional. Neither '五五' nor two copies of Rumi numeral 5 would represent 55, so int doesn't accept them.
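
A quick way to see this distinction is str.isdecimal(), which is true only for decimal digit characters (Unicode category Nd), whereas str.isnumeric() is also true for characters that merely carry a numeric value. The sketch below compares the two with what int actually accepts; accepted_by_int is a hypothetical helper name used only for illustration.

```python
import unicodedata

def accepted_by_int(s):
    """Return True if int(s) succeeds (hypothetical helper for illustration)."""
    try:
        int(s)
    except ValueError:
        return False
    return True

# Columns: character name, Unicode category, isdecimal, isnumeric, int() works
for ch in ('5', '۵', '五', '¾'):
    print(unicodedata.name(ch), unicodedata.category(ch),
          ch.isdecimal(), ch.isnumeric(), accepted_by_int(ch))
```

For these four characters the isdecimal column matches the int column, while isnumeric is true for all of them.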

user2357112 supports Monica answered Sep 18 '22 02:09