 

Parsing unicode 'Vulgar Fractions' into double in Java

I am scraping some data off a website, and parts of it include fractions in Unicode, e.g. 6' 5¼". I have successfully used the regex (\\d)' (\\d{1,2}([\\xbc-\\xbe])?)\" to extract each part of the string.

This gives me two strings, one is "6" and the other is "5¼".

The troublesome part is the string that contains the Unicode vulgar fraction. As expected, Double.parseDouble cannot parse it.

I have looked everywhere for Java examples but have been unable to find any. How would I go about getting ¼ out as 0.25?

If it makes it easier, I can split the regex up again so that it returns the fraction part separately, giving me three strings instead of two.

Matt Williams asked Dec 19 '22 11:12



1 Answer

There is a way to do it without maintaining your own table that maps vulgar fractions to values: you can use the one built into the Unicode data.

If you convert to Unicode Normalization Form KD, it'll decompose fractions into a fraction-slash (U+2044) surrounded by plain numbers. So you could do something like:

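// Requires: import java.text.Normalizer;
// NFKD decomposes "¼" into "1", U+2044 (fraction slash), "4".
String[] fraction = Normalizer.normalize("¼", Normalizer.Form.NFKD).split("\u2044");
if (fraction.length == 2) {
    // Cast to double first so the division is not integer division.
    double value = (double) Integer.parseInt(fraction[0]) / Integer.parseInt(fraction[1]); // 0.25
}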
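One caveat when applying this to the two strings from the question: if the whole-number part is still attached, as in "5¼", NFKD produces "51⁄4", and the split would read the numerator as 51. A minimal sketch of one way around that is to peel off the leading ASCII digits before normalizing. The class name VulgarFractions and the method parseMixed are hypothetical, chosen only for illustration:

```java
import java.text.Normalizer;

class VulgarFractions {
    // Hypothetical helper: converts strings such as "5¼", "5" or "¼" to a double.
    static double parseMixed(String s) {
        // Split off the leading run of ASCII digits (the whole-number part).
        int i = 0;
        while (i < s.length() && s.charAt(i) >= '0' && s.charAt(i) <= '9') {
            i++;
        }
        double whole = (i == 0) ? 0.0 : Integer.parseInt(s.substring(0, i));
        String rest = s.substring(i);
        if (rest.isEmpty()) {
            return whole; // no fraction part at all
        }
        // NFKD turns a vulgar fraction into numerator, U+2044, denominator.
        String[] parts = Normalizer.normalize(rest, Normalizer.Form.NFKD).split("\u2044");
        if (parts.length != 2) {
            throw new NumberFormatException("Not a vulgar fraction: " + rest);
        }
        return whole + (double) Integer.parseInt(parts[0]) / Integer.parseInt(parts[1]);
    }
}
```

With this sketch, VulgarFractions.parseMixed("5\u00BC") evaluates to 5.25, and a plain "5" evaluates to 5.0.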

There are more fraction characters than the ones in the range U+00BC–U+00BE, for example ⅛ (U+215B), so if you want to avoid hard-coding that range I'd suggest changing that part of the regexp to something like [^\d]?, which matches an optional single non-digit character.
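Putting the pieces together, here is a rough sketch of the three-group variant mentioned in the question, using \D? (equivalent to [^\d]?) for the fraction. The class name HeightParser, the method toInches, and the choice to return total inches are assumptions made for illustration, not part of the original question or answer:

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeightParser {
    // Groups: feet, whole inches, optional single non-digit (the fraction).
    private static final Pattern HEIGHT =
            Pattern.compile("(\\d)' (\\d{1,2})(\\D?)\"");

    // Hypothetical helper: returns the height in inches.
    static double toInches(String text) {
        Matcher m = HEIGHT.matcher(text);
        if (!m.find()) {
            throw new IllegalArgumentException("No height found in: " + text);
        }
        double inches = Integer.parseInt(m.group(1)) * 12 + Integer.parseInt(m.group(2));
        String frac = m.group(3);
        if (!frac.isEmpty()) {
            // NFKD decomposes the fraction into numerator, U+2044, denominator.
            String[] parts = Normalizer.normalize(frac, Normalizer.Form.NFKD).split("\u2044");
            if (parts.length != 2) {
                // The non-digit was something other than a vulgar fraction.
                throw new NumberFormatException("Not a vulgar fraction: " + frac);
            }
            inches += (double) Integer.parseInt(parts[0]) / Integer.parseInt(parts[1]);
        }
        return inches;
    }
}
```

Because \D accepts any non-digit, the length check after the split is what rejects a stray character that is not a fraction. A stricter alternative would be to match only the Unicode "Other Number" category with \p{No}, at the cost of also admitting characters such as superscript digits.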

bobince answered Dec 30 '22 22:12
