Python

Question

I'm using lxml to scrape some text from web pages. Some of the text includes fractions, for example:

5½

I need to get this into a float format. These fail:

ugly_fraction.encode('utf-8')  #doesn't change to usable format
ugly_fraction.replace('\xbd', '')  #throws error
ugly_fraction.encode('utf-8').replace('\xbd', '')  #throws error

abarnert · Accepted Answer

Use unicodedata.numeric. From the docs:

Returns the numeric value assigned to the Unicode character unichr as float. If no such value is defined, default is returned, or, if not given, ValueError is raised.
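As a quick illustration of the behavior described in that quote, here is a minimal sketch. It assumes Python 3, where every str is Unicode (on Python 2 the same calls would need u'...' literals); the sample characters are just examples.

```python
import unicodedata

# Vulgar-fraction characters carry a numeric value in the Unicode database.
print(unicodedata.numeric("½"))   # 0.5
print(unicodedata.numeric("¾"))   # 0.75

# A character with no numeric value returns the default if one is given...
print(unicodedata.numeric("x", None))  # None

# ...and raises ValueError otherwise.
try:
    unicodedata.numeric("x")
except ValueError as exc:
    print("ValueError:", exc)

# More than one character is rejected with TypeError.
try:
    unicodedata.numeric("5½")
except TypeError as exc:
    print("TypeError:", exc)
```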

Note that it only handles a single character, not a string, so you still need to write the code that turns a "mixed fraction" made up of an integer and a fraction character into a float. But that's easy. You just need to come up with the rule for how mixed fractions are represented in your data. For example, if pure ints, pure fractions, and ints followed by a fraction with no space in between them are the only possibilities, this works (including raising some kind of reasonable exception for all invalid cases):

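import unicodedata

def parse_mixed_fraction(s):
    if s.isdigit():
        # Pure integer, e.g. "5"
        return float(s)
    elif len(s) == 1:
        # Single fraction character, e.g. "½"
        return unicodedata.numeric(s[-1])
    else:
        # Integer followed by a fraction character, e.g. "5½"
        return float(s[:-1]) + unicodedata.numeric(s[-1])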
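Scraped text often carries stray whitespace, so one possible extension is sketched below. It assumes Python 3 and that your data may put a space or non-breaking space between the integer and the fraction, which is a different rule from the one stated above. The name parse_scraped_fraction is hypothetical, not part of any library.

```python
import unicodedata


def parse_mixed_fraction(s):
    # Same rule as above: pure int, pure fraction, or int + fraction.
    if s.isdigit():
        return float(s)
    elif len(s) == 1:
        return unicodedata.numeric(s)
    else:
        return float(s[:-1]) + unicodedata.numeric(s[-1])


def parse_scraped_fraction(text):
    # Hypothetical helper: drop all whitespace (including non-breaking
    # spaces) before applying the rule, so "5 ½" is read like "5½".
    return parse_mixed_fraction("".join(text.split()))


for raw in ["5½", "½", "12", " 2 ¾\n"]:
    print(repr(raw), "->", parse_scraped_fraction(raw))

# Input that does not fit the rule raises ValueError.
try:
    parse_scraped_fraction("abc")
except ValueError as exc:
    print("ValueError:", exc)
```

If your pages never contain such whitespace, the plain function is enough and the wrapper can be dropped.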

Python - 'ascii' codec can't decode byte \xbd in position

Tags:

unicode

web-scraping

lxml

appleLover
