Python - 'ascii' codec can't decode byte \xbd in position

I'm using LXML to scrape some text from webpages. Some of the text includes fractions.

I need to get this into a float format. These fail:

ugly_fraction.encode('utf-8')  # doesn't change to a usable format
ugly_fraction.replace('\xbd', '')  # throws error
ugly_fraction.encode('utf-8').replace('\xbd', '')  # throws error
appleLover asked Jan 11 '23 20:01



1 Answer

Use unicodedata.numeric:

Returns the numeric value assigned to the Unicode character unichr as float. If no such value is defined, default is returned, or, if not given, ValueError is raised.
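As a quick illustration of that description, here is a minimal sketch of single-character lookups. The sample characters and the -1.0 fallback are chosen only for the example; the u'' prefixes are there so the literals are Unicode text on both Python 2 and Python 3.

```python
import unicodedata

# Each call takes exactly one Unicode character.
half = unicodedata.numeric(u'\u00bd')            # VULGAR FRACTION ONE HALF
three_quarters = unicodedata.numeric(u'\u00be')  # VULGAR FRACTION THREE QUARTERS
seven = unicodedata.numeric(u'7')                # ordinary digits work too

# A character with no numeric value returns the default when one is given
# (-1.0 is an arbitrary choice for this sketch); without it, ValueError is raised.
fallback = unicodedata.numeric(u'x', -1.0)

print(half, three_quarters, seven, fallback)
```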

Note that it only handles a single character, not a string, so you still need to write the code that turns a "mixed fraction" (an integer followed by a fraction character) into a float. That's easy, though: you just need to decide on the rule for how mixed fractions are represented in your data. For example, if pure ints, pure fractions, and ints followed by a fraction with no space in between are the only possibilities, this works (including raising some kind of reasonable exception for all invalid cases):

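import unicodedata

def parse_mixed_fraction(s):
    # s is expected to be Unicode text (e.g. u'1\xbd'), not encoded bytes
    if s.isdigit():
        # pure integer, e.g. u'3'
        return float(s)
    elif len(s) == 1:
        # lone fraction character, e.g. u'\xbd'
        return unicodedata.numeric(s[-1])
    else:
        # integer followed by a fraction character, e.g. u'1\xbd'
        return float(s[:-1]) + unicodedata.numeric(s[-1])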
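A short usage sketch may help show how the three branches behave. The function is repeated here only so the snippet runs on its own, and the sample inputs are made up for illustration rather than taken from the asker's scraped data.

```python
import unicodedata

def parse_mixed_fraction(s):
    # Same logic as the function in the answer, repeated for a standalone run.
    if s.isdigit():
        return float(s)
    elif len(s) == 1:
        return unicodedata.numeric(s[-1])
    else:
        return float(s[:-1]) + unicodedata.numeric(s[-1])

# Hypothetical sample inputs, one per supported shape.
whole = parse_mixed_fraction(u'3')        # 3.0
lone = parse_mixed_fraction(u'\u00bd')    # 0.5
mixed = parse_mixed_fraction(u'1\u00bd')  # 1.5

# Anything outside the three supported shapes raises ValueError.
try:
    parse_mixed_fraction(u'abc')
    rejected = False
except ValueError:
    rejected = True

print(whole, lone, mixed, rejected)
```

Input with other layouts, such as a space between the integer and the fraction character, falls outside the stated rule and would need its own handling.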
abarnert answered Jan 23 '23 02:01
