I'm using LXML to scrape some text from webpages. Some of the text includes fractions.
5½
I need to get this into a float format. These fail:
ugly_fraction.encode('utf-8') #doesn't change to usable format
ugly_fraction.replace('\xbd', '') #throws error
ugly_fraction.encode('utf-8').replace('\xbd', '') #throws error
Use unicodedata.numeric:
Returns the numeric value assigned to the Unicode character unichr as float. If no such value is defined, default is returned, or, if not given, ValueError is raised.
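As a quick illustration of that behavior, here is a minimal sketch. The variable names are only for illustration, and the `u''` literals assume Python 2.7 or Python 3.3+:

```python
import unicodedata

# Single vulgar-fraction characters map to their float value.
half = unicodedata.numeric(u'\u00bd')            # the character ½
three_quarters = unicodedata.numeric(u'\u00be')  # the character ¾

# A character with no numeric value returns the default, if one is given.
fallback = unicodedata.numeric(u'x', None)

print("%s %s %s" % (half, three_quarters, fallback))
```

Without the second argument, the last call would raise ValueError instead of returning None.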
Note that it only handles a single character, not a string, so you still need to write the code that turns a "mixed fraction" (an integer followed by a fraction character) into a float. That is easy once you settle on the rule for how mixed fractions are represented in your data. For example, if pure ints, pure fractions, and ints followed by a fraction with no space in between are the only possibilities, this works (malformed input generally raises an exception such as ValueError):
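import unicodedata

def parse_mixed_fraction(s):
    if s.isdigit():
        # Pure integer, e.g. u'5'
        return float(s)
    elif len(s) == 1:
        # Pure fraction character, e.g. u'\xbd'
        return unicodedata.numeric(s[-1])
    else:
        # Integer followed by one fraction character, e.g. u'5\xbd'
        return float(s[:-1]) + unicodedata.numeric(s[-1])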
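If your scraped text can also contain whitespace around the value or between the integer and the fraction (an assumption; the question does not say), a variant along these lines might help. The name `parse_mixed_fraction_loose` is hypothetical. Unlike the version above, it also rejects an integer part that is not made of plain digits, such as '5.5'.

```python
import unicodedata

def parse_mixed_fraction_loose(s):
    """Hypothetical variant that tolerates whitespace, e.g. u'5 \u00bd'."""
    s = s.strip()
    if not s:
        raise ValueError("empty string")
    if s.isdigit():
        return float(s)
    # Treat the final character as the fraction, the rest as the integer part.
    whole, frac = s[:-1].strip(), s[-1]
    if whole and not whole.isdigit():
        raise ValueError("unexpected integer part: %r" % whole)
    value = unicodedata.numeric(frac)  # ValueError if frac has no numeric value
    return float(whole) + value if whole else value

for text in [u'5\u00bd', u'5 \u00bd', u'\u00be', u'12']:
    print("%r -> %s" % (text, parse_mixed_fraction_loose(text)))
```

The strings must be unicode (the default in Python 3, `u''` literals or decoded text in Python 2), which is also why calling `.replace('\xbd', '')` on UTF-8 encoded bytes did not work in the question.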