
Parse ½ as 0.5 in Python 2.7

I am scraping this link with BeautifulSoup4.

I am parsing the page HTML like this:

page = BeautifulSoup(page.replace('ISO-8859-1', 'utf-8'),"html5lib")

The page contains values like these: -4 -115 (separated by -).

I want both values in a list, so I am using this regex:

value = re.findall(r'[+-]?\d+', value)

It works perfectly, except for values like +2½ -102, where I only get [-102].

To tackle this, I also tried the following:

value = value.replace("½","0.5")
value = re.findall(r'[+-]?\d+', value)

but this gives me an error about encoding, saying I have to set the encoding of my file.

I also tried setting encoding=utf-8 at the top of the file, but it still gives the same error.

How do I convert ½ to 0.5?

Umair Ayub asked Jan 26 '16 11:01




1 Answer

To embed Unicode literals like ½ in your Python 2 script you need to use a special comment at the top of your script that lets the interpreter know how the Unicode has been encoded. If you want to use UTF-8 you will also need to tell your editor to save the file as UTF-8. And if you want to print Unicode text make sure your terminal is set to use UTF-8, too.

Here's a short example, tested on Python 2.6.6:

# -*- coding: utf-8 -*-

value = "a string with fractions like 2½ in it"
value = value.replace("½",".5")
print(value)

output

a string with fractions like 2.5 in it

Note that I'm using ".5" as the replacement string; using "0.5" would convert "2½" to "20.5", which would not be correct.
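The difference between the two replacement strings can be checked directly. This is a minimal sketch; the helper name `replace_half` and the sample string are made up for illustration, and the `\u00bd` escape stands in for a literal ½ so the snippet does not depend on how the file is saved.

```python
# "\u00bd" is the escape sequence for the ½ character.
HALF = u"\u00bd"


def replace_half(text, replacement):
    """Hypothetical helper: swap every ½ in text for the given replacement."""
    return text.replace(HALF, replacement)


sample = u"+2" + HALF + u" -102"

good = replace_half(sample, u".5")   # the 2 keeps its place: +2.5
bad = replace_half(sample, u"0.5")   # the extra 0 is glued onto the 2: +20.5

print(good)
print(bad)
```

A bare ½ with no leading digit becomes ".5" under the first replacement, which `float()` still accepts.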


Actually, those strings should be marked as Unicode strings, like this:

# -*- coding: utf-8 -*-

value = u"a string with fractions like 2½ in it"
value = value.replace(u"½", u".5")
print(value)

For further information on using Unicode in Python, please see Pragmatic Unicode, which was written by SO veteran Ned Batchelder.
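If changing the file's encoding is inconvenient, one possible alternative (not part of the original answer) is to write the character as a Unicode escape. The sketch below assumes nothing beyond the standard string methods; because the source then contains only ASCII, no coding comment is required.

```python
# No coding declaration is needed: this source contains only ASCII characters.
# u"\u00bd" is the same character as the literal ½.
value = u"a string with fractions like 2\u00bd in it"
value = value.replace(u"\u00bd", u".5")
print(value)
```

The `u` prefix makes these Unicode strings in Python 2 and is accepted (as a no-op) by Python 3.3 and later.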


I should also mention that you will need to change your regex pattern so that it allows a decimal point in numbers, e.g.:

# -*- coding: utf-8 -*-
from __future__ import print_function
import re

pat = re.compile(r'[-+]?(?:\d*?[.])?\d+', re.U) 

data = u"+2½ -105 -2½ -115 +2½ -105 -2½ -115 +2½ -102 -2½ -114"
print(data)
print(pat.findall(data.replace(u"½", u".5")))

output

+2½ -105 -2½ -115 +2½ -105 -2½ -115 +2½ -102 -2½ -114
[u'+2.5', u'-105', u'-2.5', u'-115', u'+2.5', u'-105', u'-2.5', u'-115', u'+2.5', u'-102', u'-2.5', u'-114']
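The list above still contains strings. If numeric values are wanted, as the question title suggests, the matches can be passed through `float()`. This is only a sketch: `parse_values` is a hypothetical helper name, and it reuses the pattern from the example above with `\u00bd` as the escape for ½.

```python
import re

# Same pattern as in the example above: optional sign, optional "digits.", digits.
PAT = re.compile(r'[-+]?(?:\d*?[.])?\d+', re.U)


def parse_values(text):
    """Hypothetical helper: replace ½ with .5, then return the matches as floats."""
    cleaned = text.replace(u"\u00bd", u".5")
    return [float(token) for token in PAT.findall(cleaned)]


values = parse_values(u"+2\u00bd -102")
print(values)
```

Note that every match becomes a float, so whole numbers such as -102 come back as -102.0.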
PM 2Ring answered Sep 28 '22 09:09