I am reading a source that contains the special character ½. How do I convert it to 1/2? The character is part of a sentence, and I still need to be able to use the string "normally". I am reading webpage sources, so I'm not sure I will always know the encoding.
Edit: I have tried looking at other answers, but they don't work for me. They always seem to start with something like:
s = u'£10'
but I get an error right there: "no encoding declared". How do I know what encoding I'm getting, or does that not matter? Do I just pick one?
This is really two questions.
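#1. To interpret ½: Use the unicodedata module. You can ask for the numeric value of the character, or you can normalize it using a compatibility normalization form (NFKC or NFKD) and parse the result yourself. Note that the normalized string contains the fraction slash U+2044 ('⁄'), not the ASCII slash '/'.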
>>> import unicodedata
>>> unicodedata.numeric(u'½')
0.5
>>> unicodedata.normalize('NFKC', u'½')
'1⁄2'
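To get from the normalized form to a plain ASCII 1/2, one option is to normalize character by character and swap the fraction slash for an ordinary slash. The following is a minimal sketch, not part of the original answer; the helper name ascii_fractions is made up for illustration, and escapes such as \u00bd are used so the snippet does not depend on the source file's encoding.

```python
import unicodedata

FRACTION_SLASH = u'\u2044'  # the slash that NFKC produces for characters like U+00BD


def ascii_fractions(text):
    """Replace vulgar-fraction characters (e.g. U+00BD) with ASCII forms like 1/2.

    Hypothetical helper for illustration. Characters whose NFKC form contains
    no fraction slash are copied through unchanged.
    """
    out = []
    for ch in text:
        decomposed = unicodedata.normalize('NFKC', ch)
        if FRACTION_SLASH in decomposed:
            # A bare U+2044 in the input is also turned into '/'.
            out.append(decomposed.replace(FRACTION_SLASH, u'/'))
        else:
            out.append(ch)
    return u''.join(out)


print(ascii_fractions(u'Add \u00bd cup of sugar'))
```

One caveat: a mixed number such as 1½ comes out as 11/2, so you may want to insert a space before the fraction if your text contains those.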
#2. Encoding problems: If you're working with the terminal, make sure Python knows the terminal encoding. If you're writing source files, make sure Python knows the file encoding. You can't just "pick" an encoding to set for Python; you must tell Python which encoding your terminal or text editor already uses.
Python lets you set the encoding of files with Vim/Emacs style comments. Put a comment at the top of the file like this if you use Vim:
# coding=UTF-8
Or this, if you use Emacs:
# -*- coding: UTF-8 -*-
If you use neither Vim nor Emacs, then it doesn't matter which one. Obviously, if you don't use UTF-8 you should substitute the encoding you actually use. (UTF-8 is the only encoding I can recommend.)
Dietrich beat me to the punch, but here is some more detail about setting the encoding for your source file:
Because you want to search for a literal Unicode ½, you need to be able to write it in your source file. Unfortunately, Python 2 rejects any non-ASCII character in a source file unless you declare the file's encoding with a comment on the first or second line, like so (Python 3 assumes UTF-8 by default):
# coding=utf8
# ... do stuff here ...
This assumes your editor is saving the file as UTF-8. If it's using a different encoding, specify that instead. See PEP 263 for more details.
Once you've specified the encoding, you should be able to write something like this in your code:
text = text.replace(u'½', u'1/2')
Depending on how you are downloading the page, you probably don't need to worry about the page's encoding at all: most HTTP libraries choose the encoding for you automatically.
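If your library hands you raw bytes instead, you can decode them yourself using the charset named in the response's Content-Type header. This is only a sketch under a few assumptions: it targets Python 3, the function name decode_body and the UTF-8 fallback are my own choices, and it ignores any charset declared in a <meta> tag inside the page, which real pages sometimes rely on.

```python
from email.message import Message


def decode_body(raw, content_type, fallback='utf-8'):
    """Decode response bytes using the charset from a Content-Type header value.

    Hypothetical helper for illustration. Falls back to `fallback` when the
    header names no charset, an unknown charset, or one that fails to decode.
    """
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_content_charset() or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: decode with the fallback, marking bad bytes.
        return raw.decode(fallback, errors='replace')


page = decode_body(b'Add \xc2\xbd cup', 'text/html; charset=UTF-8')
print(page.replace(u'\u00bd', u'1/2'))
```

Once the bytes are decoded to a Unicode string, the replace call works the same way regardless of which encoding the page used.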