How to work with unicode in Python

I am trying to clean all of the HTML out of a string so the final output is a text file. I have done some research on the various 'converters' and am starting to lean towards creating my own dictionary for the entities and symbols and running a replace on the string. I am considering this because I want to automate the process and there is a lot of variability in the quality of the underlying HTML. To begin comparing the speed of my solution with one of the alternatives, for example pyparsing, I decided to test replacing \xa0 using the string method replace. I get a

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

The actual line of code was

s=unicodestring.replace('\xa0','')

Anyway, I decided that I needed to preface it with an r, so I ran this line of code:

s=unicodestring.replace(r'\xa0','')

It runs without error, but when I look at a slice of s I see that the \xa0 is still there.

PyNEwbie asked Apr 15 '09 18:04



1 Answer

Maybe you should be doing:

s=unicodestring.replace(u'\xa0',u'')
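
To illustrate why the u prefix matters, here is a minimal sketch, assuming Python 3 syntax (where the u prefix is accepted but optional) and a made-up sample string. The name unicodestring comes from the question; its contents, and the names unchanged and decode_error, are hypothetical. In Python 2, '\xa0' is a byte string, so replace first tries to decode it as ASCII, which is presumably where the UnicodeDecodeError comes from; the explicit decode below mimics that step. The raw literal r'\xa0' is four separate characters (backslash, x, a, 0), so it never matches the single character U+00A0.

```python
# Hypothetical sample text; u'\xa0' is the single character U+00A0 (no-break space).
unicodestring = u'price:\xa0100\xa0USD'

# Unicode pattern: matches the one-character no-break space and removes it.
s = unicodestring.replace(u'\xa0', u'')

# Raw pattern: four literal characters that do not occur in the text,
# so the string comes back unchanged.
unchanged = unicodestring.replace(r'\xa0', u'')

# Mimic the implicit step Python 2 takes when a byte string meets unicode:
# decoding the byte 0xa0 as ASCII fails because it is outside range(128).
try:
    b'\xa0'.decode('ascii')
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = str(exc)

print(repr(s))
print(len(r'\xa0'), unchanged == unicodestring)
print(decode_error)
```

If the no-break spaces should become ordinary spaces rather than disappear, passing u' ' as the second argument is a small variation on the same call.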
z33m answered Sep 30 '22 15:09