How to make the Python interpreter correctly handle non-ASCII characters in string operations?

Tags:

python

unicode

I have a string that looks like so:

6Â 918Â 417Â 712

The clear-cut way to trim this string (as I understand Python) is simple. Assuming the string is in a variable called s, we write:

s.replace('Â ', '') 

That should do the trick. But of course it complains that the non-ASCII character '\xc2' in file blabla.py is not encoded.

I never quite could understand how to switch between different encodings.

Here's the code. It really is the same as above, but now it's in context. The file is saved as UTF-8 in Notepad and has the following header:

#!/usr/bin/python2.4
# -*- coding: utf-8 -*-

The code:

f = urllib.urlopen(url)
soup = BeautifulSoup(f)
s = soup.find('div', {'id':'main_count'})
# printing 's' here goes well; it shows 6Â 918Â 417Â 712
s.replace('Â ','')
save_main_count(s)

It gets no further than s.replace...

Asked by adergaard on Aug 27 '09 at 15:08


1 Answer

Throw out all characters that can't be interpreted as ASCII:

def remove_non_ascii(s):
    return "".join(c for c in s if ord(c) < 128)

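Keep in mind that this is guaranteed to work with UTF-8 encoded byte strings: every byte of a multi-byte UTF-8 sequence has its highest bit set to 1, so filtering on ord(c) < 128 removes each non-ASCII character completely and leaves the ASCII characters untouched. Note that the non-ASCII characters are discarded, not converted.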
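As a rough illustration of the point above, here is a minimal sketch. It assumes Python 3 (the question targets Python 2.4, where str holds bytes), and it assumes the separator in the scraped number is a non-breaking space (U+00A0), whose UTF-8 bytes C2 A0 show up as "Â" plus a space when misread as Latin-1. The helper remove_non_ascii_bytes and the variable names are hypothetical, added only for the demonstration.

```python
# Assumption: Python 3, where str is Unicode text and bytes is raw data.
NBSP = "\u00a0"  # non-breaking space, assumed to be the separator

text = "6" + NBSP + "918" + NBSP + "417" + NBSP + "712"
raw = text.encode("utf-8")  # each NBSP becomes the two bytes C2 A0


def remove_non_ascii(s):
    """Drop every character whose code point is 128 or above."""
    return "".join(c for c in s if ord(c) < 128)


def remove_non_ascii_bytes(b):
    """Byte-level variant (hypothetical helper): drop bytes with the high bit set."""
    return bytes(x for x in b if x < 128)


# How the UTF-8 bytes look when they are wrongly decoded as Latin-1.
mojibake = raw.decode("latin-1")

print(repr(mojibake))
print(remove_non_ascii(text))
print(remove_non_ascii_bytes(raw))
```

If only the separator should go and other non-ASCII characters must survive, a narrower option under the same assumptions is text.replace(NBSP, ""), which removes just the non-breaking spaces.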

Answered by fortran on Sep 26 '22 at 12:09