How to make the Python interpreter correctly handle non-ASCII characters in string operations?

Tags: python, unicode

I have a string that looks like so:

6Â 918Â 417Â 712

The clear-cut way to trim this string (as I understand Python) is simple. Say the string is in a variable called s; then we write:

s.replace('Â ', '')

That should do the trick. But of course it complains that the non-ASCII character '\xc2' in file blabla.py is not encoded.

I could never quite understand how to switch between different encodings.

Here's the code. It really is just the same as above, but now it's in context. The file is saved as UTF-8 in Notepad and has the following header:

    #!/usr/bin/python2.4
    # -*- coding: utf-8 -*-

The code:

    f = urllib.urlopen(url)
    soup = BeautifulSoup(f)
    s = soup.find('div', {'id':'main_count'})
    # making a print 's' here goes well. it shows 6Â 918Â 417Â 712
    s.replace('Â ','')
    save_main_count(s)

It gets no further than s.replace...

asked Aug 27 '09 15:08

adergaard

1 Answer

Throw out all characters that can't be interpreted as ASCII:

    def remove_non_ascii(s):
        return "".join(c for c in s if ord(c) < 128)

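Keep in mind that this is guaranteed to work on UTF-8 encoded byte strings: every byte of a multi-byte character has its highest bit set to 1, so all of those bytes are removed and no stray fragments are left behind. It does, however, drop every non-ASCII character, not just the unwanted one.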
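For illustration, here is a minimal sketch of the same idea written for Python 3, next to a narrower alternative. It assumes that the 'Â ' seen in the output is the two-byte UTF-8 encoding of a no-break space (U+00A0), which is what the '\xc2' in the error message suggests; the sample bytes in `raw` and the helper `remove_nbsp` are made up for this example and are not part of the original code.

```python
# -*- coding: utf-8 -*-

def remove_non_ascii(text):
    """Drop every character whose code point is outside the ASCII range."""
    return "".join(c for c in text if ord(c) < 128)


def remove_nbsp(text):
    """Hypothetical helper: remove only U+00A0, keep other non-ASCII text."""
    return text.replace(u"\u00a0", u"")


# Sample bytes, assumed to be UTF-8 (0xC2 0xA0 encodes a no-break space).
raw = b"6\xc2\xa0918\xc2\xa0417\xc2\xa0712"

# Decode first, so each character is handled as one unit rather than two bytes.
text = raw.decode("utf-8")

print(remove_non_ascii(text))  # 6918417712
print(remove_nbsp(text))       # 6918417712

# The two approaches differ as soon as other non-ASCII characters appear.
mixed = u"caf\u00e9 6\u00a0918"
ascii_only = remove_non_ascii(mixed)  # the accented letter is lost too
nbsp_only = remove_nbsp(mixed)        # the accented letter is kept
```

If the input can contain accented letters or other text worth keeping, the targeted replacement is probably the safer choice. Under Python 2 the same approach should apply, with the decoded value being a unicode object rather than a str.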

answered Sep 26 '22 12:09

fortran
