How to remove \xa0 from string in Python?

I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but the result is left with a lot of \xa0 Unicode characters representing spaces. Is there an efficient way to replace all of them with ordinary spaces in Python 2.7? I guess the more generalized question would be: is there a way to remove Unicode formatting?

I tried using: line = line.replace(u'\xa0',' '), as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):

EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?

332

asked Jun 12 '12 09:06

zhuyxn

1 Answer

\xa0 is actually a non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a space.

string = string.replace(u'\xa0', u' ')
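The replacement above can be wrapped in a small helper. This is a minimal sketch written for Python 3, where every str is already Unicode; in Python 2.7 the literals would need the u'' prefix as shown above. The names NBSP, replace_nbsp and the sample text are hypothetical, chosen only for illustration.

```python
# Python 3 sketch; NBSP, replace_nbsp and the sample text are illustrative names.
NBSP = "\xa0"  # U+00A0 NO-BREAK SPACE


def replace_nbsp(text):
    """Return text with every non-breaking space swapped for a plain space."""
    return text.replace(NBSP, " ")


# Hypothetical text of the kind get_text() might return.
sample = "Price:\xa0100\xa0USD"
cleaned = replace_nbsp(sample)
print(repr(cleaned))
```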

Calling .encode('utf-8') encodes the Unicode string as UTF-8, in which every code point is represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0, which is where the \xc2 comes from.
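The byte sequences can be checked directly. The sketch below assumes Python 3 (in Python 2.7 the string literals would need the u'' prefix); the variable names and the extra sample characters are illustrative only.

```python
# Python 3 sketch: how U+00A0 looks once encoded.
nbsp = "\xa0"

utf8_bytes = nbsp.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = nbsp.encode("latin-1")  # a single byte in ISO 8859-1

print(utf8_bytes)
print(latin1_bytes)

# UTF-8 uses between 1 and 4 bytes per code point.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in ["a", "\xa0", "\u20ac", "\U0001F600"]}
print(byte_lengths)

# Decoding with the same codec restores the original character.
round_trip = utf8_bytes.decode("utf-8")
```

The \xc2 byte is therefore not a new character; it is the first half of the UTF-8 encoding of the non-breaking space, which stays in the data unless the character is replaced before encoding.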

Read up on http://docs.python.org/howto/unicode.html.

Please note: this answer is from 2012 and Python has moved on; you should be able to use unicodedata.normalize now.
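A minimal sketch of the unicodedata.normalize approach, assuming Python 3 and the NFKC form (the note above does not say which form to use, so that choice is an assumption). The helper name normalize_text and the sample strings are hypothetical.

```python
import unicodedata


def normalize_text(text):
    """Apply NFKC compatibility normalization (the form is an assumed choice)."""
    return unicodedata.normalize("NFKC", text)


# The non-breaking space has a compatibility mapping to an ordinary space.
normalized = normalize_text("Price:\xa0100\xa0USD")
print(repr(normalized))

# Caveat: compatibility normalization rewrites other characters as well,
# for example a superscript two becomes a plain digit.
also_changed = normalize_text("x\u00b2")
print(repr(also_changed))
```

Because normalization touches more than the non-breaking space, the plain replace() call may be the safer choice when only \xa0 should change.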

170

answered Oct 03 '22 11:10

samwize

Tags: python, unicode, beautifulsoup, utf-8, python-2.7
