Why am I getting the "UnicodeEncodeError: 'charmap' codec can't encode character '\u25b2' in position 84811: character maps to <undefined>" error?

I'm getting UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 756: character maps to <undefined> while running this code:

from bs4 import BeautifulSoup
import requests
r = requests.get('https://stackoverflow.com').text
soup = BeautifulSoup(r, 'lxml')
print(soup.prettify())

and the output is:

Traceback (most recent call last):
  File "c:\Users\Asus\Documents\Hello World\Web Scraping\st.py", line 5, in <module>
    print(soup.prettify())
  File "C:\Users\Asus\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 756: character maps to <undefined>

I'm using Python 3.8.1 and UTF-8 in VS Code. How can I solve this?

Asir Shahriar Roudra asked Jun 30 '20 11:06


2 Answers

The full error message contains hints. Here are the parts that seem most important:

Traceback ...
  File "...\cp1252.py", ...
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' ...

The error is caused by the print call. Somewhere in your text there is a ZERO WIDTH SPACE character (Unicode U+200B). When you print to a Windows console, the string is encoded to the console code page (cp1252 here), and ZERO WIDTH SPACE cannot be represented in that code page. By the way, the default Windows console is not really Unicode friendly.
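
As a minimal illustration of this cause (the sample string is made up for the demonstration), the same failure can be reproduced without any console or web request, simply by encoding a string that contains U+200B to cp1252:

```python
# Hypothetical sample text containing a ZERO WIDTH SPACE (U+200B).
sample = "foo\u200bbar"

# UTF-8 can represent every Unicode character, so this succeeds.
utf8_bytes = sample.encode("utf-8")

# cp1252 has no mapping for U+200B, so this raises the same kind of
# error that print() triggers when stdout is encoded as cp1252.
try:
    sample.encode("cp1252")
    cp1252_failed = False
    error_message = ""
except UnicodeEncodeError as exc:
    cp1252_failed = True
    error_message = str(exc)

print(cp1252_failed)
print(error_message)
```

This suggests the problem lies in the target encoding rather than in BeautifulSoup or requests.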

There is little to do in a Windows console. I would advise you to try one of these workarounds:

  • Do not print to the console; write to a UTF-8 file instead. You can then read it with a UTF-8 capable text editor such as Notepad++.

  • Manually encode the text before printing it, with errors='ignore' or errors='replace', then decode it back so that print receives a str rather than bytes. Characters that cp1252 cannot represent are dropped (or replaced with '?'), so no error is raised:

      print(soup.prettify().encode('cp1252', errors='ignore').decode('cp1252'))
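
Both workarounds can be sketched as follows. This is only an illustration under assumptions: the `text` string stands in for `soup.prettify()`, and the output file name `page_output.html` is hypothetical.

```python
import os
import tempfile

# Hypothetical stand-in for soup.prettify(); contains U+200B and U+25B2.
text = "<p>zero\u200bwidth \u25b2</p>"

# Workaround 1: write to a UTF-8 file instead of printing.
# The file name and location are assumptions for this sketch.
out_path = os.path.join(tempfile.gettempdir(), "page_output.html")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(text)

# Workaround 2: drop or replace characters that cp1252 cannot encode,
# then decode back to str so print() shows text, not a bytes repr.
safe_replace = text.encode("cp1252", errors="replace").decode("cp1252")
safe_ignore = text.encode("cp1252", errors="ignore").decode("cp1252")

print(safe_replace)
print(safe_ignore)
```

The file keeps every character intact, while the second approach loses the characters it cannot encode, so it is better suited to quick inspection than to saving data.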
    
Serge Ballesta answered Oct 18 '22 05:10


You can explore a little on your own, but in Python 2.7 what I usually do to clean my text is:

text = text.encode('utf-8').decode('ascii', 'ignore')

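In Python 3, str(text) does not strip anything, because str is already Unicode. The same cleaning can be written as:

text = text.encode('ascii', 'ignore').decode('ascii')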

For your case, try this:

r = requests.get('https://stackoverflow.com').text.encode('utf8').decode('ascii', 'ignore')

Otherwise, the usual approach is:

from bs4 import BeautifulSoup
import requests
r = requests.get('https://stackoverflow.com')
soup = BeautifulSoup(r.content, 'lxml')
print(soup)

(I don't think this should give any error.)
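
One caveat about the ASCII cleaning shown above: it removes every non-ASCII character, not only the ones the console cannot display. A small sketch with a made-up sample string shows the difference from encoding to cp1252 with errors='ignore':

```python
# Hypothetical sample: legitimate accented letters plus a ZERO WIDTH SPACE.
text = "caf\u00e9\u200b na\u00efve"

# ASCII cleaning: every byte outside the ASCII range is dropped,
# so the accented letters disappear along with U+200B.
ascii_cleaned = text.encode("utf-8").decode("ascii", "ignore")

# cp1252 cleaning: only characters missing from cp1252 are dropped,
# so the accented letters survive.
cp1252_cleaned = text.encode("cp1252", errors="ignore").decode("cp1252")

# ascii() is used so the output is printable on any console.
print(ascii(ascii_cleaned))
print(ascii(cp1252_cleaned))
```

If the page content matters beyond plain English text, the cp1252 variant or a UTF-8 file is likely the safer choice.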

vaibhav s answered Oct 18 '22 05:10