I need to get all the text from a page using BeautifulSoup. The BeautifulSoup documentation says that soup.get_text() does this. When I tried it on reddit.com, I got this error:
UnicodeEncodeError in soup.py:16
'cp932' codec can't encode character u'\xa0' in position 2262: illegal multibyte sequence
I get errors like that on most of the sites I checked. I got similar errors from soup.prettify() too, but I fixed those by changing it to soup.prettify('UTF-8'). Is there any way to fix this? Thanks in advance!
Update June 24
I've found a bit of code that seems to work for other people, but I still need to use UTF-8 instead of the default. Code:
import re

texts = soup.findAll(text=True)

def visible(element):
    # Skip text inside tags that are not rendered on the page
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments
    elif re.match('<!--.*-->', str(element)):
        return False
    elif re.match('\n', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)
print visible_texts
Error is different, though. Progress?
UnicodeEncodeError in soup.py:29
'ascii' codec can't encode character u'\xbb' in position 1: ordinal not in range(128)
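soup.get_text() returns a Unicode string. The error occurs when that string is printed: Python has to encode it for the console, and the console's codec (cp932 in your first traceback, ascii in the second) cannot represent characters such as u'\xa0' or u'\xbb'.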
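To see why the codec matters, here is a minimal sketch that tries to encode the two characters from the tracebacks with the codecs named there, and then with UTF-8. The helper name can_encode is made up for this illustration and is not part of BeautifulSoup or the standard library.

```python
from __future__ import print_function

# (character, codec) pairs taken from the two tracebacks in the question
samples = [
    (u"\xa0", "cp932"),  # no-break space
    (u"\xbb", "ascii"),  # right-pointing double angle quotation mark
]

def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for char, codec in samples:
    # Each character fails with the console codec but succeeds with UTF-8
    print(repr(char), codec, can_encode(char, codec), "utf-8", can_encode(char, "utf-8"))
```

Both characters are rejected by the console codec and accepted by UTF-8, which is why switching the output encoding makes the error go away.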
You can solve this in a number of ways, including setting the output encoding at the shell level:
export PYTHONIOENCODING=UTF-8
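In Python 2 you can also reload sys and set the default encoding in your script. This is generally considered a hack, and sys.setdefaultencoding does not exist in Python 3.
import sys

if __name__ == "__main__":
    # reload() restores setdefaultencoding, which is removed from sys at startup
    reload(sys)
    sys.setdefaultencoding("utf-8")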
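Or you can encode the string as UTF-8 in code. For your reddit problem, something like the following would work:
import urllib
from bs4 import BeautifulSoup

url = "https://www.reddit.com/r/python"
# Python 2 API; in Python 3 this is urllib.request.urlopen
html = urllib.urlopen(url).read()
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, "html.parser")

# get text
text = soup.get_text()
# Encode explicitly so the console codec is never used
print(text.encode('utf-8'))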
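If the terminal really cannot display UTF-8, another option is to encode for whatever codec the console reports and substitute a placeholder for characters it cannot represent. The following is only a sketch under that assumption; encode_for_console is a hypothetical helper name, and the sample string stands in for the output of soup.get_text().

```python
from __future__ import print_function
import sys

def encode_for_console(text, encoding=None):
    """Encode text for output, substituting '?' for unencodable characters."""
    # Fall back to UTF-8 when stdout reports no encoding (e.g. piped output in Python 2)
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # The "replace" error handler swaps each unencodable character for "?"
    return text.encode(encoding, "replace")

# Stand-in for soup.get_text(); contains the two characters from the tracebacks
sample = u"reddit\xa0\xbb python"
print(encode_for_console(sample, "ascii"))
```

This loses the characters the console cannot show, so it suits quick inspection rather than saving the text; for storage, encoding as UTF-8 as above keeps everything.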