So I parsed an HTML page with .findAll (BeautifulSoup) into a variable named result.
If I type result in the Python shell and press Enter, I see normal text as expected. But when I tried to post-process this result as a string object, I noticed that str(result) returns garbage, like this sample:
\xd1\x87\xd0\xb8\xd0\xbb\xd0\xbd\xd0\xb8\xd1\x86\xd0\xb0</a><br />\n<hr />\n</div>
The HTML page source is UTF-8 encoded.
How can I handle this?
Code is basically this, in case it matters:
import urllib
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib.urlopen(url).read())
result = soup.findAll(something)
Python is 2.7
With Python 2.6.7 and BeautifulSoup version 3.2.0, this worked for me:
unicode.join(u'\n',map(unicode,result))
I'm pretty sure result is a BeautifulSoup.ResultSet object, which seems to be an extension of the standard Python list.
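Since the result behaves like a list, each element can be converted on its own and the pieces joined, which is the idea behind the unicode.join line. Here is a minimal sketch of that idea in Python 3, without BeautifulSoup: bytes objects stand in for Python 2 byte strings, RAW is copied from the escaped sample in the question, and join_as_text and items are hypothetical names used only for illustration. It also hints that the escapes may just be the repr() view of otherwise valid UTF-8 data.

```python
# Python 3 sketch; bytes objects stand in for Python 2 byte strings.
# RAW copies the escaped bytes from the sample output in the question.
RAW = b"\xd1\x87\xd0\xb8\xd0\xbb\xd0\xbd\xd0\xb8\xd1\x86\xd0\xb0"


def join_as_text(items, encoding="utf-8"):
    """Decode each element separately, then join the pieces with newlines."""
    return "\n".join(item.decode(encoding) for item in items)


items = [RAW, RAW]              # hypothetical stand-in for a result set
escaped_view = repr(RAW)        # backslash-escaped form, like the sample
readable = join_as_text(items)  # decoded text, one element per line
```

Here escaped_view holds the same kind of \xd1\x87 escapes seen in the question, while readable holds the decoded Cyrillic text, so the bytes themselves are intact and only need decoding with the page's charset.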
import urllib
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib.urlopen(url).read())
# findAll returns multiple parsed results
result = soup.findAll(something)
# iterate over the results
for line in result:
    # get the str value of each item; replace 'utf-8' with whatever charset you need
    print line.__str__('utf-8')
BTW: the BeautifulSoup version is 3.2.1.