I am trying to grab some text from HTML documents with BeautifulSoup. In a case that is very relevant for me, it produces a strange and interesting result: after a certain point, the soup is full of extra spaces within the text (a space separates every letter from the following one). I searched the web for an explanation, but I only found reports of the opposite bug (no spaces at all).
Do you have any suggestion or hint on why this happens, and how to solve the problem?
This is the very basic code that I created:
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen("http://www.beppegrillo.it")
prova = html.read()
soup = BeautifulSoup(prova)
print soup
And this is a line taken from the results, the line where the problem starts to appear:
value=\"Giuseppe labbate ogm? non vorremmo nuovi uccelli chiamati lontre\"><input onmouseover=\"Tip('<cen t e r c l a s s = \ \ ' t i t l e _ v i d e o \ \ ' > < b > G i u s e p p e l a b b a t e o g m ? n o n v o r r e m m o n u o v i u c c e l l i c h i a m a t i l o n t r e <
I believe this is a bug in lxml's HTML parser. Try:
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen("http://www.beppegrillo.it")
prova = html.read()
soup = BeautifulSoup(prova.replace('ISO-8859-1', 'utf-8'))  # rewrite the declared charset so the parser does not mangle the text
print soup
This is a workaround for the problem. I believe the issue was fixed in lxml 3.0 alpha 2 and lxml 2.3.6, so it is worth checking whether you need to upgrade to a newer version.
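If you are not sure which lxml you have installed, something like this (a minimal sketch, assuming the same Python 2 setup as in your snippet) prints the version; you can also sidestep lxml entirely by asking BeautifulSoup for its bundled pure-Python parser:
import lxml.etree
print lxml.etree.LXML_VERSION  # version tuple, e.g. (2, 3, 5, 0); the answer above says the fix landed in 2.3.6 / 3.0 alpha 2

# Or avoid lxml altogether and use the pure-Python parser that ships with bs4:
soup = BeautifulSoup(prova, 'html.parser')
print soup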
If you want more info on the bug, it was initially filed here:
https://bugs.launchpad.net/beautifulsoup/+bug/972466
Hope this helps,
Hayden