I am trying to grab some text from HTML documents with BeautifulSoup. In a case that is very relevant for me, it produces a strange and interesting result: after a certain point, the soup is full of extra spaces within the text (a space separates every letter from the following one). I searched the web for an explanation, but I only found reports of the opposite bug (no spaces at all).
Do you have any suggestion or hint on why this happens, and how to solve the problem?
This is the very basic code that I created:
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen("http://www.beppegrillo.it")
prova = html.read()
soup = BeautifulSoup(prova)
print soup
And this is a line taken from the results, the line where the problem starts to appear:
value=\"Giuseppe labbate ogm? non vorremmo nuovi uccelli chiamati lontre\"><input onmouseover=\"Tip('<cen t e r c l a s s = \ \ ' t i t l e _ v i d e o \ \ ' > < b > G i u s e p p e l a b b a t e o g m ? n o n v o r r e m m o n u o v i u c c e l l i c h i a m a t i l o n t r e <
I believe this is a bug in lxml's HTML parser. Try:
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen("http://www.beppegrillo.it")
prova = html.read()
soup = BeautifulSoup(prova.replace('ISO-8859-1', 'utf-8'))  # rewrite the declared charset so the parser does not mangle the text
print soup
This is a workaround for the problem. I believe the issue was fixed in lxml 3.0 alpha 2 and lxml 2.3.6, so it is worth checking whether you need to upgrade to a newer version.
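If you are not sure which lxml you have installed, something like this (a minimal sketch, assuming the same Python 2 setup as in your snippet) prints the version; you can also sidestep lxml entirely by asking BeautifulSoup for its bundled pure-Python parser:
import lxml.etree
print lxml.etree.LXML_VERSION  # version tuple, e.g. (2, 3, 5, 0); the answer above says the fix landed in 2.3.6 / 3.0 alpha 2

# Or avoid lxml altogether and use the pure-Python parser that ships with bs4:
soup = BeautifulSoup(prova, 'html.parser')
print soup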
If you want more info on the bug, it was initially filed here:
https://bugs.launchpad.net/beautifulsoup/+bug/972466
Hope this helps,
Hayden