I'm trying to parse a website whose content I'm going to use later in my Django project. To do that, I'm using urllib2 and BeautifulSoup4. However, I can't get what I want: the output of the BeautifulSoup object is weird. When I tried different pages, it worked (the output was normal), so I thought the page itself was the cause. But when my friend tried the same thing, he got normal output. I haven't been able to figure out the problem.
This is the website I'm going to parse.
This is an example of the weird output after the command "soup.prettify()":
t d B G C O L O R = " # 9 9 0 4 0 4 " w i d t h = " 3 " > i m g S R C = " 1 p . g i f " A L T B O R D E R = " 0 " h e i g h t = " 1 " w i d t h = " 3 " > / t d > \n / t r > \n t r > \n t d c o l s p a n = " 3 " B G C O L O R = " # 9 9 0 4 0 4 " w i d t h = " 6 0 0 " h e i g h t = " 3 " > i m g s r c = " 1 p . g i f " w i d t h = " 6 0 0 " \n h e i g h t = " 1 " > / t d > \n / t r > \n / t a b l e > \n / c e n t e r > / d i v > \n \n p > &n b s p ; &n b s p ; &n b s p ; &n b s p ; / p > \n / b o d y > \n / h t m l >\n </p>\n </body>\n</html>'
Here is a minimal example that does work for me, including the snippet of HTML that you have a problem with. It's hard to tell without your code, but my guess is that you did something like ' '.join(A) on the page source somewhere, which puts a space between every character.
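To make that guess concrete, here is a small sketch of how joining a string with spaces changes it. The html value is a made-up stand-in for the page source, not the real page, and the variable names are only for illustration. Joining the result of split() merely collapses runs of whitespace, while joining the string itself spaces out every character, which resembles the output in the question. The snippet should run on both Python 2 and Python 3.

```python
# Hypothetical stand-in for a fragment of the downloaded page source.
html = '<td  width="3"></td>'

# split() breaks on whitespace, so joining the pieces only collapses
# the double space into a single one.
collapsed = ' '.join(html.split())

# Joining the string itself iterates over its characters, so a space
# ends up between every pair of neighbouring characters.
spaced = ' '.join(html)

print(collapsed)
print(spaced)
```

If the text handed to BeautifulSoup already looks like the second line printed here, the spacing was introduced before parsing rather than by prettify().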
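import urllib2
import bs4

url = "http://kafemud.bilkent.edu.tr/monu_tr.html"
# Download the page (Python 2) and keep the bytes exactly as received.
response = urllib2.urlopen(url)
raw = response.read()
# Pass the raw bytes straight to BeautifulSoup, with no splitting or joining in between.
soup = bs4.BeautifulSoup(raw)
# prettify() returns unicode, so encode it before printing.
print soup.prettify().encode('utf-8')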
Giving:
....
<td bgcolor="#990404" width="3">
<img alt="" border="0" src="1p.gif" width="3"/>
</td>
<td bgcolor="#FFFFFF" valign="TOP">
<div align="left">
<table align="left" border="0" cellpadding="10" cellspacing="0" valign="TOP" width="594">
<tr>
<td align="left" valign="top">
<table align="left" border="0" cellpadding="0" cellspacing="0" class="icerik" width="574">
....