Here is my code:
html = '''<img onload='javascript:if(this.width>950) this.width=950'
src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">'''
soup = BeautifulSoup(html)
imgs = soup.findAll('img')
print imgs[0].attrs
It will print [(u'onload', u'javascript:if(this.width>950) this.width=950')]
So where is the src attribute?
If I replace html with something like html = '''<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />'''
I get the correct result: [(u'src', u'/image/fluffybunny.jpg'), (u'title', u'Harvey the bunny'), (u'alt', u'a cute little fluffy bunny')]
I am quite new to HTML and BeautifulSoup. Am I missing something? Thanks for any ideas.
I tested this with both BeautifulSoup 3 and BeautifulSoup 4, and noticed that bs4 (version 4) seems to fix up your HTML better than version 3.
With BeautifulSoup 3:
>>> html = """<img onload='javascript:if(this.width>950) this.width=950' src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">"""
>>> soup = BeautifulSoup(html) # Version 3 of BeautifulSoup
>>> print soup
<img onload="javascript:if(this.width&gt;950) this.width=950" />950) this.width=950' src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">
Notice how > is now &gt; inside the onload value, and some bits are out of place.
Also, BeautifulSoup 3 appears to end the img tag at the first >, so everything after it, including src, is left as plain text outside the tag. If you were to print soup.img, you would get:
<img onload="javascript:if(this.width&gt;950) this.width=950" />
And so you would miss details.
With bs4 (BeautifulSoup 4, the current version):
>>> html = '''<img onload='javascript:if(this.width>950) this.width=950' src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">'''
>>> soup = BeautifulSoup(html)
>>> print soup
<html><body><img onload="javascript:if(this.width>950) this.width=950" src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg"/></body></html>
Now with .attrs: in BeautifulSoup 3 it returns a list of tuples, as you have discovered. In BeautifulSoup 4 it returns a dictionary:
>>> print soup.findAll('img')[0].attrs # Version 3
[(u'onload', u'javascript:if(this.width>950) this.width=950')]
>>> print soup.findAll('img')[0].attrs # Version 4
{'onload': 'javascript:if(this.width>950) this.width=950', 'src': 'http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg'}
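If the same script has to cope with either version, the two shapes of .attrs can be normalized with dict(), which accepts both a list of (name, value) pairs and an existing dictionary. This is only a minimal sketch: the helper name attrs_as_dict is made up for illustration, and the sample values are copied from the two outputs above rather than produced by a live parse.

```python
def attrs_as_dict(attrs):
    """Return tag attributes as a dict, whichever shape they arrive in.

    BeautifulSoup 3 gives a list of (name, value) tuples; BeautifulSoup 4
    gives a dict. dict() accepts both and returns a new dict.
    """
    return dict(attrs)


# Sample values copied from the two .attrs outputs shown in this answer.
v3_attrs = [(u'onload', u'javascript:if(this.width>950) this.width=950')]
v4_attrs = {
    'onload': 'javascript:if(this.width>950) this.width=950',
    'src': 'http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg',
}

# The version 3 parse never captured src, so the lookup comes back empty.
print(attrs_as_dict(v3_attrs).get('src'))  # None
print(attrs_as_dict(v4_attrs).get('src'))
```

Normalizing does not recover anything version 3 dropped; it only gives both results the same interface.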
So what to do? Get BeautifulSoup 4. As shown above, it parses this HTML correctly and keeps the src attribute.
By the way, if all you want is just the src, calling .attrs is not needed:
>>> print soup.findAll('img')[0].get('src')
http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg
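For reference, here is roughly how the same lookup might look as a standalone script. It is a sketch under a few assumptions not stated above: Python 3, the bs4 package installed, and the built-in "html.parser" backend passed explicitly (the session above does not name a parser).

```python
from bs4 import BeautifulSoup

html = '''<img onload='javascript:if(this.width>950) this.width=950'
src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">'''

# Assumption: "html.parser" is chosen here only because it ships with Python.
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img")

# .get() returns None for a missing attribute instead of raising.
src = img.get("src")
missing = img.get("title")

# Collect src from every img tag that actually has one.
all_srcs = [tag["src"] for tag in soup.find_all("img", src=True)]

print(src)
```

Indexing with img["src"] also works, but it raises KeyError when the attribute is absent, so .get() is the safer choice when scraping pages whose markup you do not control.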