
BeautifulSoup can't extract the src attribute from an img tag

Here is my code:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

html = '''<img onload='javascript:if(this.width>950) this.width=950'
src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">'''
soup = BeautifulSoup(html)
imgs = soup.findAll('img')

print imgs[0].attrs

This prints [(u'onload', u'javascript:if(this.width>950) this.width=950')]

So where is the src attribute?

If I replace html with something like html = '''<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />'''

I get the correct result: [(u'src', u'/image/fluffybunny.jpg'), (u'title', u'Harvey the bunny'), (u'alt', u'a cute little fluffy bunny')]

I am quite new to HTML and BeautifulSoup. Am I missing something? Thanks for any ideas.

foresightyj asked Feb 17 '23 19:02


1 Answer

I tested this with both BeautifulSoup 3 and BeautifulSoup 4, and bs4 (version 4) parses your HTML correctly where version 3 does not.

With BeautifulSoup 3:

>>> html = """<img onload='javascript:if(this.width>950) this.width=950' src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">"""
>>> soup = BeautifulSoup(html) # Version 3 of BeautifulSoup
>>> print soup
<img onload="javascript:if(this.width&gt;950) this.width=950" />950) this.width=950' src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg"&gt;

Notice how > is now &gt;, and how everything after the first > (including src) has been pushed out of the tag and left behind as plain text.

In other words, BeautifulSoup 3 ends the img tag at the first > it sees, even though that > sits inside a quoted attribute value. If you were to print soup.img, you would get only:

<img onload="javascript:if(this.width&gt;950) this.width=950" />

So the src attribute never makes it into the tag.
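
For what it's worth, a > inside a quoted attribute value is legal HTML, so the markup itself is not at fault. As a rough cross-check that does not involve BeautifulSoup at all, the sketch below feeds the same tag to the html.parser module from the Python 3 standard library. The class name AttrCollector is made up for this illustration.

```python
from html.parser import HTMLParser

HTML = """<img onload='javascript:if(this.width>950) this.width=950'
src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">"""


class AttrCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every start tag seen."""

    def __init__(self):
        super().__init__()
        self.tags = []  # list of (tag_name, attrs_dict) pairs

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.tags.append((tag, dict(attrs)))


collector = AttrCollector()
collector.feed(HTML)
collector.close()

# Print each tag name with the attribute names the parser found on it.
for tag, attrs in collector.tags:
    print(tag, sorted(attrs))
```

If the parser treats the quoted > as part of the value, both onload and src should be listed for the single img tag.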

With bs4 (BeautifulSoup 4, the current version):

>>> html = '''<img onload='javascript:if(this.width>950) this.width=950' src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">'''
>>> soup = BeautifulSoup(html) # Version 4 of BeautifulSoup
>>> print soup
<html><body><img onload="javascript:if(this.width&gt;950) this.width=950" src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg"/></body></html>

Now for .attrs: in BeautifulSoup 3 it returns a list of tuples, as you have discovered. In BeautifulSoup 4 it returns a dictionary:

>>> print soup.findAll('img')[0].attrs # Version 3
[(u'onload', u'javascript:if(this.width>950) this.width=950')]

>>> print soup.findAll('img')[0].attrs # Version 4
{'onload': 'javascript:if(this.width>950) this.width=950', 'src': 'http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg'}
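
Because .attrs has a different shape in the two versions, code that has to cope with either can normalise it first. Here is a minimal sketch, assuming only the two shapes shown above (a list of (name, value) tuples or a dictionary). attrs_as_dict is a hypothetical helper, not part of BeautifulSoup, and the sample values are copied from the output above rather than produced by a parser.

```python
def attrs_as_dict(attrs):
    """Hypothetical helper: return tag attributes as a plain dict.

    dict() accepts both a list of (name, value) pairs (BeautifulSoup 3 style)
    and an existing mapping (BeautifulSoup 4 style).
    """
    return dict(attrs)


# Sample values copied from the output above, not produced by a parser here.
bs3_style = [("onload", "javascript:if(this.width>950) this.width=950")]
bs4_style = {
    "onload": "javascript:if(this.width>950) this.width=950",
    "src": "http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg",
}

print(attrs_as_dict(bs3_style).get("src"))  # src is absent from the version 3 result
print(attrs_as_dict(bs4_style).get("src"))
```

This only papers over the difference in shape. It cannot bring back the src attribute that version 3 dropped while parsing.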

So what to do? Upgrade to BeautifulSoup 4 (the bs4 package). It parses this HTML correctly and keeps the src attribute.

By the way, if all you want is just the src, calling .attrs is not needed:

>>> print soup.findAll('img')[0].get('src')
http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg
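
For readers on Python 3, the same lookup can be sketched as below. This assumes the beautifulsoup4 package is installed, and it names the parser explicitly ("html.parser", the standard-library backend) so the result does not depend on which optional parsers happen to be available. find_all is the bs4 spelling of findAll.

```python
from bs4 import BeautifulSoup  # assumes: pip install beautifulsoup4

html = """<img onload='javascript:if(this.width>950) this.width=950'
src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">"""

# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(html, "html.parser")

for img in soup.find_all("img"):
    # .get() returns None instead of raising if the attribute is missing
    print(img.get("src"))
```

Using .get('src') rather than img['src'] avoids a KeyError on images that have no src at all.
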
TerryA answered Mar 03 '23 18:03