
Image scraping program in Python not functioning as intended

Tags: python, image

My code only returns an empty string, and I have no idea why.

import urllib2

def getImage(url):
    page = urllib2.urlopen(url)
    page = page.read() #Gives HTML to parse

    start = page.find('<a img=')
    end = page.find('>', start)

    img = page[start:end]

    return img

It would only return the first image it finds, so it's not a very good image scraper; that said, my primary goal right now is just to be able to find an image. I'm unable to.

user1753520 asked Oct 07 '22 07:10



2 Answers

Consider using BeautifulSoup to parse your HTML:

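# Python 2, BeautifulSoup 3
from BeautifulSoup import BeautifulSoup
import urllib

url  = 'http://www.google.com'
html = urllib.urlopen(url).read()  # download the page as a string
soup = BeautifulSoup(html)         # parse it into a tree

# Print the src attribute of every <img> tag on the page
for img in soup.findAll('img'):
    print img['src']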
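
For readers on Python 3 without BeautifulSoup installed, the same idea (let a parser find the tags instead of searching the raw string) can be sketched with the standard library's html.parser. This is only an illustration, not the answer's method: the names ImgSrcCollector and image_sources and the sample markup are made up for the example, and it parses a string rather than fetching a live page.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags that have no src attribute
                self.sources.append(src)


def image_sources(html):
    """Return the src of each <img> in an HTML string, in document order."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return collector.sources


# Hypothetical sample markup, used here instead of a downloaded page.
sample = ('<p><img src="a.png" alt="A"><a href="/x"><img src="b.gif"></a>'
          '<img alt="no src"></p>')
print(image_sources(sample))  # ['a.png', 'b.gif']
```

To use it on a real page, the HTML string would come from something like urllib.request.urlopen(url).read().decode(...), with the encoding depending on the site.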
tehmisvh answered Oct 09 '22 17:10



You should use a library for this and there are several out there, but to answer your question by changing the code you showed us...

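Your problem is that you are searching for '<a img=', but images don't use the <a ...> tag; they use the <img ...> tag. Since '<a img=' does not occur in the page, find returns -1 and the slice comes back empty. Here is an example of an image tag: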

<img src="smiley.gif" alt="Smiley face" height="42" width="42">

Change your start = page.find('<a img=') line to start = page.find('<img '), and slice up to end+1 so that the closing > is included, like so:

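import urllib2

def getImage(url):
    page = urllib2.urlopen(url)
    page = page.read()  # Gives HTML to parse

    start = page.find('<img ')   # index of the first <img> tag, or -1 if none
    end = page.find('>', start)  # closing '>' of that tag

    img = page[start:end+1]      # end+1 so the closing '>' is included
    return img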
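
To see why the original search came back empty, and how the same find-based approach could return every image rather than only the first, here is a rough Python 3 sketch that works on an HTML string. The helper names first_img_tag and all_img_tags and the sample markup are assumptions made for illustration; they are not part of the question's code.

```python
def first_img_tag(page, marker="<img "):
    """Return the first tag starting with marker, or None if it is absent."""
    start = page.find(marker)
    if start == -1:          # str.find signals "not found" with -1
        return None
    end = page.find(">", start)
    if end == -1:            # the tag was never closed
        return None
    return page[start:end + 1]


def all_img_tags(page):
    """Return every <img ...> tag in page, in order of appearance."""
    tags = []
    pos = 0
    while True:
        start = page.find("<img ", pos)
        if start == -1:
            break
        end = page.find(">", start)
        if end == -1:
            break
        tags.append(page[start:end + 1])
        pos = end + 1        # resume the search after this tag
    return tags


# Hypothetical sample markup standing in for a downloaded page.
sample = '<p><img src="smiley.gif" alt="Smiley face"> text <img src="b.png"></p>'

# The original search string is not in the page, so find returns -1 for
# start and the slice comes out empty.
start = sample.find("<a img=")
end = sample.find(">", start)
print(repr(sample[start:end]))   # ''

print(first_img_tag(sample))     # <img src="smiley.gif" alt="Smiley face">
print(all_img_tags(sample))
```

Plain string searching is case-sensitive and will miss variants such as <IMG ...> or an <img followed by a newline, which is one reason a real HTML parser is usually the safer choice.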
bohney answered Oct 09 '22 15:10
