I want to extract the src of the image tag and the text of the anchor tag inside the div with class data. I successfully managed to extract the img src, but am having trouble extracting the text from the anchor tag.
<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1343628292&sr=1-1&keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
Here is the link for the entire HTML page.
Here is my code:
for div in soup.findAll('div', attrs={'class':'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class':'data'}):
        for a in data.findAll('a', attrs={'class':'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']
What I am trying to do is extract the image src (link) and the title inside the div class=data, so for example:
<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1343628292&sr=1-1&keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
should extract:
Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)
This will help:
from bs4 import BeautifulSoup

data = '''<div class="image">
  <a href="http://www.example.com/eg1">Content1<img src="http://image.example.com/img1.jpg" /></a>
</div>
<div class="image">
  <a href="http://www.example.com/eg2">Content2<img src="http://image.example.com/img2.jpg" /> </a>
</div>'''

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, 'html.parser')

for div in soup.findAll('div', attrs={'class': 'image'}):
    print(div.find('a')['href'])      # link target
    print(div.find('a').contents[0])  # first child of the anchor: its text
    print(div.find('img')['src'])     # image URL
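The question pairs each div with class image with a following div with class data, which the sample above does not cover. A minimal sketch of that case follows, assuming bs4 and a made-up page layout (the markup, URLs and titles are hypothetical stand-ins for the real page). One likely cause of the original trouble is that the next-sibling lookup returns a single tag rather than a list, so looping over it walks that tag's children instead of the tag itself.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question: each div.image is followed
# by a sibling div.data that holds the a.title link.
html = """
<div class="image"><a href="http://www.example.com/eg1"><img src="http://image.example.com/img1.jpg" /></a></div>
<div class="data"><a class="title" href="http://www.example.com/eg1">Title One</a></div>
<div class="image"><a href="http://www.example.com/eg2"><img src="http://image.example.com/img2.jpg" /></a></div>
<div class="data"><a class="title" href="http://www.example.com/eg2">Title Two</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for div in soup.find_all("div", class_="image"):
    img = div.find("img")
    # find_next_sibling returns one Tag (or None), so use it directly
    # instead of iterating over it.
    data = div.find_next_sibling("div", class_="data")
    title = data.find("a", class_="title") if data is not None else None
    results.append((
        img["src"] if img is not None else None,
        title.get_text(strip=True) if title is not None else None,
    ))

for src, text in results:
    print(src, text)
```

The None checks are there because a real page may have an image block with no matching data block; drop them if the layout is known to be regular.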
If you are looking into Amazon products then you should be using the official API. There is at least one Python package that will ease your scraping issues and keep your activity within the terms of use.
In my case, it worked like that:
import urllib  # Python 2; needed for urllib.urlopen below
from BeautifulSoup import BeautifulSoup as bs  # BeautifulSoup 3

url = "http://blabla.com"
soup = bs(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.string
Hope it helps!
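One caveat with link.string: it gives None when the anchor has more than one child, for example text plus an img tag. A small sketch of the difference, assuming bs4 (not BeautifulSoup 3 as in the answer above) and a hypothetical anchor:

```python
from bs4 import BeautifulSoup

# Hypothetical anchor that contains both text and an image.
html = '<a href="http://www.example.com/eg1">Content1<img src="http://image.example.com/img1.jpg" /></a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

# .string is only set when the tag has exactly one child.
only_string = link.string
# get_text() joins the text of all descendants.
all_text = link.get_text(strip=True)

print(only_string)
print(all_text)
```

For an anchor that holds nothing but text, both approaches return the same string, so get_text() is the safer default when the markup is not known in advance.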