
BeautifulSoup: extract text from anchor tag

I want to extract:

  • the src attribute of the image tag, and
  • the text of the anchor tag inside the div with class data

I successfully managed to extract the img src, but I am having trouble extracting the text from the anchor tag.

<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>  

Here is the link for the entire HTML page.

Here is my code:

for div in soup.findAll('div', attrs={'class':'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class':'data'}):
        for a in data.findAll('a', attrs={'class':'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']

What I am trying to do is extract the image src (link) and the title inside the div class=data, so for example:

 <a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>  

should extract:

Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)

add-semi-colons asked Jul 30 '12 06:07



2 Answers

This will help:

from bs4 import BeautifulSoup

data = '''<div class="image">
        <a href="http://www.example.com/eg1">Content1<img
          src="http://image.example.com/img1.jpg" /></a>
        </div>
        <div class="image">
        <a href="http://www.example.com/eg2">Content2<img
          src="http://image.example.com/img2.jpg" /> </a>
        </div>'''

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, 'html.parser')

for div in soup.findAll('div', attrs={'class':'image'}):
    print(div.find('a')['href'])       # link target
    print(div.find('a').contents[0])   # first child of the anchor: its text
    print(div.find('img')['src'])      # image URL

If you are looking into Amazon products then you should be using the official API. There is at least one Python package that will ease your scraping issues and keep your activity within the terms of use.
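For the layout described in the question, where each div with class image is followed by a sibling div with class data that holds the a.title link, a minimal sketch might look like the following. The sample HTML is an assumption modeled on the question's description (the real page markup is not shown here), and extract_items is a hypothetical helper name. Note that find_next_sibling returns a single tag rather than a list, so there is no need to loop over its result.

```python
from bs4 import BeautifulSoup

# Assumed markup: a hypothetical sample modeled on the question's description.
SAMPLE_HTML = """
<div class="image"><a href="http://www.example.com/eg1"><img src="http://image.example.com/img1.jpg" /></a></div>
<div class="data"><h3><a class="title" href="http://www.example.com/eg1">Camera One (Red)</a></h3></div>
<div class="image"><a href="http://www.example.com/eg2"><img src="http://image.example.com/img2.jpg" /></a></div>
<div class="data"><h3><a class="title" href="http://www.example.com/eg2">Camera Two (Black)</a></h3></div>
"""


def extract_items(html):
    """Return a list of (image src, title text) pairs, one per div.image."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for image_div in soup.find_all("div", class_="image"):
        img = image_div.find("img")
        # A single Tag (or None), not a list: the next sibling div.data
        data_div = image_div.find_next_sibling("div", class_="data")
        title = data_div.find("a", class_="title") if data_div else None
        items.append((
            img["src"] if img else None,
            title.get_text(strip=True) if title else None,
        ))
    return items


for src, title in extract_items(SAMPLE_HTML):
    print(src)
    print(title)
```

Returning pairs instead of printing inside the loop keeps each image matched with its own title, and missing elements show up as None rather than raising an error.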

daedalus answered Oct 09 '22 11:10



In my case, this worked:

# Python 2 with BeautifulSoup 3
import urllib
from BeautifulSoup import BeautifulSoup as bs

url = "http://blabla.com"

soup = bs(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.string

Hope it helps!
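One caveat with .string, sketched here with bs4 on Python 3 (where urllib.urlopen has moved to urllib.request.urlopen): .string is None when a tag has more than one child, for example an anchor that wraps both text and an img, while get_text() still returns the text. The HTML snippet and the link_texts helper below are hypothetical and for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one plain anchor, one anchor wrapping text plus an image.
html = '<a href="/a">Plain text</a><a href="/b">Content2<img src="x.jpg" /></a>'
soup = BeautifulSoup(html, "html.parser")


def link_texts(soup):
    """Return (link.string, link.get_text()) for every anchor in the soup."""
    results = []
    for link in soup.find_all("a"):
        results.append((link.string, link.get_text(strip=True)))
    return results


for string_value, text_value in link_texts(soup):
    print(repr(string_value), repr(text_value))
```

If the anchors on the target page can contain nested tags, get_text() is probably the safer choice.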

Pontios answered Oct 09 '22 10:10
