BeautifulSoup: extract text from anchor tag

Question

I want to extract:

the src of the image tag, and
the text of the anchor tag inside the div with class data

I managed to extract the img src, but I am having trouble extracting the text from the anchor tag.

<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>

Here is the link for the entire HTML page.

Here is my code:

for div in soup.findAll('div', attrs={'class':'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class':'data'}):
        for a in data.findAll('a', attrs={'class':'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']

What I am trying to do is extract the image src (link) and the title inside the div class=data, so for example:

 <a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>

should extract:

Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)

daedalus · Accepted Answer

This will help:

from bs4 import BeautifulSoup

data = '''<div class="image">
        <a href="http://www.example.com/eg1">Content1<img
          src="http://image.example.com/img1.jpg" /></a>
        </div>
        <div class="image">
        <a href="http://www.example.com/eg2">Content2<img
          src="http://image.example.com/img2.jpg" /> </a>
        </div>'''

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(data, 'html.parser')

for div in soup.find_all('div', attrs={'class': 'image'}):
    print(div.find('a')['href'])        # link target
    print(div.find('a').contents[0])    # first child of the anchor: its text
    print(div.find('img')['src'])       # image URL

If you are looking into Amazon products then you should be using the official API. There is at least one Python package that will ease your scraping issues and keep your activity within the terms of use.
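The sample above puts the text and the image inside the same div, whereas the question describes a div with class image followed by a sibling div with class data. A likely cause of the trouble in the question's code is that the sibling lookup returns a single tag rather than a list, so looping over it walks that tag's children instead of the div itself. Below is a minimal sketch of one way to handle that layout with bs4. The markup and the function name extract_products are hypothetical, modelled on the description in the question rather than on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described in the question:
# each div.image is followed by a sibling div.data holding a.title.
html = """
<div class="image"><a href="/p/1"><img src="http://image.example.com/img1.jpg" /></a></div>
<div class="data"><h3><a class="title" href="/p/1">Camera One</a></h3></div>
<div class="image"><a href="/p/2"><img src="http://image.example.com/img2.jpg" /></a></div>
<div class="data"><h3><a class="title" href="/p/2">Camera Two</a></h3></div>
"""


def extract_products(markup):
    """Return a list of (image src, title text) pairs, one per div.image."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for image_div in soup.find_all("div", class_="image"):
        img = image_div.find("img")
        # find_next_sibling returns one Tag (or None), so call find() on it
        # directly instead of iterating over it.
        data_div = image_div.find_next_sibling("div", class_="data")
        title = data_div.find("a", class_="title") if data_div else None
        results.append((
            img["src"] if img else None,
            title.get_text(strip=True) if title else None,
        ))
    return results


products = extract_products(html)
for src, title in products:
    print(src, title)
```

The None checks are there because a real page may contain an image div without a matching data div; whether that happens on the page in question is not known.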

Pontios · Answer

In my case, it worked like that:

# Python 2 with BeautifulSoup 3
import urllib
from BeautifulSoup import BeautifulSoup as bs

url = "http://blabla.com"

soup = bs(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.string  # text of each anchor

Hope it helps!
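One caveat with .string: Beautiful Soup defines it as None when a tag has more than one child, for example an anchor that contains both text and an img. The sketch below illustrates the difference using bs4 and an anchor borrowed from the accepted answer's sample markup; get_text() is shown as one alternative, not as the only option.

```python
from bs4 import BeautifulSoup

# Anchor with two children: a text node and an <img> tag.
html = ('<a href="http://www.example.com/eg1">Content1'
        '<img src="http://image.example.com/img1.jpg" /></a>')

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

# .string is None here because the anchor has more than one child.
string_value = link.string

# get_text() joins all text inside the tag, regardless of nesting.
text_value = link.get_text(strip=True)

print(string_value)
print(text_value)
```

For anchors that contain only text, as in the question's a.title example, .string and get_text() give the same result.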

Tags:

python

html

beautifulsoup

tags

scraper

add-semi-colons
