Getting BeautifulSoup to find a specific <p>

I'm trying to put together a basic HTML scraper for a variety of scientific journal websites, specifically trying to get the abstract or introductory paragraph.

The current journal I'm working on is Nature, and the article I've been using as my sample can be seen at http://www.nature.com/nature/journal/v463/n7284/abs/nature08715.html.

I can't get the abstract out of that page, however. I'm searching for everything between the <p class="lead">...</p> tags, but I can't figure out how to isolate them. I thought it would be something as simple as:

from BeautifulSoup import BeautifulSoup
import re
import urllib2

address="http://www.nature.com/nature/journal/v463/n7284/full/nature08715.html"
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)

abstract = soup.find('p', attrs={'class' : 'lead'})
print abstract

With Python 2.5 and BeautifulSoup 3.0.8, running this prints None. I can't use anything else that has to be compiled or installed (like lxml). Is BeautifulSoup confused, or am I?

asked Mar 26 '10 06:03

Ryan

2 Answers

That HTML is quite malformed: xml.dom.minidom cannot parse it, and BeautifulSoup does not parse it well either.

I removed the HTML comments (the <!-- ... --> parts) and parsed it again with BeautifulSoup. The result looks better, and soup.find('p', attrs={'class' : 'lead'}) now works.

Here's the code I tried:

>>> html =re.sub(re.compile("<!--.*?-->",re.DOTALL),"",html)
>>>
>>> soup=BeautifulSoup(html)
>>>
>>> soup.find('p', attrs={'class' : 'lead'})
<p class="lead">The class of exotic Jupiter-mass planets that orb  .....
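To see the comment-stripping step on its own, here is a minimal sketch using only the standard library. It is written for Python 3 (the thread itself uses Python 2.5), and the helper name strip_html_comments and the sample markup are made up for illustration; the sample is not the actual Nature page.

```python
import re

# Non-greedy ".*?" removes each comment separately;
# re.DOTALL lets "." match newlines so multi-line comments are caught too.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_html_comments(html):
    """Return html with every <!-- ... --> block removed (hypothetical helper)."""
    return COMMENT_RE.sub("", html)

# Hypothetical sample markup with a multi-line comment containing stray tags.
sample = """<div><!-- a malformed
comment block <p> -->
<p class="lead">Abstract text.</p><!-- tail --></div>"""

cleaned = strip_html_comments(sample)
print(cleaned)
```

The cleaned string can then be handed to BeautifulSoup exactly as in the session above. Whether this is enough for other journal pages is an open question; it only addresses comments that trip up the parser.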

answered Sep 28 '22 15:09

YOU

Here's a non-BeautifulSoup way to get the abstract.

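import urllib2

address = "http://www.nature.com/nature/journal/v463/n7284/full/nature08715.html"
html = urllib2.urlopen(address).read()
# Split on closing tags so each chunk holds at most one paragraph
for para in html.split("</p>"):
    if '<p class="lead">' in para:
        # Keep only the text after the opening tag
        abstract = para.split('<p class="lead">')[1]
        # Join the lines into a single space-separated string
        print ' '.join(abstract.split("\n"))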
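The same string-splitting idea can be wrapped in a small function that returns None when no matching paragraph exists. This is only a sketch, written for Python 3 rather than the Python 2 used in the thread; extract_lead_paragraph and the sample string are hypothetical, and it assumes the opening tag appears in the page exactly as written (same attribute order and quoting).

```python
def extract_lead_paragraph(html, open_tag='<p class="lead">', close_tag="</p>"):
    """Return the text between the first open_tag and the next close_tag, or None."""
    # partition() splits at the first occurrence; "found" is empty if there is none.
    _, found, rest = html.partition(open_tag)
    if not found:
        return None
    body, _, _ = rest.partition(close_tag)
    # Collapse newlines and runs of whitespace into single spaces.
    return " ".join(body.split())

# Hypothetical sample markup, not the actual Nature page.
sample = '<html><p>Intro</p><p class="lead">The class of\nexotic planets.</p></html>'

abstract = extract_lead_paragraph(sample)
missing = extract_lead_paragraph("<p>no lead here</p>")
print(abstract)
print(missing)
```

Like the loop above, this does no real HTML parsing, so nested tags inside the paragraph are returned as-is.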

answered Sep 28 '22 15:09

ghostdog74

Tags:

python

beautifulsoup

html-content-extraction
