
Python: using Beautiful Soup for HTML processing of specific content

I've decided to parse content from a website, for example: http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx

I want to parse the ingredients into a text file. In the page source they live in a container div, with each ingredient in its own list item, so the markup looks roughly like this:

<div class="ingredients" style="margin-top: 10px;">
    ...
    <li class="plaincharacterwrap">1/4 cup olive oil</li>
    ...
</div>

Someone was nice enough to provide code using regex, but it gets confusing to adapt from site to site. So I wanted to use Beautiful Soup instead, since it has a lot of built-in features. Except I'm confused about how to actually do it.

Code:

import sys
import urllib2
from BeautifulSoup import BeautifulSoup

try:
    html = urllib2.urlopen("http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx")
except IOError:
    print 'IO error'
    sys.exit(1)

soup = BeautifulSoup(html)

# this should find the ingredients container
ingrdiv = soup.find('div', attrs={'class': 'ingredients'})

Is this kind of how you get started? I want to find the actual div class and then parse out all those ingredients located within the li class.
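
I'm guessing the next step is something like this, but I'm not sure:

for li in ingrdiv.findAll('li'):
    print li.string   # or however you actually get the text out of each one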

Any help would be appreciated! Thanks!

Asked by Eric on Apr 11 '11

1 Answer

import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    # grab the ingredients container, then the text of each <li> inside it
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]

    fname = 'PorkChopsRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()

results in

1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste
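
Update: if you're on Python 3 and BeautifulSoup 4 these days, the same approach looks roughly like this (a sketch only; the page layout has surely changed since 2011, so the selectors are assumptions):

from urllib.request import urlopen
from bs4 import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urlopen(url).read()
    bs = BeautifulSoup(data, "html.parser")

    # find_all() replaces BS3's findAll(), get_text() replaces getText()
    ingreds = bs.find("div", {"class": "ingredients"})
    ingreds = [s.get_text().strip() for s in ingreds.find_all("li")]

    with open("PorkChopsRecipe.txt", "w") as outf:
        outf.write("\n".join(ingreds))

if __name__ == "__main__":
    main()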



Follow-up response to @eyquem:

from time import clock
import urllib
import re
import BeautifulSoup
import lxml.html

start = clock()
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
print "Loading took", (clock()-start), "s"

# by regex
start = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
print "Regex parse took", (clock()-start), "s"

# by BeautifulSoup
start = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
res2 = '\n'.join(s.getText().strip() for s in ingreds.findAll('li'))
print "BeautifulSoup parse took", (clock()-start), "s  - same =", (res2==res1)

# by lxml
start = clock()
lx = lxml.html.fromstring(data)
ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
res3 = '\n'.join(s.strip() for s in ingreds)
print "lxml parse took", (clock()-start), "s  - same =", (res3==res1)

gives

Loading took 1.09091222621 s
Regex parse took 0.000432703726233 s
BeautifulSoup parse took 0.28126133314 s  - same = True
lxml parse took 0.0100940499505 s  - same = True

Regex is much faster, except when it's wrong: a pattern that specific breaks as soon as the site's markup or whitespace shifts. But if you count loading the page along with parsing it, BeautifulSoup's parse is still only about 20% of the total runtime. If you are terribly concerned about speed, I recommend lxml instead.
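
For example, an lxml-only version of the script above might look like this (again a sketch, with the same caveat that the XPath assumes the 2011 page layout):

import urllib2
import lxml.html

def main():
    url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
    data = urllib2.urlopen(url).read()

    # text() grabs the text node inside each <li> under the ingredients div
    lx = lxml.html.fromstring(data)
    ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')

    with open('PorkChopsRecipe.txt', 'w') as outf:
        outf.write('\n'.join(s.strip() for s in ingreds))

if __name__ == "__main__":
    main()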

Answered by Hugh Bothwell