 

How to extract a list of items using scrapely?

I'm using scrapely to extract data from some HTML, but I'm having difficulties extracting a list of items.

The scrapely GitHub project describes only a simple example:

from scrapely import Scraper
s = Scraper()

s.train(url, data)
s.scrape(another_url)

This is nice if, for example, you are trying to extract data as described:

Usage (API)

Scrapely has a powerful API, including a template format that can be edited externally, that you can use to build very capable scrapers.

What follows that section is a quick example of the simplest possible usage, that you can run in a Python shell.

However, I'm not sure how to extract the data when the page contains a list like this:

Ingredientes

- 50 gr de hojas de albahaca
- 4 cucharadas (60 ml) de piñones
- 2 - 4 dientes de ajo
- 120 ml (1/2 vaso) de aceite de oliva virgen extra
- 115 gr de queso parmesano recién rallado
- 25 gr de queso pecorino recién rallado ( o queso de leche de oveja curado)

I know I can't extract this by using XPath or CSS selectors, but I'm more interested in using parsers that can extract data from similar pages.

rkmax asked Jan 06 '23 23:01



1 Answer

Scrapely can be trained to extract a list of items. The trick is to pass the first and last items of the list to be extracted as a Python list when training. Here is an example inspired by the question (training on a 10-item ingredient list from url1, testing on a 7-item list from url2):

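from scrapely import Scraper

s = Scraper()

# Training page: its ingredient list has 10 items.
url1 = 'http://www.sabormediterraneo.com/recetas/postres/leche_frita.htm'
# Pass only the first and last items of the list, as a Python list.
data = {'ingreds': ['medio litro de leche',
  u'canela y az\xfacar para espolvorear']}
s.train(url1, data)

# Test page: a different recipe with a 7-item ingredient list.
url2 = 'http://www.sabormediterraneo.com/recetas/cordero_horno.htm'
# print() with a single argument behaves the same on Python 2 and 3.
print(s.scrape(url2))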

Here is the output (a Python 2 repr, hence the u'' prefixes):

[{u'ingreds': [
  u' 2 piernas o dos paletillas de cordero lechal o recental ',
  u'3 dientes de ajo',
  u'una copita de vino tinto / o / blanco',
  u'una copita de agua',
  u'media copita de aceite de oliva',
  u'or\xe9gano, perejil',
  u'sal, pimienta negra y aceite de oliva']}]

Training on the question's ingredient list (http://www.sabormediterraneo.com/cocina/salsas6.htm) did not generalize directly to the "recetas" pages. One solution would be to train several scrapers and then check which one works on a given page. (Training one scraper on several pages did not give a general solution in a quick test of mine.)
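The "train several scrapers and check which one works" idea could be sketched roughly as follows. This is only a minimal illustration, not something I ran against the site: the helper name scrape_with_first_match is hypothetical, and it assumes each scraper exposes a scrape(url) method that returns a list of dicts (as in the output above) or something empty when its template does not fit the page.

```python
def scrape_with_first_match(scrapers, url, field):
    """Try each trained scraper in turn and return the first non-empty
    list found under `field`, or None if no scraper extracts anything.

    Hypothetical helper: `scrapers` is any iterable of objects with a
    scrape(url) method, e.g. scrapely Scraper instances that were each
    trained on a different page layout.
    """
    for scraper in scrapers:
        # Assumption: a scraper that does not fit the page yields None,
        # an empty list, or records without a usable `field` value.
        records = scraper.scrape(url) or []
        for record in records:
            items = record.get(field)
            if items:
                return items
    return None


# Intended usage (not executed here, requires scrapely and network access):
#   recetas = Scraper(); recetas.train(url1, data)
#   salsas = Scraper();  salsas.train(salsas_url, salsas_data)
#   ingreds = scrape_with_first_match([recetas, salsas], url2, 'ingreds')
```

Treating "empty result" as "this scraper does not work here" is a simplification; a scraper may also return a wrong but non-empty list, so a sanity check on the extracted items (for example a minimum number of entries) may still be needed.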

Ulrich Stern answered Jan 14 '23 19:01
