Running the crawler doesn't get the same data it does when training

Question

When I train my crawler on a Yelp page, it picks up all the information without any extra work from me. When I then run the crawler, the address is not recognized and is not recorded.

Nick Scott · Accepted Answer

Getting company data from Yelp

In this case we want to get addresses for companies in San Francisco from the website www.yelp.com.

Analysis of the site

We can get a list of companies beginning with the letter “A” from this page:

http://www.yelp.com/sm/san-francisco-ca-us/a/1

This directory page tells us there are 42 pages of results for “A” with up to 80 results per page.

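This is good news: the page number is the last segment of the URL, so every page of results can be reached by changing that number.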

Create an API

I am now going to create an API to get the data from the first page, and then use Bulk Extract to pass in a list of URLs covering all 42 pages.

Using Magic, I can generate an API in just a few clicks:

Go to Magic.import.io
Paste in the URL to the Yelp page (link above)
Click “Extract Data”
Click “Get API”
Click “Copy this to My Data”

Now we have an API!

(Note that if you need more control over what to include in or exclude from the API, you can use an Extractor.)

Generate a list of URLs

To generate a list of URLs that will allow us to get data from pages 1 through 42, I am going to use an external service hosted at:

http://texttool.blogspot.co.uk/

Locate the "generate list of numbers" tool and generate a list of URLs:

http://www.yelp.com/sm/san-francisco-ca-us/a/1
http://www.yelp.com/sm/san-francisco-ca-us/a/2
http://www.yelp.com/sm/san-francisco-ca-us/a/3
http://www.yelp.com/sm/san-francisco-ca-us/a/4
http://www.yelp.com/sm/san-francisco-ca-us/a/5
http://www.yelp.com/sm/san-francisco-ca-us/a/6
http://www.yelp.com/sm/san-francisco-ca-us/a/7
http://www.yelp.com/sm/san-francisco-ca-us/a/8
http://www.yelp.com/sm/san-francisco-ca-us/a/9
http://www.yelp.com/sm/san-francisco-ca-us/a/10
http://www.yelp.com/sm/san-francisco-ca-us/a/11
http://www.yelp.com/sm/san-francisco-ca-us/a/12
http://www.yelp.com/sm/san-francisco-ca-us/a/13
http://www.yelp.com/sm/san-francisco-ca-us/a/14
http://www.yelp.com/sm/san-francisco-ca-us/a/15
http://www.yelp.com/sm/san-francisco-ca-us/a/16
http://www.yelp.com/sm/san-francisco-ca-us/a/17
http://www.yelp.com/sm/san-francisco-ca-us/a/18
http://www.yelp.com/sm/san-francisco-ca-us/a/19
http://www.yelp.com/sm/san-francisco-ca-us/a/20
http://www.yelp.com/sm/san-francisco-ca-us/a/21
http://www.yelp.com/sm/san-francisco-ca-us/a/22
http://www.yelp.com/sm/san-francisco-ca-us/a/23
http://www.yelp.com/sm/san-francisco-ca-us/a/24
http://www.yelp.com/sm/san-francisco-ca-us/a/25
http://www.yelp.com/sm/san-francisco-ca-us/a/26
http://www.yelp.com/sm/san-francisco-ca-us/a/27
http://www.yelp.com/sm/san-francisco-ca-us/a/28
http://www.yelp.com/sm/san-francisco-ca-us/a/29
http://www.yelp.com/sm/san-francisco-ca-us/a/30
http://www.yelp.com/sm/san-francisco-ca-us/a/31
http://www.yelp.com/sm/san-francisco-ca-us/a/32
http://www.yelp.com/sm/san-francisco-ca-us/a/33
http://www.yelp.com/sm/san-francisco-ca-us/a/34
http://www.yelp.com/sm/san-francisco-ca-us/a/35
http://www.yelp.com/sm/san-francisco-ca-us/a/36
http://www.yelp.com/sm/san-francisco-ca-us/a/37
http://www.yelp.com/sm/san-francisco-ca-us/a/38
http://www.yelp.com/sm/san-francisco-ca-us/a/39
http://www.yelp.com/sm/san-francisco-ca-us/a/40
http://www.yelp.com/sm/san-francisco-ca-us/a/41
http://www.yelp.com/sm/san-francisco-ca-us/a/42
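
If you would rather not depend on an external tool, the same list can be produced with a few lines of Python. This is only a minimal sketch: the function name `build_page_urls` is made up for illustration, and the page range of 1 to 42 is taken from the directory page described above.

```python
# Base URL of the Yelp directory for San Francisco companies starting with "A".
BASE_URL = "http://www.yelp.com/sm/san-francisco-ca-us/a/"


def build_page_urls(first_page=1, last_page=42):
    """Return one directory URL per page number, including both ends."""
    return [f"{BASE_URL}{page}" for page in range(first_page, last_page + 1)]


if __name__ == "__main__":
    # Print one URL per line, ready to paste into Bulk Extract.
    print("\n".join(build_page_urls()))
```

For a different letter or a different number of pages, change `BASE_URL` and `last_page` to match what the directory page reports.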

Bulk Extract

Now you can use Bulk Extract to get data from each of those URLs in one go.

To do this:

Go to the Configure tab on your Yelp API
Select Bulk Extract from the drop down
Paste in the list of 42 URLs
Click “Run Queries”

Note: you may get some failed queries. By clicking the “X URLs failed” text you can retry the failed queries.
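
Bulk Extract handles retries for you in the interface. If you ever run the same list of URLs from your own script instead, the retry idea could look roughly like the sketch below. Everything here is an assumption for illustration: `run_with_retries` is a made-up helper, and `fetch` stands for whatever function you use to query a single URL (it is not an import.io call).

```python
def run_with_retries(urls, fetch, max_attempts=3):
    """Call fetch(url) for every URL, retrying only the ones that failed.

    Returns (results, failed): results maps each successful URL to the
    value fetch returned; failed lists URLs that never succeeded.
    """
    results = {}
    pending = list(urls)
    for _ in range(max_attempts):
        failed = []
        for url in pending:
            try:
                results[url] = fetch(url)
            except Exception:
                # Remember the failure and try this URL again on the next pass.
                failed.append(url)
        pending = failed
        if not pending:
            break
    return results, pending
```

Anything still in the second return value after the last attempt is the equivalent of the “X URLs failed” list and can be retried later.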

Export

You can now export this data to a spreadsheet, as HTML or JSON.

Further reading

http://support.import.io/knowledgebase/articles/669784-getting-company-data-from-yelp

Amr Ali · Answer

You should use XPath to select whatever you want on Yelp. I have done this before for Yelp, and XPath is more accurate than manual training.
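
To give a rough idea of what selecting with XPath looks like, here is a small Python sketch. The markup in `SAMPLE_HTML` is a made-up placeholder, not Yelp's real page structure, and the expression `.//address` is only an assumption about where an address might live. The standard library's `xml.etree.ElementTree` supports a limited subset of XPath and needs well-formed markup, so a real page would normally need an HTML-aware parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for one listing on a page.
SAMPLE_HTML = """
<div class="listing">
  <a class="biz-name">Example Company</a>
  <address>
    123 Example Street,
    San Francisco, CA
  </address>
</div>
"""


def extract_addresses(markup):
    """Return the text of every <address> element, with whitespace collapsed."""
    root = ET.fromstring(markup.strip())
    addresses = []
    for node in root.iterfind(".//address"):
        # Join all text inside the element, then squeeze runs of whitespace.
        text = "".join(node.itertext())
        addresses.append(" ".join(text.split()))
    return addresses


if __name__ == "__main__":
    print(extract_addresses(SAMPLE_HTML))
```

The point of the approach is that the selector names the element directly, so it does not depend on how the page happened to look during training.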

Tags:

import.io

Ian Woodfill