
Running the crawler doesn't get the same data it does when training

Tags:

import.io

When I train my crawler on a Yelp page, it picks up all the information without me doing anything. But when I run the crawler, the address is not recognized and is not recorded.

Ian Woodfill asked Jul 13 '15 23:07


2 Answers

Getting company data from Yelp

In this case we want to get addresses for companies in San Francisco from the website www.yelp.com.

Analysis of the site

We can get a list of companies beginning with the letter “A” from this page:

http://www.yelp.com/sm/san-francisco-ca-us/a/1

This directory page tells us there are 42 pages of results for “A” with up to 80 results per page.

This is good news.

Create an API

I am now going to create an API to get the data from the first page, and then use Bulk Extract to pass in a list of URLs for all 42 pages.

Using Magic, I can generate an API in just a few clicks:

  1. Go to Magic.import.io
  2. Paste in the URL to the Yelp page (link above)
  3. Click “Extract Data”
  4. Click “Get API”
  5. Click “Copy this to My Data”

Now we have an API!

(Note that if you need more control over what to include or exclude from the API, you can use an Extractor instead.)

Generate a list of URLs

To generate a list of URLs that will allow us to get data from pages 1 through 42 I am going to use an external service hosted at:

http://texttool.blogspot.co.uk/

Locate the "generate list of numbers" tool and generate a list of URLs:

http://www.yelp.com/sm/san-francisco-ca-us/a/1
http://www.yelp.com/sm/san-francisco-ca-us/a/2
http://www.yelp.com/sm/san-francisco-ca-us/a/3
http://www.yelp.com/sm/san-francisco-ca-us/a/4
http://www.yelp.com/sm/san-francisco-ca-us/a/5
http://www.yelp.com/sm/san-francisco-ca-us/a/6
http://www.yelp.com/sm/san-francisco-ca-us/a/7
http://www.yelp.com/sm/san-francisco-ca-us/a/8
http://www.yelp.com/sm/san-francisco-ca-us/a/9
http://www.yelp.com/sm/san-francisco-ca-us/a/10
http://www.yelp.com/sm/san-francisco-ca-us/a/11
http://www.yelp.com/sm/san-francisco-ca-us/a/12
http://www.yelp.com/sm/san-francisco-ca-us/a/13
http://www.yelp.com/sm/san-francisco-ca-us/a/14
http://www.yelp.com/sm/san-francisco-ca-us/a/15
http://www.yelp.com/sm/san-francisco-ca-us/a/16
http://www.yelp.com/sm/san-francisco-ca-us/a/17
http://www.yelp.com/sm/san-francisco-ca-us/a/18
http://www.yelp.com/sm/san-francisco-ca-us/a/19
http://www.yelp.com/sm/san-francisco-ca-us/a/20
http://www.yelp.com/sm/san-francisco-ca-us/a/21
http://www.yelp.com/sm/san-francisco-ca-us/a/22
http://www.yelp.com/sm/san-francisco-ca-us/a/23
http://www.yelp.com/sm/san-francisco-ca-us/a/24
http://www.yelp.com/sm/san-francisco-ca-us/a/25
http://www.yelp.com/sm/san-francisco-ca-us/a/26
http://www.yelp.com/sm/san-francisco-ca-us/a/27
http://www.yelp.com/sm/san-francisco-ca-us/a/28
http://www.yelp.com/sm/san-francisco-ca-us/a/29
http://www.yelp.com/sm/san-francisco-ca-us/a/30
http://www.yelp.com/sm/san-francisco-ca-us/a/31
http://www.yelp.com/sm/san-francisco-ca-us/a/32
http://www.yelp.com/sm/san-francisco-ca-us/a/33
http://www.yelp.com/sm/san-francisco-ca-us/a/34
http://www.yelp.com/sm/san-francisco-ca-us/a/35
http://www.yelp.com/sm/san-francisco-ca-us/a/36
http://www.yelp.com/sm/san-francisco-ca-us/a/37
http://www.yelp.com/sm/san-francisco-ca-us/a/38
http://www.yelp.com/sm/san-francisco-ca-us/a/39
http://www.yelp.com/sm/san-francisco-ca-us/a/40
http://www.yelp.com/sm/san-francisco-ca-us/a/41
http://www.yelp.com/sm/san-francisco-ca-us/a/42
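If you would rather not depend on an external tool, the same list can be produced with a few lines of Python. This is only a minimal sketch: it assumes the URL pattern and the 42-page count shown above, and the names `BASE_URL`, `TOTAL_PAGES` and `build_page_urls` are made up for this example.

```python
# URL pattern taken from the directory page above; {page} is the page number.
BASE_URL = "http://www.yelp.com/sm/san-francisco-ca-us/a/{page}"

# Page count reported by the directory page for the letter "A" (assumed unchanged).
TOTAL_PAGES = 42


def build_page_urls(total_pages=TOTAL_PAGES):
    """Return one URL per directory page, numbered from 1 to total_pages."""
    return [BASE_URL.format(page=n) for n in range(1, total_pages + 1)]


urls = build_page_urls()

# Print one URL per line, ready to paste into Bulk Extract.
print("\n".join(urls))
```

The printed output can be pasted straight into the Bulk Extract box described in the next step.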

Bulk Extract

Now you can use Bulk Extract to get data from each of those URLs in one go.

To do this:

  1. Go to the Configure tab on your Yelp API
  2. Select Bulk Extract from the drop down
  3. Paste in the list of 42 URLs
  4. Click “Run Queries”

Note: you may get some failed queries. By clicking the “X URLs failed” text you can retry the failed queries.
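If you end up running the queries from your own script rather than the Bulk Extract screen, the same "retry only what failed" idea could look roughly like the sketch below. It is not import.io code: `run_with_retries` is a hypothetical helper, and `fetch` stands for whatever function you use to query a single URL (assumed to raise an exception on failure).

```python
def run_with_retries(urls, fetch, max_attempts=3):
    """Call fetch(url) for every URL, retrying only the URLs that raised.

    Returns a (results, still_failed) pair: a dict of url -> result for the
    URLs that succeeded, and a list of URLs that failed on every attempt.
    """
    results = {}
    pending = list(urls)
    for _ in range(max_attempts):
        failed = []
        for url in pending:
            try:
                results[url] = fetch(url)
            except Exception:
                # Keep the URL for the next pass instead of giving up.
                failed.append(url)
        pending = failed
        if not pending:
            break
    return results, pending
```

Anything left in the second return value failed on every attempt and probably needs a closer look by hand.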

Export

You can now export this data as a spreadsheet, HTML, or JSON.

Further reading

http://support.import.io/knowledgebase/articles/669784-getting-company-data-from-yelp

Nick Scott answered Oct 05 '22 13:10


You should use XPath to select whatever you want on Yelp. I have done this for Yelp before, and XPath is more accurate than manual training.
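To give an idea of what an XPath selection looks like, here is a small sketch using Python's standard library. Everything about the markup is an assumption: the snippet, the `address` element and the `street-address` class are invented for illustration and are not Yelp's actual page structure, so inspect the real page to find the right expression. Note also that `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup; real pages usually require an HTML parser with fuller XPath support.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a business page (NOT real Yelp markup).
SAMPLE = """
<div class="biz-listing">
  <h1 class="biz-name">Example Cafe</h1>
  <address class="street-address">123 Example St, San Francisco, CA</address>
</div>
"""


def extract_address(markup):
    """Return the text of the first matching address element, or None."""
    root = ET.fromstring(markup.strip())
    # XPath: any <address> descendant whose class attribute is exactly "street-address".
    node = root.find(".//address[@class='street-address']")
    if node is None:
        return None
    # Join all nested text and collapse runs of whitespace.
    return " ".join("".join(node.itertext()).split())


print(extract_address(SAMPLE))
```

Because the expression names the element explicitly, it returns either that element or nothing, which makes a missing address easy to detect.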

Amr Ali answered Oct 05 '22 15:10