When I train my crawler on a Yelp page, it picks up all the information without me doing anything, but when I run the crawler, the address is not recognized and is not recorded.
Getting company data from Yelp
In this case we want to get addresses for companies in San Francisco from the website www.yelp.com.
Analysis of the site
We can get a list of companies beginning with the letter “A” from this page:
http://www.yelp.com/sm/san-francisco-ca-us/a/1
This directory page tells us there are 42 pages of results for “A” with up to 80 results per page.
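This is good news: the page number is the last part of the URL, so we can build the URL for every results page ourselves.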
Create an API
I am now going to create an API to get the data from the first page, and then use Bulk Extract to run it against a list of URLs covering all 42 pages.
Using Magic, I can generate an API in just a few clicks:
Now we have an API!
(Note that if you need more control over what to include or exclude from the API, you can use an Extractor.)
Generate a list of URLs
To generate a list of URLs that will allow us to get data from pages 1 through 42 I am going to use an external service hosted at:
http://texttool.blogspot.co.uk/
Locate the "generate list of numbers" tool and generate a list of URLs:
http://www.yelp.com/sm/san-francisco-ca-us/a/1
http://www.yelp.com/sm/san-francisco-ca-us/a/2
http://www.yelp.com/sm/san-francisco-ca-us/a/3
http://www.yelp.com/sm/san-francisco-ca-us/a/4
http://www.yelp.com/sm/san-francisco-ca-us/a/5
http://www.yelp.com/sm/san-francisco-ca-us/a/6
http://www.yelp.com/sm/san-francisco-ca-us/a/7
http://www.yelp.com/sm/san-francisco-ca-us/a/8
http://www.yelp.com/sm/san-francisco-ca-us/a/9
http://www.yelp.com/sm/san-francisco-ca-us/a/10
http://www.yelp.com/sm/san-francisco-ca-us/a/11
http://www.yelp.com/sm/san-francisco-ca-us/a/12
http://www.yelp.com/sm/san-francisco-ca-us/a/13
http://www.yelp.com/sm/san-francisco-ca-us/a/14
http://www.yelp.com/sm/san-francisco-ca-us/a/15
http://www.yelp.com/sm/san-francisco-ca-us/a/16
http://www.yelp.com/sm/san-francisco-ca-us/a/17
http://www.yelp.com/sm/san-francisco-ca-us/a/18
http://www.yelp.com/sm/san-francisco-ca-us/a/19
http://www.yelp.com/sm/san-francisco-ca-us/a/20
http://www.yelp.com/sm/san-francisco-ca-us/a/21
http://www.yelp.com/sm/san-francisco-ca-us/a/22
http://www.yelp.com/sm/san-francisco-ca-us/a/23
http://www.yelp.com/sm/san-francisco-ca-us/a/24
http://www.yelp.com/sm/san-francisco-ca-us/a/25
http://www.yelp.com/sm/san-francisco-ca-us/a/26
http://www.yelp.com/sm/san-francisco-ca-us/a/27
http://www.yelp.com/sm/san-francisco-ca-us/a/28
http://www.yelp.com/sm/san-francisco-ca-us/a/29
http://www.yelp.com/sm/san-francisco-ca-us/a/30
http://www.yelp.com/sm/san-francisco-ca-us/a/31
http://www.yelp.com/sm/san-francisco-ca-us/a/32
http://www.yelp.com/sm/san-francisco-ca-us/a/33
http://www.yelp.com/sm/san-francisco-ca-us/a/34
http://www.yelp.com/sm/san-francisco-ca-us/a/35
http://www.yelp.com/sm/san-francisco-ca-us/a/36
http://www.yelp.com/sm/san-francisco-ca-us/a/37
http://www.yelp.com/sm/san-francisco-ca-us/a/38
http://www.yelp.com/sm/san-francisco-ca-us/a/39
http://www.yelp.com/sm/san-francisco-ca-us/a/40
http://www.yelp.com/sm/san-francisco-ca-us/a/41
http://www.yelp.com/sm/san-francisco-ca-us/a/42
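If you would rather not depend on an external tool, the same list can be produced with a few lines of Python. This is only a minimal sketch: the base URL and the page count of 42 come from the directory page above, while the names BASE_URL, LAST_PAGE and page_urls are made up for illustration.

```python
# Base URL and page count taken from the directory page for the letter "A".
BASE_URL = "http://www.yelp.com/sm/san-francisco-ca-us/a/"
LAST_PAGE = 42


def page_urls(base=BASE_URL, last_page=LAST_PAGE):
    """Return one URL per results page, numbered from 1 to last_page."""
    return [f"{base}{n}" for n in range(1, last_page + 1)]


urls = page_urls()
# One URL per line, ready to paste into Bulk Extract.
print("\n".join(urls))
```

For another letter or city, change the base URL and the page count to match what the directory page reports.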
Bulk Extract
Now you can use Bulk Extract to get data from each of those URLs in one go.
To do this:
Note: you may get some failed queries. By clicking the “X URLs failed” text you can retry the failed queries.
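The idea of retrying only the failed URLs can be sketched in Python as below. This is not how Bulk Extract is implemented; it is a generic illustration. The function names bulk_fetch and fetch_with_urllib and the max_rounds parameter are assumptions made for this example.

```python
import urllib.request


def fetch_with_urllib(url):
    """Download one page; raises an exception on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def bulk_fetch(urls, fetch=fetch_with_urllib, max_rounds=3):
    """Fetch every URL, retrying only the ones that failed.

    Returns (results, still_failed): a dict mapping URL to fetched data,
    and a list of URLs that failed in every round.
    """
    results = {}
    pending = list(urls)
    for _ in range(max_rounds):
        failed = []
        for url in pending:
            try:
                results[url] = fetch(url)
            except Exception:
                # Remember the failure and try this URL again next round.
                failed.append(url)
        pending = failed
        if not pending:
            break
    return results, pending
```

Passing the fetch function in as a parameter makes it easy to swap in a different downloader or to test the retry logic without touching the network.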
Export
You can now export this data to a spreadsheet, as HTML or JSON.
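If you end up handling the extracted rows yourself, the same kind of export can be sketched with the Python standard library. The row below is a made-up placeholder, not real Yelp data, and the field names "name" and "address" are assumptions about what the API returns.

```python
import csv
import io
import json

# Placeholder row for illustration only; real rows come from your extraction.
rows = [{"name": "Example Business", "address": "1 Example St, San Francisco, CA"}]


def to_csv(rows, fieldnames=("name", "address")):
    """Serialize rows to CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fieldnames))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize rows to a JSON array."""
    return json.dumps(rows, indent=2)


print(to_csv(rows))
print(to_json(rows))
```

The csv module quotes the address automatically because it contains commas, so the columns stay intact when the file is opened in a spreadsheet.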
Further reading
http://support.import.io/knowledgebase/articles/669784-getting-company-data-from-yelp
You should use XPath to select whatever you want on Yelp. I have done it before for Yelp, and XPath is more accurate than manual training.
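To give a rough idea of what selecting by XPath looks like, here is a small Python sketch. The markup is a simplified, made-up stand-in, so the element names and class values ("biz", "name", address) are assumptions and the real Yelp page will differ. It uses the limited XPath subset in the standard library's xml.etree.ElementTree, which only parses well-formed markup; for real pages you would likely need a proper HTML parser with full XPath support.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real Yelp page structure will differ.
SAMPLE = """
<ul>
  <li class="biz"><a class="name">Acme Cafe</a><address>123 Example St, San Francisco, CA</address></li>
  <li class="biz"><a class="name">Alpha Books</a><address>456 Sample Ave, San Francisco, CA</address></li>
</ul>
"""


def extract_addresses(markup):
    """Return (name, address) pairs for every business entry in the markup."""
    root = ET.fromstring(markup)
    rows = []
    # Select every <li class="biz"> anywhere below the root element.
    for item in root.findall(".//li[@class='biz']"):
        name = item.findtext("a[@class='name']", default="").strip()
        address = item.findtext("address", default="").strip()
        rows.append((name, address))
    return rows


for name, address in extract_addresses(SAMPLE):
    print(name, "|", address)
```

Because the selector points at a specific element for the address, it does not depend on the crawler guessing the right column at run time, which is the problem described in the question.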