Chrome extension webscraper.io - how does pagination work with selecting "next"

I am trying to scrape tables from a website using the Google Chrome extension webscraper.io. The extension's tutorial documents how to scrape a website with several pages, say, "page 1", "page 2" and "page 3", where each page is linked directly from the main page.

The website I am trying to scrape, however, only has a "next" button to reach the next page. If I follow the steps in the tutorial and create a link selector for the "next" page, the scraper only visits pages 1 and 2. Creating a separate "next" link for each page is not feasible because there are too many pages. How can I get the webscraper to include all pages? Is there a way to loop through pages using the webscraper extension?

I am aware of this possible duplicate: pagination Chrome web scraper. However, it was not well received and contains no useful answers.

eigenvector asked Jan 12 '17 10:01



1 Answer

According to the advanced documentation here, the solution is to make the "pagination" link selector a parent of itself. The scraper then recursively visits every page and follows its "next" link. In their words:

To extract items from all of the pagination links including the ones that are not visible at the beginning you need to create another Link selector that selects the pagination links. Figure 2 shows how the link selector should be created in the sitemap. When the scraper opens a category link it will extract items that are available in the page. After that it will find the pagination links and also visit those. If the pagination link selector is made a child to itself it will recursively discover all pagination pages.
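
To see why making the pagination selector its own child matters, here is a minimal Python sketch of the traversal. It is not the extension's actual code: the `PAGES` dictionary, the URLs and the `scrape` function are hypothetical stand-ins for a site where each page only links to the next one.

```python
from collections import deque

# Hypothetical site: each page holds some table rows and the href of its
# "next" link (None on the last page).
PAGES = {
    "/list?page=1": {"rows": ["a", "b"], "next": "/list?page=2"},
    "/list?page=2": {"rows": ["c", "d"], "next": "/list?page=3"},
    "/list?page=3": {"rows": ["e"], "next": None},
}


def scrape(start_url, pagination_is_own_child):
    """Collect rows, following "next" links the way a link selector would.

    The pagination selector always runs on the start page (its parent is the
    root). It runs again on pages reached through a pagination link only if
    the selector is also a child of itself.
    """
    visited = set()
    rows = []
    # Each queue entry: (url, whether the pagination selector runs on it).
    queue = deque([(start_url, True)])
    while queue:
        url, follow_next = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        page = PAGES[url]
        rows.extend(page["rows"])
        next_url = page["next"]
        if follow_next and next_url is not None:
            queue.append((next_url, pagination_is_own_child))
    return rows


print(scrape("/list?page=1", pagination_is_own_child=False))
print(scrape("/list?page=1", pagination_is_own_child=True))
```

With `pagination_is_own_child=False` the sketch stops after the second page, which matches the behaviour described in the question. With `True`, every page reached through a "next" link is searched for a further "next" link, so all pages are visited.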

eigenvector answered Oct 10 '22 01:10