Unable to exhaust the content of all the similar URLs used within my scraper

I've written a scraper in Python using the BeautifulSoup library to parse the names found while traversing the different pages of a website. I could manage it if all the URLs behaved the same way, but they don't: some URLs have pagination and some do not, as their content is sparse.

My question is: how can I handle them all within one function, whether they have pagination or not?

My initial attempt (it can only parse the content of each URL's first page):

import requests 
from bs4 import BeautifulSoup

urls = {
    'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
    'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
    'https://www.mobilehome.net/mobile-home-park-directory/new-hampshire/all',
    'https://www.mobilehome.net/mobile-home-park-directory/vermont/all'
}

def get_names(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.select("td[class='table-row-price']"):
        name = items.select_one("h2 a").text
        print(name)

if __name__ == '__main__':
    for url in urls:
        get_names(url)

I could have managed the whole thing if there were only a single URL with pagination, like below:

from bs4 import BeautifulSoup 
import requests

page_no = 0
page_link = "https://www.mobilehome.net/mobile-home-park-directory/new-hampshire/all/page/{}"

while True:
    page_no += 1
    res = requests.get(page_link.format(page_no))
    soup = BeautifulSoup(res.text, 'lxml')
    container = soup.select("td[class='table-row-price']")
    if len(container) <= 1:  # stop once a page returns no more listing rows
        break

    for content in container:
        title = content.select_one("h2 a").text
        print(title)

But not all of the URLs have pagination. So how can I manage to grab all of them, whether there is pagination or not?

asked May 30 '18 by SIM

2 Answers

This solution attempts to find pagination a tags. If any are found, all the pages are scraped as you iterate over a PageScraper instance; if not, only the first page (the single result) is crawled:

import contextlib

import requests
from bs4 import BeautifulSoup as soup


def has_pagination(f):
    # Only allow iteration when pagination links were found by the constructor.
    def wrapper(instance):
        if not instance._pages:
            raise ValueError('No pagination found')
        return f(instance)
    return wrapper


class PageScraper:
    def __init__(self, url: str):
        self.url = url
        self._home_page = requests.get(self.url).text
        # Collect the page numbers from the pagination links, if the div exists.
        pagination = soup(self._home_page, 'html.parser').find('div', {'class': 'pagination'})
        self._pages = [a.text for a in pagination.find_all('a')][:-1] if pagination else []

    @property
    def first_page(self):
        return [i.find('h2', {'class': 'table-row-heading'}).text
                for i in soup(self._home_page, 'html.parser').find_all('td', {'class': 'table-row-price'})]

    @has_pagination
    def __iter__(self):
        for p in self._pages:
            page = requests.get(f'{self.url}/page/{p}').text
            yield [i.find('h2', {'class': 'table-row-heading'}).text
                   for i in soup(page, 'html.parser').find_all('td', {'class': 'table-row-price'})]

    @classmethod
    @contextlib.contextmanager
    def feed_link(cls, link):
        # A context manager may only yield once, so gather everything up front.
        results = cls(link)
        try:
            yield [results.first_page, *results]
        except ValueError:
            # No pagination: fall back to the first page only.
            yield results.first_page

The class's constructor finds any pagination links, and the __iter__ method gathers all the pages, but only if pagination links were found. For instance, https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all has no pagination. Thus:

r = PageScraper('https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all')
pages = [i for i in r]

ValueError: No pagination found

The first page's contents, however, can still be retrieved:

print(r.first_page)
['Forest Park MHP', 'Gansett Mobile Home Park', 'Meadowlark Park', 'Indian Cedar Mobile Homes Inc', 'Sherwood Valley Adult Mobile', 'Tripp Mobile Home Park', 'Ramblewood Estates', 'Countryside Trailer Park', 'Village At Wordens Pond', 'Greenwich West Inc', 'Dadson Mobile Home Estates', "Oliveira's Garage", 'Tuckertown Village Clubhouse', 'Westwood Estates']

For URLs with full pagination, all resulting pages can be scraped:

r = PageScraper('https://www.mobilehome.net/mobile-home-park-directory/maine/all')
d = [i for i in r]
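
Each element of d here is one page's list of names. If a single flat list is preferred, the standard library's itertools.chain (not part of the answer itself) can flatten it:

from itertools import chain

names = list(chain.from_iterable(d))  # one flat list of park names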

PageScraper.feed_link performs this check automatically and outputs the first page along with all subsequent results if pagination is found, or just the first page if there is none:

urls = {
    'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
    'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
    'https://www.mobilehome.net/mobile-home-park-directory/vermont/all',
    'https://www.mobilehome.net/mobile-home-park-directory/new-hampshire/all'
}

for url in urls:
    with PageScraper.feed_link(url) as r:
        print(r)
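
For completeness, the same first-page-or-everything fallback can be written without a context manager by catching the ValueError directly. A minimal sketch assuming the PageScraper class above (the scrape_all name is made up for illustration):

def scrape_all(link):
    scraper = PageScraper(link)
    try:
        # the first page plus every paginated page
        return [scraper.first_page, *scraper]
    except ValueError:
        # no pagination found: fall back to the first page only
        return [scraper.first_page]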
answered Oct 05 '22 by Ajax1234

It seems I've found a very robust solution to this problem. The approach is iterative: the scraper first checks whether a next-page URL is available on the current page. If it finds one, it follows that URL and repeats the process; if a page has no pagination link, the loop breaks and the scraper moves on to the next link.

Here it is:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

urls = [
        'https://www.mobilehome.net/mobile-home-park-directory/alaska/all',
        'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
        'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
        'https://www.mobilehome.net/mobile-home-park-directory/vermont/all'
    ]

def get_names(link):
    while True:
        r = requests.get(link)
        soup = BeautifulSoup(r.text, "lxml")
        for items in soup.select("td[class='table-row-price']"):
            name = items.select_one("h2 a").text
            print(name)

        nextpage = soup.select_one(".pagination a.next_page")

        if not nextpage:
            break  # no pagination url here, so move on to the next link

        link = urljoin(link, nextpage.get("href"))

if __name__ == '__main__':
    for url in urls:
        get_names(url)
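
The urljoin call is what makes the loop indifferent to whether the site emits relative or absolute hrefs in the next_page link: it resolves the href against the URL of the page currently being scraped. A quick illustration (the page paths are made up):

from urllib.parse import urljoin

base = "https://www.mobilehome.net/mobile-home-park-directory/maine/all"

# An absolute path replaces the base's path entirely:
print(urljoin(base, "/mobile-home-park-directory/maine/all/page/2"))
# https://www.mobilehome.net/mobile-home-park-directory/maine/all/page/2

# A relative path replaces only the last segment of the base's path:
print(urljoin(base, "page/2"))
# https://www.mobilehome.net/mobile-home-park-directory/maine/page/2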
answered Oct 05 '22 by SIM