
Web Scraping with Python in combination with asyncio

I've written a script in Python to get some information from a webpage. The code itself runs flawlessly when it is taken out of asyncio. However, since my script runs synchronously, I wanted to switch it to an asynchronous process so that it accomplishes the task in the shortest possible time, with optimum performance and obviously not in a blocking manner. As I've never worked with the asyncio library before, I'm seriously confused about how to go about it. I've tried to fit my script into the asyncio process, but it doesn't seem right. If somebody could lend a helping hand to complete this, I would be really grateful. Thanks in advance. Here is my erroneous code:

import requests ; from lxml import html
import asyncio

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
        response = requests.get(base_link)
        tree = html.fromstring(response.text)
        for titles in tree.cssselect("span.tag-item a.tag"):
            processing_docs(base_link + titles.attrib['href'])

async def processing_docs(base_link):
        response = requests.get(base_link).text
        root = html.fromstring(response)
        for soups in root.cssselect("div.quote"):
            quote = soups.cssselect("span.text")[0].text
            author = soups.cssselect("small.author")[0].text
            print(quote, author)


        next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
        if next_page:
            page_link = link + next_page
            processing_docs(page_link)

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()

Upon execution what I see in the console is:

RuntimeWarning: coroutine 'processing_docs' was never awaited
  processing_docs(base_link + titles.attrib['href'])
SIM asked Sep 05 '17



1 Answer

You need to call processing_docs() with await.

Replace:

processing_docs(base_link + titles.attrib['href'])

with:

await processing_docs(base_link + titles.attrib['href'])

And replace:

processing_docs(page_link)

with:

await processing_docs(page_link)

Otherwise, calling the coroutine just creates a coroutine object that is never scheduled on the event loop, which is exactly what the RuntimeWarning: coroutine 'processing_docs' was never awaited is telling you.
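
With those two changes the warning goes away, but bear in mind that the pages are still fetched one after another, because requests.get() itself blocks the event loop. If you also want the tag pages scraped concurrently, one option (a rough sketch, not the only way to do it) is to push the blocking requests.get() calls into a thread pool with loop.run_in_executor() and schedule the tag pages together with asyncio.gather(). The fetch() helper and the extra loop argument below are inventions of this sketch rather than anything from your code; the selectors and start URL are taken straight from the question:

import asyncio

import requests
from lxml import html

link = "http://quotes.toscrape.com/"

async def fetch(url, loop):
    # requests.get() is blocking, so run it in the default thread pool executor
    response = await loop.run_in_executor(None, requests.get, url)
    return html.fromstring(response.text)

async def quotes_scraper(base_link, loop):
    tree = await fetch(base_link, loop)
    tag_links = [base_link + a.attrib['href'] for a in tree.cssselect("span.tag-item a.tag")]
    # Schedule every tag page at once instead of awaiting them one after another
    await asyncio.gather(*(processing_docs(url, loop) for url in tag_links))

async def processing_docs(page_link, loop):
    root = await fetch(page_link, loop)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)

    next_page = root.cssselect("li.next a")
    if next_page:
        # The recursive call still has to be awaited
        await processing_docs(link + next_page[0].attrib['href'], loop)

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link, loop))
loop.close()

An async HTTP client such as aiohttp would let you drop the thread pool entirely, but the version above keeps requests and lxml exactly as you are already using them.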

James Wilson answered Oct 27 '22