I've written a script in Python to get some information from a webpage. The code runs flawlessly when it is taken out of asyncio. However, since my script runs synchronously, I wanted to make it asynchronous so that it accomplishes the task in the shortest possible time, with optimum performance and obviously in a non-blocking manner. As I have never worked with the asyncio library before, I'm seriously confused about how to go about it. I've tried to fit my script into the asyncio machinery, but it doesn't seem right. If somebody could lend a helping hand to complete this, I would be really grateful. Thanks in advance. Here is my erroneous code:
import requests
import asyncio
from lxml import html

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        processing_docs(base_link + titles.attrib['href'])

async def processing_docs(base_link):
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)
    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = link + next_page
        processing_docs(page_link)

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()
Upon execution what I see in the console is:
RuntimeWarning: coroutine 'processing_docs' was never awaited
processing_docs(base_link + titles.attrib['href'])
Asynchronous web scraping lets us collect data from a large number of web pages concurrently: we no longer need to wait for the scraping of one page to finish before starting on the next, which saves a lot of time.
You need to call processing_docs() with await.
Replace:
processing_docs(base_link + titles.attrib['href'])
with:
await processing_docs(base_link + titles.attrib['href'])
And replace:
processing_docs(page_link)
with:
await processing_docs(page_link)
Otherwise Python merely creates a coroutine object without ever running it, which is exactly what the RuntimeWarning is complaining about.
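Putting both changes together, the whole script becomes (identical to yours apart from the two awaits):

import requests
import asyncio
from lxml import html

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        # await is required: calling a coroutine only creates it
        await processing_docs(base_link + titles.attrib['href'])

async def processing_docs(base_link):
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)
    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = link + next_page
        # same fix for the recursive pagination call
        await processing_docs(page_link)

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()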
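One caveat worth knowing: requests.get() is a blocking call, so even with the awaits in place the pages are still downloaded strictly one after another, and the event loop never gets a chance to overlap any work. Below is a minimal sketch of one way to get real concurrency while keeping requests: each blocking call is pushed into the default thread pool with loop.run_in_executor(), and the tag pages are fanned out with asyncio.gather(). The fetch() helper is my own addition, not part of your original code; switching to an async HTTP client such as aiohttp would be the fully asynchronous alternative.

import asyncio
import requests
from lxml import html

link = "http://quotes.toscrape.com/"

async def fetch(url):
    # requests is blocking, so hand it to the default thread pool
    loop = asyncio.get_event_loop()
    response = await loop.run_in_executor(None, requests.get, url)
    return html.fromstring(response.text)

async def processing_docs(base_link):
    root = await fetch(base_link)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)
    next_page = root.cssselect("li.next a")
    if next_page:
        # pagination within one tag still proceeds page by page
        await processing_docs(link + next_page[0].attrib['href'])

async def quotes_scraper(base_link):
    tree = await fetch(base_link)
    # scrape all tag pages concurrently instead of one at a time
    await asyncio.gather(*(processing_docs(base_link + titles.attrib['href'])
                           for titles in tree.cssselect("span.tag-item a.tag")))

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()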