I've written a script in Python to get some information from a webpage. The code runs flawlessly when it is taken out of asyncio. However, since my script runs synchronously, I wanted to make it asynchronous so that it accomplishes the task in the shortest possible time, with optimum performance and obviously in a non-blocking manner. As I have never worked with the asyncio library before, I'm seriously confused about how to go about it. I've tried to fit my script into the asyncio machinery, but it doesn't seem right. If somebody could lend a helping hand to complete this, I would be really grateful. Thanks in advance. Here is my erroneous code:
import requests
import asyncio
from lxml import html

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        processing_docs(base_link + titles.attrib['href'])

async def processing_docs(base_link):
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)
    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = link + next_page
        processing_docs(page_link)

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()
Upon execution what I see in the console is:
RuntimeWarning: coroutine 'processing_docs' was never awaited
processing_docs(base_link + titles.attrib['href'])
Asynchronous web scraping lets us collect data from a large number of web pages concurrently: we no longer need to wait for the scraping of one page to finish before starting on the next, which saves a lot of time.
You need to call processing_docs() with await.
Replace:
processing_docs(base_link + titles.attrib['href'])
with:
await processing_docs(base_link + titles.attrib['href'])
And replace:
processing_docs(page_link)
with:
await processing_docs(page_link)
Otherwise Python merely creates a coroutine object without ever running it, which is exactly what the RuntimeWarning is complaining about.
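Putting both changes together, the whole script becomes (identical to yours apart from the two awaits):

import requests
import asyncio
from lxml import html

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        # await is required: calling a coroutine only creates it
        await processing_docs(base_link + titles.attrib['href'])

async def processing_docs(base_link):
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)
    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = link + next_page
        # same fix for the recursive pagination call
        await processing_docs(page_link)

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()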
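One caveat worth knowing: requests.get() is a blocking call, so even with the awaits in place the pages are still downloaded strictly one after another, and the event loop never gets a chance to overlap any work. Below is a minimal sketch of one way to get real concurrency while keeping requests: each blocking call is pushed into the default thread pool with loop.run_in_executor(), and the tag pages are fanned out with asyncio.gather(). The fetch() helper is my own addition, not part of your original code; switching to an async HTTP client such as aiohttp would be the fully asynchronous alternative.

import asyncio
import requests
from lxml import html

link = "http://quotes.toscrape.com/"

async def fetch(url):
    # requests is blocking, so hand it to the default thread pool
    loop = asyncio.get_event_loop()
    response = await loop.run_in_executor(None, requests.get, url)
    return html.fromstring(response.text)

async def processing_docs(base_link):
    root = await fetch(base_link)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)
    next_page = root.cssselect("li.next a")
    if next_page:
        # pagination within one tag still proceeds page by page
        await processing_docs(link + next_page[0].attrib['href'])

async def quotes_scraper(base_link):
    tree = await fetch(base_link)
    # scrape all tag pages concurrently instead of one at a time
    await asyncio.gather(*(processing_docs(base_link + titles.attrib['href'])
                           for titles in tree.cssselect("span.tag-item a.tag")))

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()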