
Asyncio, the tasks are not finished properly, because of sentinel issues

I'm trying to do some web scraping, as a learning exercise, using a predefined number of workers.

I'm using None as a sentinel to break out of the while loop and stop the worker.

The speed of each worker varies, and all workers are closed before the last url is passed to gather_search_links to get the links.

I tried to use asyncio.Queue, but I had less control than with deque.

async def gather_search_links(html_sources, detail_urls):
    while True:
        if not html_sources:
            await asyncio.sleep(0)
            continue

        data = html_sources.pop()
        if data is None:
            html_sources.appendleft(None)
            break
        data = BeautifulSoup(data, "html.parser")
        result = data.find_all("div", {"data-component": "search-result"})
        for record in result:
            atag = record.h2.a
            url = f'{domain_url}{atag.get("href")}'
            detail_urls.appendleft(url)
        print("apended data", len(detail_urls))
        await asyncio.sleep(0)


async def get_page_source(urls, html_sources):
    client = httpx.AsyncClient()
    while True:
        if not urls:
            await asyncio.sleep(0)
            continue

        url = urls.pop()
        print("url", url)
        if url is None:
            urls.appendleft(None)
            break

        response = await client.get(url)
        html_sources.appendleft(response.text)
        await asyncio.sleep(8)
    html_sources.appendleft(None)


async def navigate(urls):
    for i in range(2, 7):
        url = f"https://www.example.com/?page={i}"
        urls.appendleft(url)
        await asyncio.sleep(0)
    nav_urls.appendleft(None)


loop = asyncio.get_event_loop()
nav_html = deque()
nav_urls = deque()
products_url = deque()

navigate_workers = [asyncio.ensure_future(navigate(nav_urls)) for _ in range(1)]
page_source_workers = [asyncio.ensure_future(get_page_source(nav_urls, nav_html)) for _ in range(2)]
product_urls_workers = [asyncio.ensure_future(gather_search_links(nav_html, products_url)) for _ in range(1)]
workers = asyncio.wait([*navigate_workers, *page_source_workers, *product_urls_workers])

loop.run_until_complete(workers)
user3541631 asked Oct 26 '22 17:10



1 Answer

I'm a bit of a newbie, so this could be completely wrong, but I believe the issue is that all three functions, navigate(), gather_search_links(), and get_page_source(), are asynchronous tasks that can complete in any order. However, your checks for empty deques, and your use of appendleft so that None is the leftmost item in each deque, look like they should guard against this. At first glance the code looks like it should run correctly.

I think the issue arises at this line:

workers = asyncio.wait([*navigate_workers, *page_source_workers, *product_urls_workers])

According to this post, asyncio.wait does not run these tasks in the order they are listed above; the event loop interleaves them, switching whenever one of them awaits. Again, the checks at the top of gather_search_links and get_page_source make each stage wait for the previous one, so this code should work if there is only a single worker for each function. With multiple workers per function, I can see None not ending up as the leftmost item in your deques: one worker can push its None while another is still waiting on a response, and that response is then queued after the sentinel. A print statement at the end of each function showing the contents of your deques would be useful in troubleshooting this.
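To illustrate the multi-worker case, here is a minimal, network-free sketch. It keeps the same deque and sentinel control flow as get_page_source and gather_search_links, but replaces the HTTP call with asyncio.sleep; the names fetch_worker and parse_worker, the page labels, and the delays are all made up for the demonstration.

```python
import asyncio
from collections import deque


async def fetch_worker(urls, html_sources, delay):
    """Same control flow as get_page_source, with the HTTP call replaced by a sleep."""
    while True:
        if not urls:
            await asyncio.sleep(0)
            continue
        url = urls.pop()
        if url is None:
            urls.appendleft(None)   # leave the sentinel for the other worker
            break
        await asyncio.sleep(delay)  # stands in for `await client.get(url)`
        html_sources.appendleft(f"html for {url}")
    html_sources.appendleft(None)   # this worker tells the next stage it is done


async def parse_worker(html_sources, parsed):
    """Same control flow as gather_search_links, without the actual parsing."""
    while True:
        if not html_sources:
            await asyncio.sleep(0)
            continue
        data = html_sources.pop()
        if data is None:
            break                   # stops at the FIRST sentinel it sees
        parsed.append(data)


async def main():
    urls, html_sources, parsed = deque(), deque(), []
    for i in range(2, 5):
        urls.appendleft(f"page={i}")
    urls.appendleft(None)
    await asyncio.gather(
        fetch_worker(urls, html_sources, 0.01),  # fast worker
        fetch_worker(urls, html_sources, 0.2),   # slow worker, still busy when the fast one quits
        parse_worker(html_sources, parsed),
    )
    return parsed, list(html_sources)


if __name__ == "__main__":
    parsed, leftover = asyncio.run(main())
    print("parsed:", parsed)
    print("left in html_sources:", leftover)
```

With these delays the fast worker reaches the sentinel and pushes None downstream while the slow worker is still "fetching" page=3, so the parser exits early and page=3 is left unprocessed in html_sources. That matches the symptom in the question, where the workers stop before the last URL is handled.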

I guess my major question would be: why run these tasks asynchronously if you then have to write extra code because the steps must complete in sequence? To get the HTML you must first have the URL. To scrape the HTML you must first have the HTML. What benefit does asyncio provide here? All three make more sense to me as synchronous tasks: get URL, get HTML, scrape HTML, in that order.

EDIT: It occurred to me that the main benefit of asynchronous code here is that you don't want to have to wait on each individual URL to respond back synchronously when you fetch the HTML from them. What I would do in this situation is gather my URLs synchronously first, and then combine the get and scrape functions into a single asynchronous function, which would be your only asynchronous function. Then you don't need a sentinel or a check for a "None" value or any of that extra code and you get the full value of the asynchronous fetch. You could then store your scraped data in a list (or deque or whatever) of futures. This would simplify your code and provide you with the fastest possible scrape time.

LAST EDIT: Here's my quick and dirty rewrite. I liked your code so I decided to do my own spin. I have no idea if it works; I'm not a Python person.

import asyncio
from collections import deque

import httpx
from bs4 import BeautifulSoup

# Collected product URLs; filled in by fetchHTMLandParse
products_urls = deque()


# Get or build URLs from config
def navigate():
    urls = deque()
    for i in range(2, 7):
        url = f"https://www.example.com/?page={i}"
        urls.appendleft(url)
    return urls


# Asynchronously fetch and parse data for a single URL
async def fetchHTMLandParse(client, url):
    response = await client.get(url)
    data = BeautifulSoup(response.text, "html.parser")
    result = data.find_all("div", {"data-component": "search-result"})
    for record in result:
        atag = record.h2.a
        # domain_url was defined elsewhere
        products_urls.appendleft(f'{domain_url}{atag.get("href")}')


async def main():
    nav_urls = navigate()
    # Share one client across all requests and close it when done
    async with httpx.AsyncClient() as client:
        # One task per URL; gather waits until every task has finished
        await asyncio.gather(*(fetchHTMLandParse(client, url) for url in nav_urls))


asyncio.run(main())
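The rewrite above starts one task per URL, while the question asks for a predefined number of workers. One way to cap concurrency without going back to sentinels is asyncio.Semaphore. The following is only a sketch: fake_fetch is a hypothetical stand-in for the HTTP call, MAX_WORKERS is an assumed limit, and the stats dictionary exists only to show how many coroutines were fetching at once.

```python
import asyncio

MAX_WORKERS = 2  # assumed cap, mirroring the two page-source workers in the question


async def fake_fetch(url):
    # Hypothetical stand-in for `await client.get(url)`; no network access.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def fetch_and_parse(url, limiter, stats):
    async with limiter:  # at most MAX_WORKERS coroutines are inside this block at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            html = await fake_fetch(url)
        finally:
            stats["active"] -= 1
    # Real parsing would go here; this sketch just returns the page length.
    return url, len(html)


async def main():
    urls = [f"https://www.example.com/?page={i}" for i in range(2, 7)]
    limiter = asyncio.Semaphore(MAX_WORKERS)
    stats = {"active": 0, "peak": 0}
    # gather returns results in the same order as `urls`; no sentinel is needed
    results = await asyncio.gather(
        *(fetch_and_parse(url, limiter, stats) for url in urls)
    )
    return results, stats["peak"]


if __name__ == "__main__":
    results, peak = asyncio.run(main())
    print(results)
    print("peak concurrent fetches:", peak)
```

Each coroutine returns its own result instead of writing to a shared deque, so nothing has to signal "no more data": gather simply finishes when the last task does.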
TheFunk answered Nov 07 '22 20:11
