The tasks from asyncio.gather do not run concurrently

I want to scrape data from a website concurrently, but I found that the following program is NOT executed concurrently.

import asyncio

import requests
from bs4 import BeautifulSoup

async def return_soup(url):
    r = requests.get(url)
    r.encoding = "utf-8"
    soup = BeautifulSoup(r.text, "html.parser")

    future = asyncio.Future()
    future.set_result(soup)
    return future

async def parseURL_async(url):    
    print("Started to download {0}".format(url))
    soup = await return_soup(url)
    print("Finished downloading {0}".format(url))

    return soup

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
t = [parseURL_async(url_1), parseURL_async(url_2)]
loop.run_until_complete(asyncio.gather(*t))

However, this program starts downloading the second page only after the first one finishes. If my understanding is correct, await return_soup(url) waits for the function to complete, and while waiting it hands control back to the event loop, which should let the loop start the second download.

Once the function finally finishes executing, the future instance inside it receives the result value.

But why does this not run concurrently? What am I missing here?

asked May 07 '18 by Blaszard


3 Answers

Using asyncio is different from using threads in that you cannot add it to an existing code base to make it concurrent. Specifically, code that runs in the asyncio event loop must not block - all blocking calls must be replaced with non-blocking versions that yield control to the event loop. In your case, requests.get blocks and defeats the concurrency implemented by asyncio.

To avoid this problem, you need to use an http library that is written with asyncio in mind, such as aiohttp.
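
A minimal sketch of what that might look like with aiohttp, keeping the structure and the url_1/url_2 placeholders from the question (aiohttp recommends sharing one ClientSession across requests):

import asyncio

import aiohttp
from bs4 import BeautifulSoup

async def return_soup(session, url):
    # the await hands control to the event loop while the download is in flight
    async with session.get(url) as response:
        text = await response.text(encoding="utf-8")
    return BeautifulSoup(text, "html.parser")

async def parseURL_async(session, url):
    print("Started to download {0}".format(url))
    soup = await return_soup(session, url)
    print("Finished downloading {0}".format(url))
    return soup

async def main():
    # a single session is shared by both downloads
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(parseURL_async(session, url_1),
                                    parseURL_async(session, url_2))

asyncio.run(main())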

answered by user4815162342


The reason, as mentioned in the other answers, is the lack of library support for coroutines.

As of Python 3.9, though, you can use the function asyncio.to_thread as an alternative for I/O concurrency.

Obviously this is not exactly equivalent, because, as the name suggests, it runs your functions in separate threads as opposed to a single thread in the event loop, but it can be a way to achieve I/O concurrency without relying on proper async support from the library.

In your example the code would be:

import asyncio

import requests
from bs4 import BeautifulSoup

def return_soup(url):
    r = requests.get(url)
    r.encoding = "utf-8"
    return BeautifulSoup(r.text, "html.parser")

def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

async def main():
    # each blocking call runs in its own worker thread; gather awaits both
    result_url_1, result_url_2 = await asyncio.gather(
        asyncio.to_thread(parseURL_async, url_1),
        asyncio.to_thread(parseURL_async, url_2),
    )

asyncio.run(main())

answered by oidualc


I'll add a little more to user4815162342's response. The asyncio framework uses coroutines that must cede control of the thread while they perform a long operation. As user4815162342 mentioned, the requests library doesn't support asyncio. I know of two ways to make this work concurrently. The first is to do what user4815162342 suggested and switch to a library with native support for asynchronous requests. The second is to run this synchronous code in separate threads or processes. The latter is easy because of the run_in_executor function.

import asyncio

import requests
from bs4 import BeautifulSoup

loop = asyncio.get_event_loop()

async def return_soup(url):
    # None runs requests.get in the loop's default thread pool executor
    r = await loop.run_in_executor(None, requests.get, url)
    r.encoding = "utf-8"
    return BeautifulSoup(r.text, "html.parser")

async def parseURL_async(url):    
    print("Started to download {0}".format(url))
    soup = await return_soup(url)
    print("Finished downloading {0}".format(url))

    return soup

t = [parseURL_async(url_1), parseURL_async(url_2)]
loop.run_until_complete(asyncio.gather(*t))

This solution removes some of the benefit of using asyncio, as the long operation will still probably be executed from a fixed-size thread pool, but it's also much easier to start with.
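
If the default pool is a concern, run_in_executor also accepts an explicit executor as its first argument. A minimal sketch, assuming the same setup as above; the fetch helper and the pool size of 10 are arbitrary choices for illustration:

import asyncio
from concurrent.futures import ThreadPoolExecutor

import requests

loop = asyncio.get_event_loop()

# a dedicated pool instead of the default executor; max_workers is an arbitrary choice
executor = ThreadPoolExecutor(max_workers=10)

async def fetch(url):
    # requests.get still blocks, but only its worker thread, not the event loop
    return await loop.run_in_executor(executor, requests.get, url)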

answered by Erik