Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python3.7 aiohttp is slow

import asyncio
import aiohttp
import socket 

def _create_loop():
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    loop = asyncio.get_event_loop()
    return loop

async def _create_tasks(loop, URLs, func):
    connector = aiohttp.TCPConnector(limit=200, 
                                     limit_per_host=200, 
                                     force_close=True, 
                                     enable_cleanup_closed=True,
                                     family=socket.AF_INET, verify_ssl=False)
    async with aiohttp.ClientSession(loop=loop, connector=connector) as session:
        semaphore = asyncio.Semaphore(200)
        async with semaphore:
            tasks = [asyncio.create_task(func(session, URL)) for URL in URLs]
        return await asyncio.gather(*tasks)

async def _fetch_data_async(session, url):
    async with session.get(url) as response:
        return await response.json()

loop = _create_loop()
tasks = _create_tasks(loop, URL_ls, _fetch_data_async)
results = loop.run_until_complete(tasks)
loop.close()

My API provider limits 200 per request. In fact, I have 1500 URLs to request. So Currently I'm splitting 1500 URL list into 8 so that each request number is less than 200.

I know that It's not the best way to handle this issue. If I send all 1500 URLs at once, the following error occurs.



> task: <Task pending coro=<_get_hist_inner2.<locals>._fetch_data_async() running at <ipython-input-22-f525394caccb>:47> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7ff0286eaf48>()]> cb=[gather.<locals>._done_callback() at /usr/lib/python3.7/asyncio/tasks.py:664]>
Task was destroyed but it is pending!
task: <Task pending coro=<_get_hist_inner2.<locals>._fetch_data_async() running at <ipython-input-22-f525394caccb>:47> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7ff028706048>()]> cb=[gather.<locals>._done_callback() at /usr/lib/python3.7/asyncio/tasks.py:664]>
SSL error in data received
protocol: <asyncio.sslproto.SSLProtocol object at 0x7ff0281859b0>
transport: <_SelectorSocketTransport fd=240 read=polling write=<idle, bufsize=0>>
Traceback (most recent call last):
  File "/usr/lib/python3.7/asyncio/sslproto.py", line 526, in data_received
    ssldata, appdata = self._sslpipe.feed_ssldata(data)
  File "/usr/lib/python3.7/asyncio/sslproto.py", line 207, in feed_ssldata
    self._sslobj.unwrap()
  File "/usr/lib/python3.7/ssl.py", line 767, in unwrap
    return self._sslobj.shutdown()
ssl.SSLError: [SSL: KRB5_S_INIT] application data after close notify (_ssl.c:2609)

Lastly, each API call takes 5 seconds to get the response. but 200 async call takes 1 minute to get the response... I'm not sure if it's normal or is there any potential bottleneck in my code.

Anyhow, I need to call 1500 times very as fast as possible either by optimizing this code or using any available technology. Can anyone help?

like image 227
james Avatar asked Mar 03 '23 04:03

james


1 Answers

There are a few things in your code that can be improved or need to be changed.

1. _create_loop()

This function is not needed nor should you want it. Notice the following:

  • asyncio.run(): "This function runs the passed coroutine, taking care of managing the asyncio event loop and finalizing asynchronous generators."
  • asyncio.get_event_loop(): "Get the current event loop. If there is no current event loop set in the current OS thread and set_event_loop() has not yet been called, asyncio will create a new event loop and set it as the current one."

2. _create_tasks()

asyncio.semaphore

Without getting into details on why I don't believe this function should be created, let's talk about what can be fixed to have it properly running.

The following piece of code is not proper:

async with aiohttp.ClientSession(loop=loop, connector=connector) as session:
        semaphore = asyncio.Semaphore(200)
        async with semaphore:
            tasks = [asyncio.create_task(func(session, URL)) for URL in URLs]
        return await asyncio.gather(*tasks)

Specifically the use of the asyncio.Semaphore. The way this primitive work is by bounding specific futures, not by being initialized when you are creating tasks. In other words, proper use would be the following:

import asyncio

async def func(sem):
    async with sem: # We wait for our Semaphore to release here. 
        print("Hey world!")
        await asyncio.sleep(2)

async def main():
    sem = asyncio.Semaphore(2) # We define the semaphore here. 
    tasks = [func(sem) for _ in range(10)]
    await asyncio.wait(tasks)

asyncio.run(main())

If you run the code above, you will notice "Hello World" (times two) will get printed every two seconds. This is because we are stating: "Only allocate this semaphore to two futures at a time. While they use the resource, do not release the semaphore."

aiohttp.TCPConnector()

Notice on the documentation for aiohttp.TCPConnector() that the limit flag will "limit the amount of simultaneously opened connections." In other words, if you are setting this flag you don't need to create a semaphore to limit the number of connections you do simultaneously.

asyncio.create_task()

Easier to show this with an example. Run the following:

import asyncio

async def func():
    print("In here.")
    return "Hello World!"

async def main():
    tasks = [asyncio.create_task(func()) for i in range(10)]

    # Notice that all of our tasks run before we gather them.
    await asyncio.sleep(5)
    print(await asyncio.gather(*tasks))

async def test():
    tasks = [func() for i in range(10)]

    # Notice that all of our tasks run AFTER we've defined a list of tasks.
    await asyncio.sleep(5)
    print(await asyncio.gather(*tasks))

asyncio.run(main())
print("\nRunning test.\n")
asyncio.run(test())

Notice that when you use create_task() you are actually running the future. To clarify why notice the following:

import asyncio

async def async_func():
    return "Hello World!"


async def main():
    async_tasks = [async_func()]
    print(async_tasks) # > [<coroutine object async_func at 0x10c65f320>]
    await asyncio.gather(*async_tasks)

asyncio.run(main())

When you call an async def function, it actually returns a coroutine object that must called with either the loop or asyncio.


Proper Function

So with the above in mind, we can fix your function. I've taken the liberty to re-write your entire program with the notes above, and some better conventions. What you do and how you edit it is up to you.

import asyncio
import aiohttp

websites = """https://www.youtube.com
https://www.facebook.com
https://www.baidu.com
https://www.yahoo.com
https://www.amazon.com
https://www.wikipedia.org
http://www.qq.com
https://www.google.co.in
https://www.twitter.com
https://www.live.com
http://www.taobao.com
https://www.bing.com
https://www.instagram.com
http://www.weibo.com
http://www.sina.com.cn
https://www.linkedin.com
http://www.yahoo.co.jp
http://www.msn.com
http://www.uol.com.br
https://www.google.de
http://www.yandex.ru
http://www.hao123.com
https://www.google.co.uk
https://www.reddit.com
https://www.ebay.com
https://www.google.fr
https://www.t.co
http://www.tmall.com
http://www.google.com.br
https://www.360.cn
http://www.sohu.com
https://www.amazon.co.jp
http://www.pinterest.com
https://www.netflix.com
http://www.google.it
https://www.google.ru
https://www.microsoft.com
http://www.google.es
https://www.wordpress.com
http://www.gmw.cn
https://www.tumblr.com
http://www.paypal.com
http://www.blogspot.com
http://www.imgur.com
https://www.stackoverflow.com
https://www.aliexpress.com
https://www.naver.com
http://www.ok.ru
https://www.apple.com
http://www.github.com
http://www.chinadaily.com.cn
http://www.imdb.com
https://www.google.co.kr
http://www.fc2.com
http://www.jd.com
http://www.blogger.com
http://www.163.com
http://www.google.ca
https://www.whatsapp.com
https://www.amazon.in
http://www.office.com
http://www.tianya.cn
http://www.google.co.id
http://www.youku.com
https://www.example.com
http://www.craigslist.org
https://www.amazon.de
http://www.nicovideo.jp
https://www.google.pl
http://www.soso.com
http://www.bilibili.com
http://www.dropbox.com
http://www.xinhuanet.com
http://www.outbrain.com
http://www.pixnet.net
http://www.alibaba.com
http://www.alipay.com
http://www.chrome.com
http://www.booking.com
http://www.googleusercontent.com
http://www.google.com.au
http://www.popads.net
http://www.cntv.cn
http://www.zhihu.com
https://www.amazon.co.uk
http://www.diply.com
http://www.coccoc.com
https://www.cnn.com
http://www.bbc.co.uk
https://www.twitch.tv
https://www.wikia.com
http://www.google.co.th
http://www.go.com
https://www.google.com.ph
http://www.doubleclick.net
http://www.onet.pl
http://www.googleadservices.com
http://www.accuweather.com
http://www.googleweblight.com
http://www.answers.yahoo.com"""

async def get(url, session):
    try:
        async with session.get(url=url) as response:
            resp = await response.read()
            print("Successfully got url {} with resp of length {}.".format(url, len(resp)))
    except Exception as e:
        print("Unable to get url {} due to {}.".format(url, e.__class__))

async def main(urls):
    connector = aiohttp.TCPConnector()
    session = aiohttp.ClientSession(connector=connector)

    ret = await asyncio.gather(*[get(url, session) for url in urls])
    print("Finalized all. Return is a list of len {} outputs.".format(len(ret)))

    await session.close()

urls = websites.split("\n")
asyncio.run(main(urls))

Outputs:

Successfully got url http://www.google.com.br with resp of length 12475.
Successfully got url http://www.google.es with resp of length 12432.
Successfully got url http://www.google.it with resp of length 12450.
Successfully got url https://www.t.co with resp of length 0.
Successfully got url https://www.example.com with resp of length 1256.
Successfully got url https://www.google.fr with resp of length 12478.
Successfully got url https://www.google.de with resp of length 12463.
Successfully got url http://www.googleusercontent.com with resp of length 1561.
Successfully got url https://www.google.co.in with resp of length 11867.
Successfully got url https://www.google.co.uk with resp of length 11890.
Successfully got url https://www.google.ru with resp of length 12445.
Successfully got url https://www.bing.com with resp of length 97269.
Successfully got url https://www.facebook.com with resp of length 128029.
Successfully got url http://www.google.ca with resp of length 11803.
Successfully got url http://www.google.co.id with resp of length 12476.
Successfully got url https://www.google.co.kr with resp of length 12484.
Successfully got url https://www.instagram.com with resp of length 37967.
Successfully got url https://www.tumblr.com with resp of length 75321.
Successfully got url https://www.apple.com with resp of length 62405.
Successfully got url https://www.wikipedia.org with resp of length 76489.
Successfully got url https://www.whatsapp.com with resp of length 80930.
Successfully got url http://www.googleweblight.com with resp of length 0.
Successfully got url https://www.microsoft.com with resp of length 179346.
Successfully got url https://www.google.pl with resp of length 12447.
Successfully got url https://www.linkedin.com with resp of length 82074.
Successfully got url http://www.google.com.au with resp of length 11844.
Successfully got url http://www.googleadservices.com with resp of length 1561.
Successfully got url https://www.twitter.com with resp of length 327282.
Successfully got url http://www.163.com with resp of length 498893.
Successfully got url http://www.google.co.th with resp of length 12492.
Successfully got url https://www.stackoverflow.com with resp of length 117754.
Successfully got url http://www.accuweather.com with resp of length 268.
Successfully got url http://www.pinterest.com with resp of length 54089.
Successfully got url http://www.uol.com.br with resp of length 364068.
Successfully got url https://www.google.com.ph with resp of length 11874.
Successfully got url https://www.youtube.com with resp of length 301882.
Successfully got url https://www.wikia.com with resp of length 285727.
Successfully got url https://www.amazon.com with resp of length 545564.
Successfully got url https://www.wordpress.com with resp of length 87837.
Successfully got url http://www.cntv.cn with resp of length 3200.
Successfully got url https://www.live.com with resp of length 36964.
Successfully got url http://www.gmw.cn with resp of length 120034.
Successfully got url http://www.chrome.com with resp of length 161590.
Successfully got url https://www.netflix.com with resp of length 495818.
Successfully got url http://www.tianya.cn with resp of length 7888.
Successfully got url http://www.imgur.com with resp of length 4209.
Successfully got url https://www.twitch.tv with resp of length 89364.
Successfully got url http://www.msn.com with resp of length 47196.
Successfully got url https://www.cnn.com with resp of length 1136910.
Successfully got url http://www.doubleclick.net with resp of length 127443.
Successfully got url https://www.naver.com with resp of length 198837.
Successfully got url https://www.yahoo.com with resp of length 536726.
Successfully got url http://www.sohu.com with resp of length 205715.
Successfully got url http://www.office.com with resp of length 90082.
Successfully got url http://www.popads.net with resp of length 14548.
Successfully got url http://www.qq.com with resp of length 235514.
Successfully got url http://www.blogspot.com with resp of length 94478.
Successfully got url https://www.amazon.in with resp of length 449774.
Successfully got url http://www.imdb.com with resp of length 347893.
Successfully got url http://www.alibaba.com with resp of length 153300.
Successfully got url https://www.baidu.com with resp of length 158941.
Successfully got url https://www.amazon.co.jp with resp of length 435298.
Successfully got url https://www.aliexpress.com with resp of length 60278.
Successfully got url http://www.xinhuanet.com with resp of length 176985.
Successfully got url http://www.blogger.com with resp of length 94478.
Successfully got url https://www.amazon.co.uk with resp of length 672572.
Successfully got url http://www.paypal.com with resp of length 44020.
Successfully got url http://www.github.com with resp of length 133317.
Successfully got url http://www.dropbox.com with resp of length 271286.
Successfully got url https://www.amazon.de with resp of length 438965.
Successfully got url http://www.soso.com with resp of length 5816.
Successfully got url https://www.ebay.com with resp of length 301959.
Successfully got url http://www.answers.yahoo.com with resp of length 96590.
Successfully got url http://www.fc2.com with resp of length 34544.
Successfully got url https://www.reddit.com with resp of length 656718.
Successfully got url http://www.go.com with resp of length 733683.
Successfully got url http://www.chinadaily.com.cn with resp of length 102734.
Successfully got url http://www.craigslist.org with resp of length 59273.
Successfully got url http://www.bilibili.com with resp of length 95028.
Successfully got url http://www.zhihu.com with resp of length 45853.
Successfully got url http://www.yandex.ru with resp of length 114932.
Successfully got url https://www.360.cn with resp of length 74085.
Successfully got url http://www.tmall.com with resp of length 227590.
Successfully got url http://www.bbc.co.uk with resp of length 326671.
Successfully got url http://www.jd.com with resp of length 18105.
Successfully got url http://www.outbrain.com with resp of length 48191.
Successfully got url http://www.pixnet.net with resp of length 6295.
Successfully got url http://www.diply.com with resp of length 762463.
Successfully got url http://www.booking.com with resp of length 445064.
Successfully got url http://www.nicovideo.jp with resp of length 106691.
Successfully got url http://www.onet.pl with resp of length 778449.
Successfully got url http://www.yahoo.co.jp with resp of length 18107.
Successfully got url http://www.hao123.com with resp of length 304041.
Successfully got url http://www.alipay.com with resp of length 21561.
Successfully got url http://www.ok.ru with resp of length 138096.
Successfully got url http://www.coccoc.com with resp of length 46725.
Successfully got url http://www.taobao.com with resp of length 393906.
Successfully got url http://www.sina.com.cn with resp of length 546781.
Successfully got url http://www.weibo.com with resp of length 96263.
Successfully got url http://www.youku.com with resp of length 582773.
Finalized all. Return is a list of len 100 outputs.

In only a few seconds.

like image 124
Felipe Avatar answered Mar 09 '23 06:03

Felipe