Note: Future readers, be aware that this question is old and was formatted and programmed in a rush. The answer given may be useful, but the question and code probably are not.
Hello everyone,
I'm having trouble understanding asyncio and aiohttp and making both work together. Because I don't understand what I'm doing I've run into a problem that I have no idea how to solve.
I'm using 64-bit Windows 10.
The following code returns a list of pages that do not contain "html" in the Content-Type header. It's implemented using asyncio.
import asyncio
import aiohttp

MAXitems = 30

async def getHeaders(url, session, sema):
    async with session:
        async with sema:
            try:
                async with session.head(url) as response:
                    try:
                        if "html" in response.headers["Content-Type"]:
                            return url, True
                        else:
                            return url, False
                    except:
                        return url, False
            except:
                return url, False

def check_urls_without_html(list_of_urls):
    headers_without_html = set()
    while(len(list_of_urls) != 0):
        blockurls = []
        print(len(list_of_urls))
        items = 0
        for num in range(0, len(list_of_urls)):
            if num < MAXitems:
                blockurls.append(list_of_urls[num - items])
                list_of_urls.remove(list_of_urls[num - items])
                items += 1
        loop = asyncio.get_event_loop()
        semaphoreHeaders = asyncio.Semaphore(50)
        session = aiohttp.ClientSession()
        data = loop.run_until_complete(asyncio.gather(*(getHeaders(url, session, semaphoreHeaders) for url in blockurls)))
        for header in data:
            if not header[1]:
                headers_without_html.add(header)
    return headers_without_html

list_of_urls = ['http://www.google.com', 'http://www.reddit.com']
headers_without_html = check_urls_without_html(list_of_urls)

for header in headers_without_html:
    print(header[0])
When I run it with too many URLs (e.g. 2000), it sometimes returns an error like this one:
    data = loop.run_until_complete(asyncio.gather(*(getHeaders(url, session, semaphoreHeaders) for url in blockurls)))
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\base_events.py", line 454, in run_until_complete
    self.run_forever()
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\base_events.py", line 421, in run_forever
    self._run_once()
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\base_events.py", line 1390, in _run_once
    event_list = self._selector.select(timeout)
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\selectors.py", line 323, in select
    r, w, _ = self._select(self._readers, self._writers, [], timeout)
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\selectors.py", line 314, in _select
    r, w, x = select.select(r, w, w, timeout)
ValueError: too many file descriptors in select()
I've read that this problem arises from a Windows restriction. I've also read that not much can be done about it, other than trying to use fewer file descriptors.
I've seen people push thousands of requests with asyncio and aiohttp, but even with my chunking I can't push 30-50 without getting this error.
Is there something fundamentally wrong with my code or is it an inherent problem with Windows? Can it be fixed? Can one increase the limit on the maximum number of allowed file descriptors in select?
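By default, an asyncio loop on Windows can watch only a limited number of sockets: the Winsock default is 64, and CPython's select module raises it to 512. This is a limitation of the underlying select() API call.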
To increase the limit, use ProactorEventLoop with the code below. See the full docs here.
import asyncio
import sys
if sys.platform == 'win32':
    # The proactor loop uses I/O completion ports instead of select()
    loop = asyncio.ProactorEventLoop()
    asyncio.set_event_loop(loop)
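For context, here is a minimal sketch of how that loop selection might sit in a complete script. The names make_event_loop and main are hypothetical, and main is only a stand-in for the real request code. Note that since Python 3.8 the proactor loop is already the default on Windows, so this mostly matters for older versions such as the 3.6 in the traceback.

```python
import asyncio
import sys


def make_event_loop():
    """Return a proactor loop on Windows, the platform default elsewhere."""
    if sys.platform == "win32":
        # ProactorEventLoop is built on I/O completion ports, not select().
        return asyncio.ProactorEventLoop()
    return asyncio.new_event_loop()


async def main():
    # Placeholder for the real work (e.g. gathering the HEAD requests).
    await asyncio.sleep(0)
    return "done"


loop = make_event_loop()
try:
    result = loop.run_until_complete(main())
finally:
    # Closing the loop releases its selector or completion port.
    loop.close()

print(result)
```

Creating the loop once and reusing it for every batch avoids repeating the setup inside the while loop.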
Another solution is to limit the overall concurrency using a semaphore; see the answer provided here. For example, when making 2000 API calls you might not want too many requests open in parallel (they might time out, and the individual call times become harder to see). This will give you:
await gather_with_concurrency(100, *my_coroutines)
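The helper gather_with_concurrency is not defined in this thread. A minimal sketch of what it could look like, assuming it simply wraps each coroutine in a shared asyncio.Semaphore, is shown below. The fake_request coroutine and the counters are hypothetical stand-ins for real HTTP calls, and asyncio.run assumes Python 3.7 or later.

```python
import asyncio


async def gather_with_concurrency(limit, *coros):
    """Run the coroutines like asyncio.gather, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def run_limited(coro):
        # A coroutine only starts running once it holds the semaphore.
        async with semaphore:
            return await coro

    return await asyncio.gather(*(run_limited(c) for c in coros))


in_flight = 0  # number of fake requests currently running
peak = 0       # highest value in_flight ever reached


async def fake_request(i):
    # Hypothetical stand-in for an HTTP call; records the concurrency.
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return i * 2


results = asyncio.run(
    gather_with_concurrency(3, *(fake_request(i) for i in range(10)))
)
print(results, "peak concurrency:", peak)
```

Results come back in the order the coroutines were passed in, as with plain asyncio.gather, while the semaphore keeps the number of simultaneously open requests under the chosen limit.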
I'm having the same problem. Not 100% sure that this is guaranteed to work, but try replacing this:
session = aiohttp.ClientSession()
with this:
connector = aiohttp.TCPConnector(limit=60)
session = aiohttp.ClientSession(connector=connector)
By default, limit is set to 100 (docs), meaning that the client can have 100 simultaneous connections open at a time. As Andrew mentioned, Windows can only have 64 sockets open at a time, so we provide a number lower than 64 instead.
#Add to call area
loop = asyncio.ProactorEventLoop()
asyncio.set_event_loop(loop)