I just played around a bit with Python and threads, and realized that even in a multithreaded script, DNS requests are blocking. Consider the following script:
from threading import Thread
import socket

class Connection(Thread):
    def __init__(self, name, url):
        Thread.__init__(self)
        self._url = url
        self._name = name

    def run(self):
        print("Connecting...", self._name)
        try:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.setblocking(0)
            s.connect((self._url, 80))
        except socket.gaierror:
            pass  # not interested in it
        print("finished", self._name)

if __name__ == '__main__':
    conns = []
    # all invalid addresses to see how they fail / check times
    conns.append(Connection("conn1", "www.2eg11erdhrtj.com"))
    conns.append(Connection("conn2", "www.e2ger2dh2rtj.com"))
    conns.append(Connection("conn3", "www.eg2de3rh1rtj.com"))
    conns.append(Connection("conn4", "www.ege2rh4rd1tj.com"))
    conns.append(Connection("conn5", "www.ege52drhrtj1.com"))
    for conn in conns:
        conn.start()
I don't know exactly how long the timeout is, but when running this the following happens:
So my only guess is that this has to do with the GIL? The threads clearly do not perform their task concurrently; only one connection is attempted at a time.
Does anyone know a way around this?
(asyncore doesn't help, and I'd prefer not to use Twisted for now.) Isn't it possible to get this simple little thing done with Python?
Greetings, Tom
I am on Mac OS X. I just let my friend run this on Linux, and he actually gets the results I wished to get: his socket.connect() calls return immediately, even in a non-threaded environment. And even when he sets the sockets to blocking, with a timeout of 10 seconds, all his threads finish at the same time.
Can anyone explain this?
Python does support multithreading: the threading library works, and the GIL does not prevent you from creating threads. What CPython's GIL does prevent is two threads in the same process executing Python bytecode at the same time, so threads cannot run in parallel on multiple cores; they run concurrently, with the interpreter switching between them, typically during I/O-bound operations.
Both multithreading and multiprocessing allow Python code to run concurrently, but only multiprocessing gives true parallelism. If your code is I/O-heavy (like HTTP requests or DNS lookups), multithreading will still usually speed it up. Libraries that perform computationally heavy tasks, like numpy, scipy and pytorch, use C-based implementations under the hood that can release the GIL and so make use of multiple cores.
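To illustrate the I/O-bound case, here is a minimal sketch (using time.sleep() as a stand-in for a blocking network call) showing that five threads waiting on I/O overlap rather than run back to back:

```python
import threading
import time

def io_task(results, i):
    # Simulate a blocking I/O call (e.g. a network read) with sleep();
    # sleep() releases the GIL, so other threads can run in the meantime.
    time.sleep(0.2)
    results[i] = "done"

results = {}
start = time.monotonic()
threads = [threading.Thread(target=io_task, args=(results, i)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start

# Five 0.2 s "I/O waits" overlap, so total wall time is ~0.2 s, not ~1.0 s.
print(len(results), elapsed)
```

If the waits were serialized (as the blocking DNS lookups above are), the total time would instead be roughly the sum of the individual waits.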
On some systems, getaddrinfo is not thread-safe. Python believes that some such systems are FreeBSD, OpenBSD, NetBSD, OSX, and VMS. On those systems, Python maintains a lock specifically for the netdb (i.e. getaddrinfo and friends).
So if you can't switch operating systems, you'll have to use a different (thread-safe) resolver library, such as Twisted's.
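Conceptually, that netdb lock turns concurrent lookups into serial ones. A minimal sketch of the effect, with a hypothetical slow_resolve() standing in for a non-thread-safe getaddrinfo():

```python
import threading
import time

netdb_lock = threading.Lock()  # models CPython's per-process netdb lock

def slow_resolve(name):
    # Hypothetical stand-in for getaddrinfo(): because the C function is
    # not thread-safe on some platforms, the interpreter wraps every call
    # in one lock, so the calls are serialized.
    with netdb_lock:
        time.sleep(0.1)  # pretend the lookup takes 100 ms
        return name

start = time.monotonic()
threads = [threading.Thread(target=slow_resolve, args=("host%d" % i,))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start

# With the lock held across each call, five lookups take ~5 * 0.1 s
# even though they run in five threads.
print(elapsed)
```

This is exactly the serial behaviour observed in the question: the threads start concurrently, but the resolver only admits one of them at a time.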
Send DNS requests asynchronously using Twisted Names:
import sys

from twisted.internet import reactor
from twisted.internet import defer
from twisted.names import client
from twisted.python import log

def process_names(names):
    log.startLogging(sys.stderr, setStdout=False)

    def print_results(results):
        for name, (success, result) in zip(names, results):
            if success:
                print("%s -> %s" % (name, result))
            else:
                print("error: %s failed. Reason: %s" % (name, result),
                      file=sys.stderr)

    d = defer.DeferredList(
        [client.getHostByName(name) for name in names], consumeErrors=True)
    d.addCallback(print_results)
    d.addErrback(defer.logError)
    d.addBoth(lambda _: reactor.stop())

reactor.callWhenRunning(process_names, """
google.com
www.2eg11erdhrtj.com
www.e2ger2dh2rtj.com
www.eg2de3rh1rtj.com
www.ege2rh4rd1tj.com
www.ege52drhrtj1.com
""".split())

reactor.run()
If it's suitable, you could use the multiprocessing module to enable process-based parallelism:
import multiprocessing
import socket

NUM_PROCESSES = 5

def get_url(url):
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setblocking(0)
        s.connect((url, 80))
    except socket.gaierror:
        pass  # not interested in it
    return 'finished ' + url

def main(url_list):
    pool = multiprocessing.Pool(NUM_PROCESSES)
    for output in pool.imap_unordered(get_url, url_list):
        print(output)

if __name__ == "__main__":
    main("""
www.2eg11erdhrtj.com
www.e2ger2dh2rtj.com
www.eg2de3rh1rtj.com
www.ege2rh4rd1tj.com
www.ege52drhrtj1.com
""".split())