I have a script that uses a lot of headless Selenium automation and looped HTTP requests. It's very important that I implement a threading/worker queue for this script. I've done that.
My question is: Should I be using multithreading or multiprocessing, i.e. a thread pool or a process pool? I know that:
"If your program spends more time waiting on file reads or network requests or any type of I/O task, then it is an I/O bottleneck and you should be looking at using threads to speed it up."
and...
"If your program spends more time in CPU based tasks over large datasets then it is a CPU bottleneck. In this scenario you may be better off using multiple processes in order to speed up your program. I say may as it’s possible that a single-threaded Python program may be faster for CPU bound problems, it can depend on unknown factors such as the size of the problem set and so on."
Which is the case when it comes to Selenium? Am I right to think that all CPU-bound tasks related to Selenium will be executed separately via the web driver or would my script benefit from multiple processes?
Or to be more concise: When I thread Selenium in my script, is the web driver limited to 1 CPU core, the same core the script threads are running on?
Multiprocessing runs work in separate processes, each with its own memory, which isolates failures and makes for a more reliable system, whereas multithreading runs several threads inside one process that share its memory. Threads are quick to create and require few resources, whereas processes take longer to start and need more memory.
By formal definition, multithreading refers to the ability of a single process to run multiple threads of execution concurrently, all sharing that process's memory. Multiprocessing refers to the ability of a system to run multiple processes concurrently, typically spread across several processors or cores, where each process can contain one or more threads.
For I/O-bound problems, multithreading is usually cheaper and faster than using multiple processes, but as soon as the work becomes CPU-bound (where, in CPython, threads contend for the global interpreter lock), that answer goes out the window.
No. Multithreading is one way to achieve concurrency, but the two are not the same thing. Multithreading means multiple threads in one process making progress on different tasks at overlapping times, which can improve an application's efficiency.
A web driver is just a driver, and a driver cannot drive without a car.
For example, when you use ChromeDriver to communicate with the browser, you are launching Chrome. ChromeDriver itself does no heavy computation; Chrome does. Both run as processes separate from your Python script, so the operating system is free to schedule them on other CPU cores.
So to clarify, WebDriver is a tool for controlling a browser; it is not itself a browser.
Based on this, you should choose a thread pool rather than a process pool, since from your Python script's point of view this is an I/O-bound problem: the threads mostly wait on the browser and the network.
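To illustrate the point that work launched from a thread runs in its own process, here is a minimal sketch that does not use Selenium at all. It assumes a child Python interpreter as a stand-in for the browser; `launch_child` and `child_pids` are hypothetical names made up for this example. Each worker thread starts a child process and reports that child's PID, which differs from the script's own PID.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def launch_child(_):
    """Start a child Python process (a stand-in for a browser) and return its PID."""
    # The child prints its own process ID on stdout.
    completed = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getpid())"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(completed.stdout.strip())


def child_pids(n_threads=3):
    """Launch one child per worker thread and collect the child PIDs."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(launch_child, range(n_threads)))


if __name__ == "__main__":
    print("script PID:", os.getpid())
    print("child PIDs:", child_pids())
```

Because the children are separate operating-system processes, they are not tied to the core the script's threads run on. A browser started through a web driver should behave the same way in this respect, though that is an inference from this sketch rather than something it measures.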
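A rough sketch of the thread pool approach might look like the following. It assumes `time.sleep` as a stand-in for the time a thread spends waiting on the browser or an HTTP response; `fake_page_load` and `load_all` are hypothetical helpers, and in a real script the sleep would be replaced by the actual Selenium or HTTP call, ideally with one driver instance per thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_page_load(url, wait_seconds=0.2):
    """Hypothetical stand-in for a page load: the thread only waits."""
    time.sleep(wait_seconds)  # simulates waiting on the browser or network
    return f"loaded {url}"


def load_all(urls, max_workers=5):
    """Run fake_page_load for every URL on a thread pool.

    Returns the results in input order and the elapsed wall-clock seconds.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fake_page_load, urls))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(5)]
    results, elapsed = load_all(urls)
    print(results)
    print(f"elapsed: {elapsed:.2f}s")
```

With five workers the five simulated waits overlap, so the total time should be close to one wait rather than the sum of all five. That is the behaviour you would expect for I/O-bound tasks, and it is why a process pool adds little here.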