I am trying to learn Python and its concepts. I wrote some code to play with multithreading, but I noticed there is no difference in execution time between the multi-threaded and single-threaded versions.
The machine that runs the script has 4 cores/threads.
import json
import nltk

def get_tokens(file_name, map):
    print(file_name)
    counter = 0
    with open(file_name, 'r', encoding='utf-8-sig') as f:
        for line in f:
            item = json.loads(line)
            if 'spot' in item and item['sid'] == 4663:
                counter += 1
                if counter == 500:
                    break
                tokens = nltk.word_tokenize(item['spot'], language='english')
                for token in tokens:
                    if token not in map:
                        map[token] = 1
                    else:
                        map[token] = map[token] + 1
import time
from concurrent.futures import ThreadPoolExecutor

start_time = time.time()
map = dict()
with ThreadPoolExecutor(max_workers=3) as executor:
    for file in FileProcessing.get_files_in_directory('D:\\Raw Data'):
        future = executor.submit(FileProcessing.get_tokens, file, map)
end_time = time.time()
print("Elapsed time was %g seconds" % (end_time - start_time))
Each file in Raw Data is bigger than 25 MB, so I expected a difference between the two versions, but there is none. Why? Am I making a mistake in the code, or in my understanding of multithreading?
Every thread needs some overhead and system resources, so threading can also slow down performance. Another problem is so-called "thread explosion", when more threads are created than there are cores on the system. And having threads sit idle waiting for other threads to finish is the worst case for multithreading.
In general: multithreading may improve an application's throughput by using more CPU power, but that depends on a lot of factors. If those conditions are not met, throughput will not differ much between a single-threaded and a multi-threaded application.
CPython (the standard implementation of Python) does not support running threads in parallel on different CPUs. So you can indeed have multiple threads, but they will all share one CPU, and you will see no speed improvement for CPU-bound work (you would for I/O-bound work).
The reason for that is the infamous GIL (global interpreter lock). Python's core is not thread-safe because of the way it does garbage collection, so it uses a lock, which means threads accessing Python objects run one after the other.
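A quick way to see this: a minimal sketch (not the asker's code) that times a pure-CPU loop run twice sequentially versus on two threads. On CPython the threaded version is no faster, and often slightly slower, because only one thread can hold the GIL at a time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def countdown(n):
    # Pure-CPU busy loop; a thread running this holds the GIL the whole time.
    while n > 0:
        n -= 1

N = 5_000_000

start = time.time()
countdown(N)
countdown(N)
serial = time.time() - start

start = time.time()
with ThreadPoolExecutor(max_workers=2) as executor:
    executor.submit(countdown, N)
    executor.submit(countdown, N)  # exiting the with-block waits for both
threaded = time.time() - start

print("serial: %.2fs  threaded: %.2fs" % (serial, threaded))
```

On a typical CPython build both timings come out about the same, which is exactly the symptom in the question.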
In your particular case, you are doing some I/O and some processing. There is significant overhead to multiprocessing in Python, and it is not compensated for by the gain in I/O speed (the time to read your files is probably small compared to the time to process them).
If you need real multithreading, look at Cython (not to be confused with CPython) and its "nogil" blocks, or C extensions, or the multiprocessing module.
Python has the Global Interpreter Lock (GIL), which prevents two threads of execution in the same Python process from executing at the same time. Therefore, while Python threads give you multiple control paths within a single process, those control paths cannot execute simultaneously on a multi-core machine. An alternative is the Python multiprocessing framework, which actually creates separate processes and has them communicate via inter-process communication (IPC). You can also try ProcessPoolExecutor, which spawns multiple processes, so you won't have an issue with the GIL.
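The ProcessPoolExecutor route could be sketched like this (a sketch, not the asker's exact code: `get_file_token_counts` and the `line.split()` tokenizer are hypothetical stand-ins for `get_tokens` and `nltk.word_tokenize`). The key change from the threaded version is that worker processes cannot mutate a shared dict, so each worker returns its own `Counter` and the parent merges them:

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def get_file_token_counts(file_name):
    # Runs in a worker process; builds and returns its own Counter
    # because processes do not share memory the way threads do.
    counts = Counter()
    with open(file_name, encoding='utf-8-sig') as f:
        for line in f:
            for token in line.split():  # stand-in for nltk.word_tokenize
                counts[token] += 1
    return counts

def count_all(file_names, workers=3):
    totals = Counter()
    with ProcessPoolExecutor(max_workers=workers) as executor:
        # executor.map sends each file to a worker and yields its result;
        # merging happens in the parent process only.
        for counts in executor.map(get_file_token_counts, file_names):
            totals.update(counts)
    return totals
```

Returning results instead of sharing state also removes the data race the original code has, where three threads update one dict without a lock.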
The GIL comments are correct, but this code is more likely IO-bound than CPU-bound.
Even if you used something like C or Java, you're still reading the files over a serial interface, so unless your JSON parsing can't keep up with the disk's 100-300 MB/s, you won't see a performance benefit from threading.
@DevShark did say you'd see a benefit for I/O-bound processes, but it's more complicated than that. That tends to apply to concurrent network connections with high latency. In this case, you'd be I/O-bound at the disk, not at a remote process (you're not waiting for a remote response), so parallelism won't help.
If you're CPU-bound, have real threads, and are using a spinning disk, you still have to tune the buffer sizes carefully. The 10 ms seek time can kill you, so you need to use buffered reads with buffers large enough that each read takes much longer than a seek if you want high disk throughput. With a 100 MB/s disk and a 10 ms seek time, I'd use 10 MB buffers, but that still means you're spending about 10% of your time seeking. I'd also coordinate my reads so only one reader reads at a time.
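As a sketch of that buffered-read idea (the function name is hypothetical, and the 10 MB default mirrors the rule of thumb above):

```python
def read_in_chunks(path, buf_size=10 * 1024 * 1024):
    """Stream a file sequentially in large chunks.

    A 10 MB read takes ~100 ms on a 100 MB/s spinning disk, so the
    ~10 ms seek cost is amortized to roughly 10% of the total time.
    """
    total = 0
    with open(path, 'rb', buffering=buf_size) as f:
        for chunk in iter(lambda: f.read(buf_size), b''):
            total += len(chunk)  # replace with real per-chunk processing
    return total
```

With a pool of workers, you'd additionally serialize the `f.read` calls (e.g. behind a lock) so only one reader hits the disk at a time.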
The issue is whether the code can be improved through threading. If the code is serial, one thing happening after another in a straight line, then it will run the same no matter how many threads you have. However, if the code can branch off and do action A and action B at the same time, it will benefit. Looking closer at your code, there appears to be no such branching, at least none that I am aware of.