Why is there no execution time difference between multithreading and singlethreading

Tags:

I am trying to learn python language and it's concept. I wrote some code to play with multithreading. But i notice that there is no execution time difference between multi and single threading.

The machine which is to run script has 4 core/thread.

 def get_tokens(file_name,map):
    print(file_name)
    counter = 0
    with open(file_name,'r',encoding='utf-8-sig') as f:
        for line in f:
            item = json.loads(line,encoding='utf-8')
            if 'spot' in item and item['sid'] == 4663:
                counter+=1
                if counter == 500:
                    break
                tokens = nltk.word_tokenize(item['spot'],language='english')
                for token in tokens:
                    if token not in map:
                        map[token] = 1
                    else:
                        map[token] = map[token] + 1;

 start_time = time.time()
 map = dict();
 with ThreadPoolExecutor(max_workers=3) as executor:
    for file in FileProcessing.get_files_in_directory('D:\\Raw Data'):
        future = executor.submit(FileProcessing.get_tokens, file, map)

 end_time = time.time()
 print("Elapsed time was %g seconds" % (end_time - start_time))

Each file's size in the Raw Data is bigger than 25 mb. So i think must be difference between them. But there is not. Why ? Am i doing a mistake in code or multihreading concept ?

829

asked Jun 24 '15 22:06

RockOnGom

4 Answers

CPython (the standard implementation of Python) does not support multithreading on different CPUs. So you can indeed have multiple threads, but they will all run on the same CPU, and you will not have speed improvements for CPU-bound processes (you would have for I/O bound processes).

The reason for that is the infamous GIL (global interpreter lock). Python's core is not thread safe because of the way it does garbage collection, so it uses a lock, which means threads accessing python objects run one after the other.

In your particular case, you are doing some I/O and some processing. There is significant overhead when doing multiprocessing in python, which is not compensated by the gain in speed in your I/O (the time to read your files is probably small compared to the time to process them).

If you need to do real multithreading look at Cython (not to be confused with CPython) and "no_gil", or c-extensions, or the multiprocessing module.

164

answered Oct 12 '22 04:10

DevShark

Python has the Global Interpretor Lock (GIL), that prevents two threads of execution in the same python process from executing at the same time. Therefore, while python threads give you multiple control paths within a single process those multiple control paths cannot execute simultaneous on a multi-core machine. An alternative is to use the python multiprocessing framework which will actually create separate processes and then have the processes communicate via inter-process communication (IPC). You can also try using the ProcessPoolExecutor which will spawn multiple processes and therefore you won't have an issue with the GIL

answered Oct 12 '22 03:10

missimer

The GIL comments are correct, but this code is more likely IO-bound than CPU-bound.

Even if you used something like C or Java, you're still reading files over a serial interface, so unless you can't process 100-300 MB/s of JSON, you won't see a performance benefit from threading.

@DevShark did say you'd see a benefit for IO-bound processes, but it's more complicated than that. That tends to be more for concurrent network connections that are high-latency. In this case, you'd be IO-bound at the disk, not the process (you're not waiting for a remote response), so parallelism won't help.

If you're CPU-bound, have real threads, and are using a spinning disk, you still have to tune the buffer sizes carefully. The 10ms seek time can kill you, so you need to use buffered reads with buffer sizes much larger than that if you want high disk throughput. With a 100MB/s disk with 10ms seek time, I'd use 10MB buffers, but that still means you're spending 10% of your time seeking on the disk. I'd also coordinate my reads so only one reader reads at a time.

answered Oct 12 '22 03:10

David Ehrmann

The issue is can the code be improved through threading. If the code is serial and one thing happens after another in a straight line then the code will run the same no matter how many threads you have. However if the code can branch off and do action a and action b at the same time then it will. Looking closer at your code it appears that there is no branching, at least non that I am aware of.

answered Oct 12 '22 04:10

trinityalps

Related questions
                            
                                iterating over shared keys in two dictionaries
                            
                                Random non-uniform distribution with given proportion
                            
                                python: Convert bytearray to ctypes Struct
                            
                                How can openpyxl write list data in sheet?
                            
                                Pandas plot x or index_column in descending order
                            
                                What does extra parameter mean in this for-loop in Python?
                            
                                Rotate meshgrid with numpy
                            
                                Python cannot import DataFrame
                            
                                Pandas: how to plot with custom colors and symbols
                            
                                What is this piece of code doing, python
                            
                                Using py.test --cov from inside setup.py pytest.main
                            
                                "stale association proxy, parent object has gone out of scope" with Flask-SQLAlchemy
                            
                                flask blueprints list routes
                            
                                Error - "SQLite DateTime type only accepts Python " "datetime and date objects as input."
                            
                                Python: How to check if two lists are not empty
                            
                                understanding asyncio already running forever loop and pending tasks
                            
                                How to remove multiple values from an array at once
                            
                                how to create a 3D height map in python
                            
                                How to join 2 lists of dicts in python?
                            
                                Python Logging : how to add a custom field to LogRecord, and register a global callback to sets it's value

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why is there no execution time difference between multithreading and singlethreading

Tags:

python

multithreading

python-3.4