I am currently pulling .txt files from the folders listed in FileNameList, which works. My main problem is that it is too slow when there are too many files.
I am using this code to print the list of .txt files:
import os

# FileNameList is my list of folders from my path
for filefolder in FileNameList:
    for file in os.listdir(filefolder):
        # Match on the extension; '"txt" in file' would also match names like "txt_notes.csv"
        if file.endswith(".txt"):
            filename = os.path.join(filefolder, file)
            print(filename)
Any help or suggestion on using threads or multiprocessing to make the reading faster is welcome. Thanks in advance.
Threads are faster to start than processes and faster to switch between. All threads in a process share that process's memory, which makes sharing data between them cheap. Creating a new thread in an existing process takes less time than creating a new process.
Both multithreading and multiprocessing allow Python code to run concurrently. Only multiprocessing will allow your code to be truly parallel. However, if your code is IO-heavy (like HTTP requests), then multithreading will still probably speed up your code.
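As a rough illustration of the multithreading option for IO-heavy work, here is a minimal sketch that reads several files through a thread pool. The helper names read_lines and read_all and the default of 8 workers are assumptions made for this example, not something from the question. Whether it is actually faster depends on the storage device.

```python
from concurrent.futures import ThreadPoolExecutor


def read_lines(path):
    """Return the lines of one text file, without trailing newlines."""
    with open(path, "r") as handle:
        return handle.read().splitlines()


def read_all(paths, max_workers=8):
    """Read several files concurrently; results keep the order of `paths`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs read_lines in worker threads and yields results in input order
        return list(pool.map(read_lines, paths))
```

Calling read_all(list_of_paths) returns one list of lines per file, in the same order as the input paths.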
To open multiple files together in Python, open each file with open() and bind each one to a different name. The contents of each file can then be read with the readline() method.
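A small sketch of that pattern, assuming two text files; the function name first_lines and its parameters are made up for this example.

```python
def first_lines(path_a, path_b):
    """Open two files at once and return the first line of each."""
    # Each file gets its own name; both are closed when the with-block ends
    with open(path_a, "r") as file_a, open(path_b, "r") as file_b:
        return file_a.readline(), file_b.readline()
```

Note that readline() keeps the trailing newline, and returns an empty string once the end of the file is reached.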
Multiple threads can also read data from the same FITS file simultaneously, as long as the file was opened independently by each thread. This relies on the operating system to correctly deal with reading the same file by multiple processes.
So you mean there is no way to speed this up? My scenario is to read a bunch of files, then read each line of them and store it in the database.
The first rule of optimization is to ask yourself if you should bother. If your program is run only once or a couple of times, optimizing it is a waste of time.
The second rule is that before you do anything else, measure where the problem lies.
Write a simple program that sequentially reads files, splits them into lines and stuffs those in a database. Run that program under a profiler to see where the program is spending most of its time.
Only then do you know which part of the program needs speeding up.
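One way to do that measurement is the standard library's cProfile module. The sketch below wraps a single call and returns a text report; the helper name profile_call and the choice to show the top 10 entries by cumulative time are assumptions for illustration.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report of the top 10 entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()
```

Passing the function that reads the files and writes to the database would show whether the time goes to listing directories, reading, splitting lines, or the database calls. For a whole script, running python -m cProfile -s cumulative with the script name gives a similar report without code changes.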
Here are some pointers nevertheless.
- mmap
- multiprocessing.Pool, to spread out the reading of multiple files over different cores. But then the data from those files will end up in different processes and would have to be sent back to the parent process using IPC. This has significant overhead for large amounts of data.

Multithreading or multiprocessing is not going to speed this up; your bottleneck is the storage device.
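To make the mmap and multiprocessing.Pool pointers concrete, here is a minimal sketch under some assumptions: each worker memory-maps one file and sends back only a small result (a line count), which keeps the IPC traffic low. The function names count_lines and count_lines_parallel are invented for this example, and if the storage device is the bottleneck this will not help.

```python
import mmap
import os
import sys
from multiprocessing import Pool


def count_lines(path):
    """Count newline characters in one file using a read-only memory map."""
    if os.path.getsize(path) == 0:
        return path, 0  # mmap cannot map an empty file
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            count = 0
            pos = view.find(b"\n")
            while pos != -1:
                count += 1
                pos = view.find(b"\n", pos + 1)
            return path, count


def count_lines_parallel(paths, processes=None):
    """Spread count_lines over worker processes; only small tuples cross IPC."""
    with Pool(processes=processes) as pool:
        return pool.map(count_lines, paths)


if __name__ == "__main__":
    # Usage: pass file paths on the command line
    for path, count in count_lines_parallel(sys.argv[1:]):
        print(path, count)
```

The design choice here is to do the per-file work inside the worker and return a summary, rather than shipping whole file contents back to the parent process.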