
Reading multiple files using threads/multiprocessing

I am currently pulling .txt files from the path list FileNameList, which is working. But my main problem is that it is too slow when there are too many files.

I am using this code to print the list of txt files:

import os

# FileNameList is my set of folders from my path
for filefolder in FileNameList:
    for file in os.listdir(filefolder):
        if file.endswith(".txt"):
            filename = os.path.join(filefolder, file)
            print(filename)

Any help or suggestion on using threads/multiprocessing to make the reading faster will be accepted. Thanks in advance.

asked Aug 11 '15 by Syntax Rommel


People also ask

Which is faster: multiprocessing or multithreading?

Threads are faster to start than processes and also faster at task-switching. All threads share a process's memory pool, which is very beneficial. It also takes less time to create a new thread in an existing process than to create a new process.

Can I use multithreading and multiprocessing together?

Both multithreading and multiprocessing allow Python code to run concurrently. Only multiprocessing will allow your code to be truly parallel. However, if your code is IO-heavy (like HTTP requests), then multithreading will still probably speed up your code.
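
As a minimal sketch of that I/O-bound case (the file names below are just placeholders), threads can overlap the waiting on disk reads:

from concurrent.futures import ThreadPoolExecutor

def read_file(path):
    # Each thread blocks on disk I/O, so the others can run in the meantime.
    with open(path) as f:
        return f.read()

paths = ["a.txt", "b.txt", "c.txt"]  # placeholder file names
with ThreadPoolExecutor(max_workers=4) as pool:
    contents = list(pool.map(read_file, paths))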

How do I read multiple files at once in Python?

Steps used to open multiple files together in Python: both files are opened with the open() method, using a different name for each. The contents of the files can then be accessed using the readline() method.
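
For example (the file names are placeholders):

with open("first.txt") as f1, open("second.txt") as f2:
    # Each file has its own name, so both can be read in the same block.
    print(f1.readline())
    print(f2.readline())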

Can multiple threads read the same file in Python?

Multiple threads can also read data from the same FITS file simultaneously, as long as the file was opened independently by each thread. This relies on the operating system to correctly deal with reading the same file by multiple processes.


2 Answers

So you mean there is no way to speed this up? Because my scenario is to read a bunch of files, then read each line of them and store them in the database.

The first rule of optimization is to ask yourself if you should bother. If your program is run only once or a couple of times, optimizing it is a waste of time.

The second rule is that before you do anything else, measure where the problem lies:

Write a simple program that sequentially reads files, splits them into lines and stuffs those in a database. Run that program under a profiler to see where the program is spending most of its time.

Only then do you know which part of the program needs speeding up.
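
For instance, a minimal sketch of such a measurement, assuming an SQLite database and a placeholder list of file paths, could be run under cProfile to see where the time goes:

import cProfile
import sqlite3

def load_all(filenames):
    # Sequentially read each file, split it into lines and stuff those into SQLite.
    conn = sqlite3.connect("lines.db")
    conn.execute("CREATE TABLE IF NOT EXISTS lines (content TEXT)")
    for name in filenames:
        with open(name) as f:
            conn.executemany("INSERT INTO lines (content) VALUES (?)",
                             ((line,) for line in f))
    conn.commit()
    conn.close()

filenames = ["a.txt", "b.txt"]  # placeholder paths
cProfile.run("load_all(filenames)", sort="cumulative")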


Here are some pointers nevertheless.

  • Speeding up the reading of files can be done using mmap.
  • You could use multiprocessing.Pool to spread out the reading of multiple files over different cores. But then the data from those files will end up in different processes and would have to be sent back to the parent process using IPC. This has significant overhead for large amounts of data (see the sketch after this list).
  • In the CPython implementation of Python, only one thread at a time can be executing Python bytecode. While the actual reading from files isn't inhibited by that, processing the results is. So it is questionable if threads would offer improvement.
  • Stuffing the lines into a database will probably always be a major bottleneck, because that is where everything comes together. How much of a problem this is depends on the database. Is it in-memory or on disk, does it allow multiple programs to update it simultaneously, et cetera.
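
A rough sketch of the multiprocessing.Pool approach mentioned above, assuming the same FileNameList of folders as in the question (the database step is left as a stub):

import os
from multiprocessing import Pool

def read_lines(filename):
    # Runs in a worker process; the lines are sent back to the parent via IPC.
    with open(filename) as f:
        return f.readlines()

def collect_txt_files(folders):
    # Same directory walk as in the question.
    for folder in folders:
        for name in os.listdir(folder):
            if name.endswith(".txt"):
                yield os.path.join(folder, name)

if __name__ == "__main__":
    FileNameList = ["folder1", "folder2"]  # placeholder folders
    files = list(collect_txt_files(FileNameList))
    with Pool() as pool:
        for lines in pool.imap_unordered(read_lines, files):
            pass  # e.g. hand the lines to the database here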
answered Oct 03 '22 by Roland Smith


Multithreading or multiprocessing is not going to speed this up; your bottleneck is the storage device.

answered Oct 03 '22 by Cyphase