Python: slow read & write for millions of small files

Conclusion: It seems that HDF5 is the way to go for my purposes. Basically, "HDF5 is a data model, library, and file format for storing and managing data," and it is designed to handle very large amounts of data. It has a Python module called python-tables (PyTables). (The link is in the answer below.)

HDF5 does the job 1000% better when saving tons and tons of data. Reading and modifying data across 200 million rows is still a pain, though, so that's the next problem to tackle.


I am building a directory tree that has tons of subdirectories and files. There are about 10 million files spread across a hundred thousand directories. Each file is under 32 subdirectories.

I have a Python script that builds this filesystem and reads and writes those files. The problem is that once I reach more than a million files, the read and write methods become extremely slow.

Here's the function I have that reads the contents of a file (the file contains an integer string), adds a certain number to it, then writes it back to the original file.

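import shutil

def addInFile(path, scoreToAdd):
    num = scoreToAdd
    try:
        # Copy the target to a scratch file and read the stored integer.
        shutil.copyfile(path, '/tmp/tmp.txt')
        with open('/tmp/tmp.txt', 'r') as fp:
            num += int(fp.readlines()[0])
    except (OSError, ValueError, IndexError):
        # Missing, empty, or unparsable file: start from scoreToAdd alone.
        pass
    # Write the new total to the scratch file, then copy it back.
    with open('/tmp/tmp.txt', 'w') as fp:
        fp.write(str(num))
    shutil.copyfile('/tmp/tmp.txt', path)
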
  • Relational databases seem too slow for accessing this data, so I opted for a filesystem approach.
  • I previously tried running Linux console commands for this, but that was much slower.
  • I copy the file to a temporary file first, access and modify it there, then copy it back, because I found this was faster than accessing the file directly.
  • Putting all the files into one directory (on a ReiserFS filesystem) caused too much slowdown when accessing the files.
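
For comparison with the temp-file approach described above, here is a minimal sketch of the same read-add-write step done directly on the target file through a single open handle. The name `add_in_file_direct` is hypothetical, and unlike the original function this version raises on unparsable contents instead of silently overwriting them. Whether it is actually faster on a given filesystem is something to measure, not assume.

```python
import os
import tempfile


def add_in_file_direct(path, score_to_add):
    """Add score_to_add to the integer stored in path and return the new total.

    Hypothetical alternative to the temp-file version: one open, no copies.
    """
    # O_CREAT creates the file if it is missing; existing contents are kept.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as fp:
        text = fp.read().strip()
        total = score_to_add + (int(text) if text else 0)
        fp.seek(0)
        fp.write(str(total))
        fp.truncate()  # drop leftover digits if the new number is shorter
    return total


if __name__ == "__main__":
    # Small demonstration on a throwaway file.
    demo_path = os.path.join(tempfile.mkdtemp(), "score.txt")
    add_in_file_direct(demo_path, 5)
    add_in_file_direct(demo_path, 7)
    with open(demo_path) as fp:
        print(fp.read())
```

The `truncate()` call matters when the new value has fewer characters than the old one; without it, stale trailing digits would remain in the file.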

I think the slowdown is caused by the sheer number of files. Running this function 1,000 times used to take less than a second, but now it takes close to a minute.
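
A timing like the one above can be reproduced with a small harness. This is only a sketch: `time_calls` is a hypothetical helper, and the list of argument tuples is whatever sample of paths and scores you want to measure.

```python
import time


def time_calls(func, args_list):
    """Return elapsed wall-clock seconds for calling func once per argument tuple."""
    start = time.perf_counter()
    for args in args_list:
        func(*args)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload; replace with the real function and real paths.
    calls = []
    elapsed = time_calls(lambda path, score: calls.append((path, score)),
                         [("file%d" % i, 1) for i in range(1000)])
    print("1000 calls took %.4f s" % elapsed)
```

Running it against `addInFile` with 1,000 sampled paths at different stages of the build (for example at 100,000 files and again at 1 million) would show where the per-call cost starts to climb.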

How do you suggest I fix this? Do I change my directory tree structure?

All I need is to quickly access each file in this huge pool of files.