 

Filter files in a very large folder

Tags:

python

file-io

I have a folder with 100k text files. I want to put the files with over 20 lines in another folder. How do I do this in Python? I used os.listdir, but of course there isn't enough memory even to load the filenames into memory. Is there a way to get maybe 100 filenames at a time?

Here's my code:

import os
import shutil

dir = '/somedir/'

def file_len(fname):
    # Count the lines in a file by iterating all the way through it
    f = open(fname, 'r')
    for i, l in enumerate(f):
        pass
    f.close()
    return i + 1

filenames = os.listdir(dir + 'labels/')

i = 0
for filename in filenames:
    flen = file_len(dir + 'labels/' + filename)
    print flen
    if flen > 15:
        i = i + 1
        # Copy the matching original; [:-5] strips the label extension
        shutil.copyfile(dir + 'originals/' + filename[:-5],
                        dir + 'filteredOrigs/' + filename[:-5])
print i

And the output:

Traceback (most recent call last):
  File "filterimage.py", line 13, in <module>
    filenames = os.listdir(dir+'labels/')
OSError: [Errno 12] Cannot allocate memory: '/somedir/'

Here's the modified script:

import os
import shutil
import glob

topdir = '/somedir'

def filelen(fname, many):
    # Stop reading as soon as the line count passes the threshold
    f = open(fname, 'r')
    for i, l in enumerate(f):
        if i > many:
            f.close()
            return True
    f.close()
    return False

path = os.path.join(topdir, 'labels', '*')
i = 0
for filename in glob.iglob(path):
    print filename
    if filelen(filename, 5):
        i += 1
print i

It works on a folder with fewer files, but with the larger folder all it prints is "0"... It works on the Linux server, but prints 0 on the Mac... oh well...

asked Feb 01 '10 by extraeee

2 Answers

You might try glob.iglob, which returns an iterator instead of building the whole list of names in memory:

import glob
import os

path = os.path.join('/somedir', 'labels', '*')
for filename in glob.iglob(path):
    if file_len(filename) > 15:  # file_len is the line counter from the question
        pass  # do stuff

Also, please don't use dir as a variable name: you're shadowing the built-in.

Another major improvement you can introduce is to your file_len function. If you replace it with the following, you'll save a lot of time. Trust me, what you have now is the slowest alternative: it reads every file to the very end just to count lines, while this version returns as soon as the file is known to be long enough:

def many_line(fname, many=15):
    # True as soon as line index `many` exists, i.e. the file has
    # at least many + 1 lines -- no need to read the rest
    with open(fname) as f:
        for i, line in enumerate(f):
            if i >= many:
                return True
    return False
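
For context, here's a minimal sketch of how the pieces could fit together end to end, reusing the originals/ and filteredOrigs/ layout and the [:-5] extension slice from the question (the threshold of 15 is the asker's; treat the exact paths as assumptions):

import glob
import os
import shutil

topdir = '/somedir'

def many_line(fname, many=15):
    # True once the file is known to have more than `many` lines
    with open(fname) as f:
        for i, line in enumerate(f):
            if i >= many:
                return True
    return False

copied = 0
for label in glob.iglob(os.path.join(topdir, 'labels', '*')):
    if many_line(label):
        # Mirror the question's naming: drop the 5-character label suffix
        name = os.path.basename(label)[:-5]
        shutil.copyfile(os.path.join(topdir, 'originals', name),
                        os.path.join(topdir, 'filteredOrigs', name))
        copied += 1
print copied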
answered by SilentGhost


A couple of thoughts. First, you might use the glob module to get smaller groups of files. Second, filtering by line count is going to be very time-consuming, as you have to open every file and count its lines. If you can partition by byte count instead, you can avoid opening the files entirely by using os.stat. If it's crucial that the split happen at 20 lines, you can at least cut out large swaths of files by figuring out the minimum number of characters that a 20-line file of your type will have, and not opening any file smaller than that.
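
As a rough sketch of that pre-filter, assuming (hypothetically) that every line in these files is at least 2 bytes including the newline, so no 20-line file can be smaller than 40 bytes:

import glob
import os

MIN_BYTES = 20 * 2  # hypothetical floor: 20 lines at >= 2 bytes per line

candidates = []
for filename in glob.iglob('/somedir/labels/*'):
    # os.stat reads directory metadata only -- the file is never opened
    if os.stat(filename).st_size >= MIN_BYTES:
        candidates.append(filename)  # big enough to deserve a real line count

print len(candidates)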

answered by jcdyer