
How to process only new (unprocessed) files in Linux

Tags: python, linux, bash

Given a directory with a large number of small files (more than 1 million), what's a fast way to remember which files have already been processed (for a database import)?

The first solution I tried was a bash script:

# find all .gz files (null-delimited so unusual file names survive)
find "$rawdatapath" -name '*.gz' -print0 | while IFS= read -r -d '' f; do
    filename=$(basename "$f")

    # check whether the filename is already on the processed list
    if ! grep -qxF -- "$filename" "$processed_files"; then
        echo "processing, new: $filename"
        # unzip file and import into mongodb

        # write filename into processed list (append disabled for the timing test)
        echo "$filename" #>> "$processed_files"
    fi
done

For a smaller sample (160k files) this took about 8 minutes, without doing any actual processing.

Next I tried a python script:

import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
processed_files_file = os.path.join(path, "processed_files.txt")

# read the names of already-processed files into a list
with open(processed_files_file) as f:
    processed_files = [line.strip() for line in f]

with open(processed_files_file, "a") as pff:
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".gz"):
                if file not in processed_files:
                    # new file: record it as processed
                    pff.write("%s\n" % file)

This runs in less than 2 minutes.

Is there a significantly faster way that I'm overlooking?

Other solutions:

  • Moving processed files to a different location is not convenient, since I use s3sync to download new files.
  • Since the files have a timestamp as part of their name, I could rely on processing them in order and only compare each name to a "last processed" date.
  • Alternatively, I could keep track of the last time a processing run happened and only process files that have been modified since.
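
The last option above (remember when the previous run happened and only pick up files modified since) could look roughly like the sketch below. It is an illustration under assumptions, not a tested solution for this setup: the function names and the state file holding the timestamp are hypothetical, and it assumes the modification times left by s3sync reflect when a file arrived, which I have not verified.

```python
import os


def load_last_run(state_file):
    """Return the timestamp stored in state_file, or 0.0 if it is missing or invalid."""
    try:
        with open(state_file) as f:
            return float(f.read().strip())
    except (OSError, ValueError):
        return 0.0


def save_last_run(state_file, timestamp):
    """Store the start time of a run so the next run can pick up from there."""
    with open(state_file, "w") as f:
        f.write(repr(timestamp))


def find_new_files(root_dir, last_run, suffix=".gz"):
    """Return sorted paths under root_dir ending in suffix with mtime >= last_run."""
    new_files = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            full_path = os.path.join(dirpath, name)
            # one stat call per file; no lookup in a processed list is needed
            if os.path.getmtime(full_path) >= last_run:
                new_files.append(full_path)
    return sorted(new_files)
```

A run would record `time.time()` before scanning, call `find_new_files(path, load_last_run(state_file))`, import the returned files, and only then call `save_last_run(state_file, started)`. A file that lands while a scan is in progress can show up in two consecutive runs, so this assumes the import tolerates duplicates.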

Cilvic asked May 12 '14 19:05


1 Answer

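Just use a set. Testing membership in a list scans it element by element, so checking every file against a long processed list is quadratic overall; a set lookup takes constant time on average.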

import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
processed_files_file = os.path.join(path, "processed_files.txt")

# load the processed names into a set for fast membership tests
with open(processed_files_file) as f:
    processed_files = set(line.strip() for line in f)

with open(processed_files_file, "a") as pff:
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".gz"):
                if file not in processed_files:
                    # new file: record it as processed
                    pff.write("%s\n" % file)
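
To see the effect of the set in isolation, here is a small self-contained comparison. The helper `count_new` and the synthetic file names are made up for illustration; timings depend on the machine, so none are quoted here.

```python
import timeit


def count_new(candidates, processed):
    """Count the candidates that are not in `processed` (a list or a set)."""
    return sum(1 for name in candidates if name not in processed)


# synthetic names standing in for the real .gz files (hypothetical data)
processed_list = ["file_%06d.gz" % i for i in range(5000)]
processed_set = set(processed_list)
candidates = ["file_%06d.gz" % i for i in range(4000, 6000)]

# both containers give the same answer; only the lookup cost differs
assert count_new(candidates, processed_list) == count_new(candidates, processed_set)

if __name__ == "__main__":
    t_list = timeit.timeit(lambda: count_new(candidates, processed_list), number=1)
    t_set = timeit.timeit(lambda: count_new(candidates, processed_set), number=1)
    print("list: %.4fs  set: %.4fs" % (t_list, t_set))
```

The gap grows with the size of the processed list, since each list lookup has to scan more entries while a set lookup does not.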

Daniel answered Sep 28 '22 12:09