Given a directory with a large number of small files (more than 1 million), what's a fast way to remember which files have already been processed (for a database import)?
The first solution I tried was a bash script:
# find all gz files
for f in $(find "$rawdatapath" -name '*.gz'); do
    filename=$(basename "$f")
    # check whether the filename is already contained in the processed list
    onlist=$(grep "$filename" "$processed_files")
    if [[ -z $onlist ]]
    then
        echo "processing, new: $filename"
        # unzip file and import into mongodb
        # write filename into processed list
        echo "$filename" #>> "$processed_files"
    fi
done
For a smaller sample (160k files) this took about 8 minutes (without doing any actual processing).
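Most of those 8 minutes go into grep re-reading the entire processed list once per file. For comparison, a minimal single-pass sketch (assuming GNU find and grep, for -printf and the -vxFf combination) that prints only the not-yet-processed basenames:

# print basenames of all .gz files, then drop the ones already listed, in one grep pass
find "$rawdatapath" -name '*.gz' -printf '%f\n' | grep -vxFf "$processed_files"

Here -F treats the listed names as literal strings, -x requires whole-line matches, and -v inverts the result, so the processed list is read only once instead of once per file.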
Next I tried a Python script:
import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
processed_files_file = os.path.join(path, "processed_files.txt")
processed_files = [line.strip() for line in open(processed_files_file)]

with open(processed_files_file, "a") as pff:
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".gz"):
                if file not in processed_files:
                    pff.write("%s\n" % file)
This runs in less than 2 minutes.
Is there a significantly faster way that I'm overlooking?
Other solutions:
Just use a set, so each membership test is O(1) instead of a scan of the whole processed list:
import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
processed_files_file = os.path.join(path, "processed_files.txt")
processed_files = set(line.strip() for line in open(processed_files_file))

with open(processed_files_file, "a") as pff:
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".gz"):
                if file not in processed_files:
                    pff.write("%s\n" % file)
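As a further variant of the same idea (a sketch only, assuming the .gz files all sit directly in the one directory from the question, with no subdirectories to walk), you can scan the directory once and take a bulk set difference, so there is no per-file membership test at all:

import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
processed_files_file = os.path.join(path, "processed_files.txt")

# names already imported, loaded once into a set
with open(processed_files_file) as f:
    processed = set(line.strip() for line in f)

# single directory scan; assumes no relevant subdirectories
gz_names = {entry.name for entry in os.scandir(path)
            if entry.is_file() and entry.name.endswith(".gz")}

# bulk set difference instead of testing each name individually
new_names = gz_names - processed

with open(processed_files_file, "a") as pff:
    for name in new_names:
        pff.write("%s\n" % name)

The bulk difference and a per-name set lookup cost roughly the same; the real win in both versions is avoiding the repeated linear scans of the processed list.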