What is the fastest method to concatenate multiple files column-wise (within Python)?
Assume that I have two files with 1,000,000,000 lines and ~200 UTF-8 characters per line.
Method 1: Cheating with paste
I could concatenate the two files on a Linux system by using paste in the shell, and I could cheat by calling it with os.system, i.e.:
def concat_files_cheat(file_path, file1, file2, output_path, output):
    file1 = os.path.join(file_path, file1)
    file2 = os.path.join(file_path, file2)
    output = os.path.join(output_path, output)
    if not os.path.exists(output):
        os.system('paste ' + file1 + ' ' + file2 + ' > ' + output)
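If shelling out is acceptable, a slightly safer variant (a sketch of my own, not part of the original question) avoids building the command string by concatenation and uses subprocess instead of os.system, so filenames containing spaces or shell metacharacters are not a problem:

import subprocess

def concat_files_paste(file1, file2, output):
    # Run `paste file1 file2` without a shell and stream stdout into the output file.
    with open(output, 'wb') as fout:
        subprocess.check_call(['paste', file1, file2], stdout=fout)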
Method 2: Using nested context managers with zip:
def concat_files_zip(file_path, file1, file2, output_path, output):
    with open(output, 'wb') as fout:
        with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
            for line1, line2 in zip(fin1, fin2):
                fout.write(line1 + '\t' + line2)
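Note that each line read from fin1 still carries its trailing newline, so line1 + '\t' + line2 puts the tab after that newline instead of between the two columns. A corrected sketch (my own variant, not part of the original question; it drops the path-joining and makes the delimiter configurable):

def concat_files_columns(file1, file2, output, sep='\t'):
    # Paste the two files column-wise, joining matching lines with `sep`.
    # On Python 2, replace zip with itertools.izip to keep the loop lazy.
    with open(output, 'w') as fout, \
            open(file1, 'r') as fin1, \
            open(file2, 'r') as fin2:
        for line1, line2 in zip(fin1, fin2):
            # line2 keeps its own newline, so only line1 needs stripping.
            fout.write(line1.rstrip('\n') + sep + line2)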
Method 3: Using fileinput
Does fileinput iterate through the files in parallel? Or will it iterate through each file sequentially, one after the other?
If it is the former, I would assume it would look like this:
def concat_files_fileinput(file_path, file1, file2, output_path, output):
    with open(output, 'w') as fout, fileinput.input(files=(file1, file2)) as f:
        for line in f:
            line1, line2 = process(line)
            fout.write(line1 + '\t' + line2)
Method 4: Treating them as csv files
with open(output, 'wb') as fout:
    with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
        writer = csv.writer(fout)
        reader1, reader2 = csv.reader(fin1), csv.reader(fin2)
        for line1, line2 in zip(reader1, reader2):
            # each row is a list of fields, so concatenate the lists
            writer.writerow(line1 + line2)
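For the csv variant, the separator is controlled by the delimiter argument of csv.reader and csv.writer; a Python 3-style sketch (my own, not from the original question) that pastes the two files with a configurable output delimiter:

import csv

def concat_files_csv(file1, file2, output, out_delim='\t'):
    # csv.reader/csv.writer accept a single-character `delimiter` argument,
    # so both the input parsing and the output separator are configurable.
    with open(file1, 'r', newline='') as fin1, \
            open(file2, 'r', newline='') as fin2, \
            open(output, 'w', newline='') as fout:
        writer = csv.writer(fout, delimiter=out_delim)
        for row1, row2 in zip(csv.reader(fin1), csv.reader(fin2)):
            # Each row is a list of fields; concatenating the lists pastes the columns.
            writer.writerow(row1 + row2)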
Given the data size, which would be the fastest?
Why would one choose one over the other? Would I lose or add information?
For each method, how would I choose a delimiter other than , or \t?
Are there other ways of achieving the same column-wise concatenation? Are they as fast?
Of the four methods I would take the second, but you have to take care of small details in the implementation. With a few improvements it takes 0.002 seconds, while the original implementation takes about 6 seconds; the file I was working with had 1M rows, but there should not be much difference if the file is 1,000 times bigger, since we are using almost no memory.
Changes from the original implementation: use izip and a generator expression instead of materialising everything with zip, read the input in buffered chunks with readlines, and issue a single buffered write instead of one write call per line.
Example:
from itertools import izip  # Python 2; in Python 3 the built-in zip is already lazy

def concat_iter(file1, file2, output):
    with open(output, 'w', 1024) as fo, \
            open(file1, 'r') as f1, \
            open(file2, 'r') as f2:
        # readlines(1024) returns a batch of lines totalling roughly 1024 bytes,
        # and joining the generator expression results in a single write call.
        fo.write("".join("{}\t{}".format(l1, l2)
                         for l1, l2 in izip(f1.readlines(1024),
                                            f2.readlines(1024))))
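Note that a single readlines(1024) call only returns the first batch of lines (roughly 1024 bytes worth), so the snippet above does not consume the whole file. To stream files of arbitrary size with the same low memory footprint, one possible sketch (my own extension, not the code that was profiled below) iterates over the file objects directly and hands a generator expression to writelines:

from itertools import izip  # Python 2; on Python 3 use the built-in zip

def concat_iter_stream(file1, file2, output):
    with open(output, 'w', 1024) as fo, \
            open(file1, 'r') as f1, \
            open(file2, 'r') as f2:
        # File objects are lazy line iterators, so both inputs are streamed;
        # as in the snippet above, l1 keeps its trailing newline.
        fo.writelines("{}\t{}".format(l1, l2) for l1, l2 in izip(f1, f2))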
Profiling the original solution, we see that the biggest cost is in write and zip (mainly because it does not use iterators and has to handle/process the whole file in memory):
~/personal/python-algorithms/files$ python -m cProfile sol_original.py
10000006 function calls in 5.208 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 5.208 5.208 sol_original.py:1(<module>)
1 2.422 2.422 5.208 5.208 sol_original.py:1(concat_files_zip)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
9999999 1.713 0.000 1.713 0.000 {method 'write' of 'file' objects}
3 0.000 0.000 0.000 0.000 {open}
1 1.072 1.072 1.072 1.072 {zip}
Profiling the improved solution:
~/personal/python-algorithms/files$ python -m cProfile sol1.py
3731 function calls in 0.002 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.002 0.002 sol1.py:1(<module>)
1 0.000 0.000 0.002 0.002 sol1.py:3(concat_iter6)
1861 0.001 0.000 0.001 0.000 sol1.py:5(<genexpr>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1860 0.001 0.000 0.001 0.000 {method 'format' of 'str' objects}
1 0.000 0.000 0.002 0.002 {method 'join' of 'str' objects}
2 0.000 0.000 0.000 0.000 {method 'readlines' of 'file' objects}
1 0.000 0.000 0.000 0.000 {method 'write' of 'file' objects}
3 0.000 0.000 0.000 0.000 {open}
And in Python 3 it is even faster, because zip is already an iterator and we don't need to import anything from itertools.
~/personal/python-algorithms/files$ python3.5 -m cProfile sol2.py
843 function calls (842 primitive calls) in 0.001 seconds
[...]
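The Python 3 source (sol2.py) is not shown here; a sketch of what it might look like, relying on the fact that zip is already lazy in Python 3:

def concat_iter_py3(file1, file2, output):
    # Python 3: zip returns an iterator, so no itertools import is needed.
    with open(output, 'w', 1024) as fo, \
            open(file1, 'r') as f1, \
            open(file2, 'r') as f2:
        fo.write("".join("{}\t{}".format(l1, l2)
                         for l1, l2 in zip(f1.readlines(1024),
                                           f2.readlines(1024))))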
It is also instructive to look at memory consumption and file-system accesses, which confirm what we said above:
$ /usr/bin/time -v python sol1.py
Command being timed: "python sol1.py"
User time (seconds): 0.01
[...]
Maximum resident set size (kbytes): 7120
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 914
[...]
File system outputs: 40
Socket messages sent: 0
Socket messages received: 0
$ /usr/bin/time -v python sol_original.py
Command being timed: "python sol_original.py"
User time (seconds): 5.64
[...]
Maximum resident set size (kbytes): 1752852
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 427697
[...]
File system inputs: 0
File system outputs: 327696
You can replace the for loop with writelines by passing a generator expression to it, and replace zip with izip from itertools in Method 2. This may come close to paste or even surpass it.
from itertools import izip  # Python 2

with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2, open(output, 'wb') as fout:
    fout.writelines(b"{}\t{}".format(*line) for line in izip(fin1, fin2))
If you don't want to embed \t in the format string, you can use repeat from itertools:
    fout.writelines(b"{}{}{}".format(*line) for line in izip(fin1, repeat(b'\t'), fin2))
If the files are of the same length, you can do away with izip.
with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2, open(output, 'wb') as fout:
    fout.writelines(b"{}\t{}".format(line, next(fin2)) for line in fin1)
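On Python 3 these snippets need adjusting, because bytes objects have no format method; a sketch of the same idea (my adaptation, not from the original answer) joins the raw bytes instead:

with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2, open(output, 'wb') as fout:
    # bytes has no .format in Python 3; strip the first line's newline and
    # join the two columns with a tab, keeping the second line's newline.
    fout.writelines(line1.rstrip(b'\n') + b'\t' + line2
                    for line1, line2 in zip(fin1, fin2))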