
Fast way of merging huge files (>= 7 GB) into one

Tags: python, io

I have three huge files, each with just two columns, and I need both. I want to merge them into one file that I can then write to a SQLite database.

I used Python and got the job done, but it took over 30 minutes and hung my system for 10 of those. Is there a faster way using awk or another Unix tool? A faster way within Python would be great too. My code is below:

'''We have tweets of three months in three different files.
Combine them into a single file.'''
import sys

# sys.argv[1:4] are the three input files, sys.argv[4] is the output file
with open(sys.argv[4], 'w') as out:
    for name in sys.argv[1:4]:
        with open(name) as src:
            for line in src:  # copy line by line
                out.write(line)
asked Jan 09 '12 by crazyaboutliv


1 Answer

The standard Unix way to concatenate files is cat. It may not be dramatically faster, but it will be faster.

cat file1 file2 file3 > bigfile

Rather than making a temporary file, you may be able to pipe the output of cat directly into the sqlite3 command-line tool:

cat file1 file2 file3 | sqlite3 database
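
If the end goal is a SQLite database anyway, you could also skip the merged file and insert the rows straight from the three inputs with Python's built-in sqlite3 module. A minimal sketch, assuming the rows are tab-separated with exactly two columns and using a hypothetical table named tweets:

import csv
import sqlite3
import sys

# Hypothetical schema: two text columns in a table named "tweets".
conn = sqlite3.connect('tweets.db')
conn.execute('CREATE TABLE IF NOT EXISTS tweets (col1 TEXT, col2 TEXT)')
for name in sys.argv[1:]:  # the three input files
    with open(name, newline='') as f:
        rows = csv.reader(f, delimiter='\t')
        conn.executemany('INSERT INTO tweets VALUES (?, ?)', rows)
conn.commit()
conn.close()

Doing all the inserts in one transaction (a single commit at the end, as above) makes a big difference in speed when there are millions of rows.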

In Python, you will probably get better performance if you copy the files in blocks rather than line by line. Use file.read(65536) to read 64 KB of data at a time instead of iterating through the files with for, as sketched below.
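
For example, a minimal sketch of block-wise copying (assuming, as in your script, that the output file is the last command-line argument and everything before it is an input):

import sys

with open(sys.argv[-1], 'wb') as out:
    for name in sys.argv[1:-1]:
        with open(name, 'rb') as src:
            while True:
                chunk = src.read(65536)  # 64 KB at a time instead of line by line
                if not chunk:
                    break
                out.write(chunk)

shutil.copyfileobj(src, out, 65536) from the standard library does the same read/write loop for you if you prefer not to write it by hand.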

answered by rjmunro