 

Split large text file(around 50GB) into multiple files

I would like to split a large text file (around 50 GB in size) into multiple files. The data in the file looks like this [x = any integer between 0-9]:

xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
...............
...............

There might be a few billion lines in the file, and I would like to write, for example, 30/40 million lines per file. I guess the steps would be:

  • I have to open the file,
  • then, using readline(), read the file line by line and write each line to a new file at the same time,
  • and as soon as it hits the maximum number of lines, create another file and start writing again.

I'm wondering how to put all these steps together in a memory-efficient and fast way. I've seen some examples on Stack Overflow, but none of them fully does what I need. I would really appreciate it if anyone could help me out.
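For reference, here is a rough sketch of the rotate-on-line-count idea I have in mind (the file name and the 30-million-line limit are just placeholders; I have not tried this on the real 50 GB file):

lines_per_file = 30 * 10**6        # placeholder: lines per output file
source = 'large_file.txt'          # placeholder name for the 50 GB input

out = None
with open(source) as src:
    for i, line in enumerate(src):
        if i % lines_per_file == 0:
            # reached the limit (or the very first line): rotate to a new output file
            if out:
                out.close()
            out = open('{}.part{}'.format(source, i // lines_per_file), 'w')
        out.write(line)
if out:
    out.close()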

asked Mar 30 '14 by saz

3 Answers

This working solution uses the split command available in the shell. Since the author has already accepted the possibility of a non-Python solution, please do not downvote.

First, I created a test file with 1000M entries (15 GB):

awk 'BEGIN{for (i = 0; i < 1000000000; i++) {print "123.123.123.123"} }' > t.txt

Then I used split:

split --lines=30000000 --numeric-suffixes --suffix-length=2 t.txt t

It took 5 minutes to produce a set of 34 small files named t00-t33. The first 33 files are 458 MB each, and the last one, t33, is 153 MB.

answered Nov 14 '22 by Andrey

from itertools import chain, islice

def chunks(iterable, n):
    """chunks('ABCDE', 2) => AB CD E"""
    iterable = iter(iterable)
    while True:
        try:
            # store one line in memory,
            # chain it to an iterator over the rest of the chunk
            first = next(iterable)
        except StopIteration:
            return  # input exhausted; stop cleanly (required on Python 3.7+)
        yield chain([first], islice(iterable, n - 1))

l = 30 * 10**6                  # lines per split file
file_large = 'large_file.txt'
with open(file_large) as bigfile:
    for i, lines in enumerate(chunks(bigfile, l)):
        file_split = '{}.{}'.format(file_large, i)
        with open(file_split, 'w') as f:
            f.writelines(lines)
answered Nov 14 '22 by log0


I would use the Unix utility split if it is available to you and your only task is to split the file. Here is, however, a pure Python solution:

import contextlib

file_large = 'large_file.txt'
l = 30 * 10**6  # lines per split file
with contextlib.ExitStack() as stack:
    fd_in = stack.enter_context(open(file_large))
    for i, line in enumerate(fd_in):
        if not i % l:
            # start a new output file every l lines
            file_split = '{}.{}'.format(file_large, i // l)
            fd_out = stack.enter_context(open(file_split, 'w'))
        fd_out.write(line)  # the line already ends with '\n'

If all of your lines contain four 3-digit numbers (so every line has the same byte length) and you have multiple cores available, then you can exploit file seeks and run multiple processes, as sketched below.
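A minimal sketch of that idea (assuming every line is exactly 16 bytes, i.e. "xxx.xxx.xxx.xxx" plus a newline; the pool size, chunk size and file names are only placeholders):

import os
from multiprocessing import Pool

LINE_LEN = 16                  # assumed fixed record length: "xxx.xxx.xxx.xxx\n"
LINES_PER_FILE = 30 * 10**6    # placeholder chunk size
SOURCE = 'large_file.txt'      # placeholder input name

def write_chunk(chunk_no):
    # Copy one chunk of LINES_PER_FILE lines by seeking straight to its byte offset.
    start = chunk_no * LINES_PER_FILE * LINE_LEN
    with open(SOURCE, 'rb') as src, open('{}.{}'.format(SOURCE, chunk_no), 'wb') as dst:
        src.seek(start)
        remaining = LINES_PER_FILE * LINE_LEN
        while remaining > 0:
            buf = src.read(min(remaining, 1 << 20))  # copy in 1 MB blocks
            if not buf:
                break                                # last chunk may be short
            dst.write(buf)
            remaining -= len(buf)

if __name__ == '__main__':
    total = os.path.getsize(SOURCE)
    chunk_bytes = LINES_PER_FILE * LINE_LEN
    n_chunks = -(-total // chunk_bytes)              # ceiling division
    with Pool(4) as pool:                            # 4 workers; adjust to your core count
        pool.map(write_chunk, range(n_chunks))

This only works because fixed-width lines make the byte offset of line n trivially computable; with variable-length lines you would have to scan for newlines instead.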

answered Nov 14 '22 by tommy.carstensen