Split large files using python

Tags:

python

split

I'm having trouble splitting large files (say, around 10GB). The basic idea is simply to read the lines and write every 40000 lines, say, to a separate file. But there are two ways of "reading" files.

1) The first is to read the WHOLE file at once and turn it into a LIST. But this requires loading the WHOLE file into memory, which is painful for a file this large. (I think I have asked similar questions before.) In Python, the approaches I've tried for reading the WHOLE file at once include:

input1=f.readlines()

input1 = commands.getoutput('zcat ' + file).splitlines(True)

input1 = subprocess.Popen(["cat",file],
                              stdout=subprocess.PIPE,bufsize=1)

Then I can easily group 40000 lines into one file with list[40000:80000] or list[80000:120000]. In other words, the advantage of a list is that we can easily point to specific lines.

2) The second way is to read line by line, processing each line as it is read. Lines that have already been read are not kept in memory. Examples include:

f=gzip.open(file)
for line in f: blablabla...

for line in fileinput.FileInput(fileName):

I'm sure that with gzip.open, f is NOT a list but a file object, and it seems we can only process it line by line. So how can I do this "split" job? How can I point to specific lines of the file object?

Thanks

asked Nov 11 '11 15:11

LookIntoEast

5 Answers

The best solution I have found is the filesplit library (https://pypi.org/project/filesplit/).
You only need to specify the input file, the output folder, and the desired size in bytes of the output files; the library does the rest.

from fsplit.filesplit import Filesplit
fs = Filesplit()
def split_cb(f, s):
    print("file: {0}, size: {1}".format(f, s))

fs.split(file="/path/to/source/file", split_size=900000, output_dir="/pathto/output/dir", callback=split_cb)

answered Nov 10 '22 04:11

rafaoc

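NUM_OF_LINES = 40000
filename = 'myinput.txt'
with open(filename) as fin:
    # Text mode ("w") matches how fin is read; "wb" would reject str in Python 3.
    fout = open("output0.txt", "w")
    for i, line in enumerate(fin):
        fout.write(line)
        if (i + 1) % NUM_OF_LINES == 0:
            # Current chunk is full: close it and start the next one.
            fout.close()
            fout = open("output%d.txt" % (i // NUM_OF_LINES + 1), "w")

    # Note: if the line count is an exact multiple of NUM_OF_LINES,
    # the last file opened above stays empty.
    fout.close()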

answered Nov 10 '22 03:11

yurib

If you don't need an exact number of lines in each file, the readlines() method also accepts a size 'hint' parameter that behaves like this:

If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned.

...so you could write that code something like this:

# assume that an average line is about 80 chars long, and that we want about
# 40K lines in each file.

SIZE_HINT = 80 * 40000

fileNumber = 0
with open("inputFile.txt", "rt") as f:
    while True:
        buf = f.readlines(SIZE_HINT)  # a list of complete lines
        if not buf:
            # we've read the entire file in, so we're done.
            break
        outFile = open("outFile%d.txt" % fileNumber, "wt")
        outFile.writelines(buf)  # write() would reject a list
        outFile.close()
        fileNumber += 1
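
Because the size hint is not a line count, each output file holds only roughly the same number of lines. If an exact count per file matters, one option (not part of the original answer) is itertools.islice, which pulls the next N lines from any line iterator without reading the rest. The sketch below is only an illustration under assumptions: the helper names split_exact and split_gzip, their parameters, and the output naming are all hypothetical.

```python
import gzip
from itertools import islice


def split_exact(lines, lines_per_file=40000, prefix="outFile"):
    """Write consecutive batches of lines_per_file lines to numbered files.

    `lines` is any iterable of text lines (for example an open file object).
    Only one batch is held in memory at a time. Returns the names written.
    The function name and naming scheme are illustrative assumptions.
    """
    it = iter(lines)
    names = []
    while True:
        # Take the next N lines; fewer (or none) once the input runs out.
        batch = list(islice(it, lines_per_file))
        if not batch:
            break
        name = "%s%d.txt" % (prefix, len(names))
        with open(name, "w") as out:
            out.writelines(batch)
        names.append(name)
    return names


def split_gzip(path, lines_per_file=40000, prefix="outFile"):
    """Same as split_exact, but reads a gzip-compressed text file lazily."""
    with gzip.open(path, "rt") as f:  # "rt" yields decoded text lines
        return split_exact(f, lines_per_file, prefix)
```

The same tool also covers "pointing to specific lines": on a freshly opened file object f, list(islice(f, 40000, 80000)) returns the lines a list slice [40000:80000] would, although it still has to read past the first 40000 lines to get there.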

answered Nov 10 '22 05:11

bgporter

For a 10GB file, the second approach is clearly the way to go. Here is an outline of what you need to do:

1. Open the input file.
2. Open the first output file.
3. Read one line from the input file and write it to the output file.
4. Maintain a count of how many lines you've written to the current output file; as soon as it reaches 40000, close the output file, and open the next one.
5. Repeat steps 3-4 until you've reached the end of the input file.
6. Close both files.
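
The steps above can be sketched roughly as follows. This is a minimal sketch rather than a definitive implementation: the function name split_by_lines, the prefix parameter, and the output naming pattern are assumptions made for illustration. It opens each output file only when there is a line to put in it, so no empty trailing file is left behind.

```python
def split_by_lines(in_path, lines_per_file=40000, prefix="output"):
    """Split in_path into files of at most lines_per_file lines each.

    Returns the list of output file names written. The name, parameters
    and naming scheme are illustrative assumptions.
    """
    written = []
    fout = None
    count = 0  # lines written to the current output file
    with open(in_path) as fin:            # open the input file
        for line in fin:                  # read one line at a time
            if fout is None:              # open the next output file lazily
                name = "%s%d.txt" % (prefix, len(written))
                fout = open(name, "w")
                written.append(name)
                count = 0
            fout.write(line)
            count += 1
            if count == lines_per_file:   # chunk is full: close it
                fout.close()
                fout = None
    if fout is not None:                  # close the last, partial chunk
        fout.close()
    return written
```

Calling split_by_lines("myinput.txt") would then produce output0.txt, output1.txt, and so on in the current directory, with only one line held in memory at a time.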

answered Nov 10 '22 05:11

NPE

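import fileinput

chunk_size = 40000
fout = None
for (i, line) in enumerate(fileinput.FileInput(filename)):
    if i % chunk_size == 0:
        # Start of a new chunk: close the previous output file, if any.
        if fout: fout.close()
        fout = open('output%d.txt' % (i // chunk_size), 'w')
    fout.write(line)
if fout: fout.close()  # fout is still None if the input was empty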

answered Nov 10 '22 03:11

Jason Sundram
