Splitting large text file into smaller text files by line numbers using Python

Tags:

I have a text file say really_big_file.txt that contains:

line 1
line 2
line 3
line 4
...
line 99999
line 100000

I would like to write a Python script that divides really_big_file.txt into smaller files with 300 lines each. For example, small_file_300.txt to have lines 1-300, small_file_600 to have lines 301-600, and so on until there are enough small files made to contain all the lines from the big file.

I would appreciate any suggestions on the easiest way to accomplish this using Python

597

asked Apr 29 '13 23:04

walterfaye

6 Answers

lines_per_file = 300
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:
            if smallfile:
                smallfile.close()
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()

answered Oct 12 '22 22:10

Matt Anderson

Using itertools grouper recipe:

from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

n = 300

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}'.format(i * n), 'w') as fout:
            fout.writelines(g)

The advantage of this method as opposed to storing each line in a list, is that it works with iterables, line by line, so it doesn't have to store each small_file into memory at once.

Note that the last file in this case will be small_file_100200 but will only go until line 100000. This happens because fillvalue='', meaning I write out nothing to the file when I don't have any more lines left to write because a group size doesn't divide equally. You can fix this by writing to a temp file and then renaming it after instead of naming it first like I have. Here's how that can be done.

import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):
        with tempfile.NamedTemporaryFile('w', delete=False) as fout:
            for j, line in enumerate(g, 1): # count number of lines in group
                if line is None:
                    j -= 1 # don't count this line
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))

This time the fillvalue=None and I go through each line checking for None, when it occurs, I know the process has finished so I subtract 1 from j to not count the filler and then write the file.

answered Oct 12 '22 22:10

jamylak

I do this a more understandable way and using less short cuts in order to give you a further understanding of how and why this works. Previous answers work, but if you are not familiar with certain built-in-functions, you will not understand what the function is doing.

Because you posted no code I decided to do it this way since you could be unfamiliar with things other than basic python syntax given that the way you phrased the question made it seem as though you did not try nor had any clue as how to approach the question

Here are the steps to do this in basic python:

First you should read your file into a list for safekeeping:

my_file = 'really_big_file.txt'
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)

Second, you need to set up a way of creating the new files by name! I would suggest a loop along with a couple counters:

outer_count = 1
line_count = 0
sorting = True
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"

Third, inside that loop you need some nested loops that will save the correct rows into an array:

hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1

Last thing, again in your first loop you need to write the new file and add your last counter increment so your loop will go through again and write a new file

outer_count += 1
with open(file_name,'w') as next_file:
    for row in hold_new_lines:
        next_file.write(row)

note: if the number of lines is not divisible by 300, the last file will have a name that does not correspond to the last file line.

It is important to understand why these loops work. You have it set so that on the next loop, the name of the file that you write changes because you have the name dependent on a changing variable. This is a very useful scripting tool for file accessing, opening, writing, organizing etc.

In case you could not follow what was in what loop, here is the entirety of the function:

my_file = 'really_big_file.txt'
sorting = True
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)

answered Oct 12 '22 20:10

Ryan Saxe

lines_per_file = 300  # Lines on each small file
lines = []  # Stores lines not yet written on a small file
lines_counter = 0  # Same as len(lines)
created_files = 0  # Counting how many small files have been created

with open('really_big_file.txt') as big_file:
    for line in big_file:  # Go throught the whole big file
        lines.append(line)
        lines_counter += 1
        if lines_counter == lines_per_file:
            idx = lines_per_file * (created_files + 1)
            with open('small_file_%s.txt' % idx, 'w') as small_file:
                # Write all lines on small file
                small_file.write('\n'.join(stored_lines))
            lines = []  # Reset variables
            lines_counter = 0
            created_files += 1  # One more small file has been created
    # After for-loop has finished
    if lines_counter:  # There are still some lines not written on a file?
        idx = lines_per_file * (created_files + 1)
        with open('small_file_%s.txt' % idx, 'w') as small_file:
            # Write them on a last small file
            small_file.write('n'.join(stored_lines))
        created_files += 1

print '%s small files (with %s lines each) were created.' % (created_files,
                                                             lines_per_file)

answered Oct 12 '22 20:10

juliomalegria

import csv
import os
import re

MAX_CHUNKS = 300


def writeRow(idr, row):
    with open("file_%d.csv" % idr, 'ab') as file:
        writer = csv.writer(file, delimiter=',', quotechar='\"', quoting=csv.QUOTE_ALL)
        writer.writerow(row)

def cleanup():
    for f in os.listdir("."):
        if re.search("file_.*", f):
            os.remove(os.path.join(".", f))

def main():
    cleanup()
    with open("large_file.csv", 'rb') as results:
        r = csv.reader(results, delimiter=',', quotechar='\"')
        idr = 1
        for i, x in enumerate(r):
            temp = i + 1
            if not (temp % (MAX_CHUNKS + 1)):
                idr += 1
            writeRow(idr, x)

if __name__ == "__main__": main()

answered Oct 12 '22 21:10

Varun

with open('/really_big_file.txt') as infile:
    file_line_limit = 300
    counter = -1
    file_index = 0
    outfile = None
    for line in infile.readlines():
        counter += 1
        if counter % file_line_limit == 0:
            # close old file
            if outfile is not None:
                outfile.close()
            # create new file
            file_index += 1
            outfile = open('small_file_%03d.txt' % file_index, 'w')
        # write to file
        outfile.write(line)

answered Oct 12 '22 21:10

Guiyi Yuan

Related questions
                            
                                TypeError: string indices must be integers while parsing JSON using Python?
                            
                                Django REST Framework and FileField absolute url
                            
                                How to send a POST request using django?
                            
                                How do I return HTTP error code without default template in Tornado?
                            
                                JSON Schema: validate a number-or-null value
                            
                                Dump stacktraces of all active Threads
                            
                                How to log error to file, and not fail on exception
                            
                                How to resolve "iterator should return strings, not bytes"
                            
                                Getting TypeError: '(slice(None, None, None), 0)' is an invalid key
                            
                                How can I make the PyDev editor selectively ignore errors?
                            
                                Place image over PDF
                            
                                Maximal Length of List to Shuffle with Python random.shuffle?
                            
                                How can I programmatically change the background in Mac OS X?
                            
                                UnicodeDecodeError : 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
                            
                                Why does 1 == True but 2 != True in Python? [duplicate]
                            
                                Python Matplotlib Boxplot Color
                            
                                Printing BFS (Binary Tree) in Level Order with Specific Formatting
                            
                                python: most elegant way to intersperse a list with an element
                            
                                How to "scale" a numpy array?
                            
                                List of python keywords [duplicate]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Splitting large text file into smaller text files by line numbers using Python

Tags:

python

file

split

lines