
How do I split a huge text file in Python?

I have a huge text file (~1GB) and sadly the text editor I use won't read such a large file. However, if I can just split it into two or three parts I'll be fine, so, as an exercise, I wanted to write a program in Python to do it.

What I think I want the program to do is find the size of the file, divide that number into parts, and for each part read up to that point in chunks, writing to a filename.nnn output file, then read up to the next line break and write that, then close the output file, and so on. Obviously the last output file just copies to the end of the input file.

Can you help me with the key filesystem-related parts: file size, reading and writing in chunks, and reading to a line break?

I'll be writing this code test-first, so there's no need to give me a complete answer, unless it's a one-liner ;-)

asked Nov 14 '08 by quamrana


3 Answers

Linux has a split command:

split -l 100000 file.txt

which would split file.txt into files of 100,000 lines each.
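
If you want to drive this from Python rather than the shell, one option (a sketch, not part of the original answer) is to call split through the subprocess module:

import subprocess

# Illustrative only: split file.txt into pieces of 100,000 lines each.
# split's default output names are xaa, xab, xac, ...
subprocess.run(["split", "-l", "100000", "file.txt"], check=True)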

answered Oct 02 '22 by James

Check out os.stat() for file size and file.readlines([sizehint]). Those two functions should be all you need for the reading part, and hopefully you know how to do the writing :)
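
To make that concrete, here is a minimal sketch along those lines (the function name, output naming, and chunk size are illustrative, not part of the answer):

import os

def split_file(path, chunk_bytes=100 * 2**20):    # roughly 100 MB per part
    total = os.stat(path).st_size                 # file size, as suggested above
    print("splitting", total, "bytes into parts of about", chunk_bytes, "bytes")
    part = 0
    with open(path) as src:
        while True:
            # readlines(sizehint) reads about sizehint bytes of *whole* lines,
            # so each part ends cleanly on a line break
            lines = src.readlines(chunk_bytes)
            if not lines:
                break
            with open("%s.%03d" % (path, part), "w") as out:
                out.writelines(lines)
            part += 1

split_file("biglog.txt")

The output files follow the filename.nnn scheme from the question (biglog.txt.000, biglog.txt.001, ...).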

answered Oct 02 '22 by Kamil Kisiel


As an alternative method, using the logging library:

>>> import logging.handlers
>>> log = logging.getLogger()
>>> fh = logging.handlers.RotatingFileHandler("D://filename.txt",
...     maxBytes=2**20*100, backupCount=100)  # 100 MB each, up to 100 files
>>> log.addHandler(fh)
>>> log.setLevel(logging.INFO)
>>> with open("D://biglog.txt") as f:
...     for line in f:              # unlike "while True", this stops at end of file
...         log.info(line.rstrip())

Your files will appear as follows:

filename.txt (end of file)
filename.txt.1
filename.txt.2
...
filename.txt.10 (start of file)

This is a quick and easy way to slice a huge log file into chunks sized to match your RotatingFileHandler configuration.

answered Oct 02 '22 by Alex L