
using a python generator to process large text files

I'm new to using generators and have read around a bit, but I need some help processing large text files in chunks. I know this topic has been covered, but the example code tends to come with very limited explanation, which makes it difficult to modify if one doesn't understand what is going on.

My problem is fairly simple, I have a series of large text files containing human genome sequencing data in the following format:

chr22   1   0
chr22   2   0
chr22   3   1
chr22   4   1
chr22   5   1
chr22   6   2

The files range between 1 GB and ~20 GB in size, which is too big to read into RAM. So I would like to read the lines in chunks/bins of, say, 10,000 lines at a time so that I can perform calculations on the final column in these bin sizes.

Based on this link here I have written the following:

def read_large_file(file_object):
    """A generator function to read a large file lazily."""

    bin_size=5000
    start=0
    end=start+bin_size

    # Read a block from the file: data
    while True:
        data = file_object.readlines(end) 
        if not data:
            break
        start=start+bin_size
        end=end+bin_size
        yield data


def process_file(path):

    try:
        # Open a connection to the file
        with open(path) as file_handler:
            # Create a generator object for the file: gen_file
            for block in read_large_file(file_handler):
                print(block)
                # process block

    except (IOError, OSError):
        print("Error opening / processing file")    
    return    

if __name__ == '__main__':
    path = 'C:/path_to/input.txt'
    process_file(path)

Within 'process_file' I expected the returned 'block' object to be a list 10,000 elements long, but it's not. The first list is 843 elements; the second is 2394 elements.

I want to get back N lines per block, but I am very confused by what is happening here.

This solution here seems like it could help, but again I don't understand how to modify it to read N lines at a time.

This here also looks like a really great solution, but again there isn't enough background explanation for me to understand it well enough to modify the code.

Any help would be really appreciated.

asked Apr 10 '18 by user3062260


2 Answers

Instead of playing with offsets in the file, try to build and yield lists of 10000 elements from a loop:

def read_large_file(file_handler, block_size=10000):
    block = []
    for line in file_handler:
        block.append(line)
        if len(block) == block_size:
            yield block
            block = []

    # don't forget to yield the last block
    if block:
        yield block

with open(path) as file_handler:
    for block in read_large_file(file_handler):
        print(block)
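An equivalent approach (not part of the original answer, just an alternative sketch) uses `itertools.islice` to pull at most `block_size` lines from the file iterator per chunk, which avoids the explicit counting:

```python
from itertools import islice

def read_large_file(file_handler, block_size=10000):
    """Yield successive lists of up to block_size lines from an open file."""
    while True:
        # islice consumes at most block_size lines from the file iterator;
        # the final block may be shorter, and an empty list means EOF
        block = list(islice(file_handler, block_size))
        if not block:
            return
        yield block
```

Both versions read lazily, so memory use stays proportional to one block regardless of file size.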
answered Nov 15 '22 by pawamoy

Not a proper answer, but finding out the why of this behaviour takes approximately 27 seconds:

(blook)bruno@bigb:~/Work/blookup/src/project$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
pythonrc start
pythonrc done
>>> help(file.readlines)

Help on method_descriptor:

readlines(...)
    readlines([size]) -> list of strings, each a line from the file.

    Call readline() repeatedly and return a list of the lines so read.
    The optional size argument, if given, is an approximate bound on the
    total number of bytes in the lines returned.

I understand that not everyone here is a professional programmer, and of course the documentation is not always enough to solve a problem (and I happily answer those kinds of questions), but really, the number of questions where the answer is written in plain letters at the start of the doc becomes a bit annoying.

answered Nov 15 '22 by bruno desthuilliers