I'm new to using generators and have read around a bit, but I need some help processing large text files in chunks. I know this topic has been covered before, but the example code comes with very limited explanation, which makes it difficult to modify if you don't understand what is going on.
My problem is fairly simple: I have a series of large text files containing human genome sequencing data in the following format:
chr22 1 0
chr22 2 0
chr22 3 1
chr22 4 1
chr22 5 1
chr22 6 2
The files range from about 1 GB to ~20 GB in size, which is too big to read into RAM. So I would like to read the lines in chunks/bins of, say, 10000 lines at a time, so that I can perform calculations on the final column in these bin sizes.
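To make it concrete, the kind of per-bin calculation I have in mind is something like this (a simplified sketch; the mean of the last column is just a stand-in for the real analysis, and it assumes whitespace-delimited fields as in the sample above):
def process_bin(lines):
    # toy per-bin calculation: mean of the final (third) column
    values = [int(line.split()[-1]) for line in lines]
    return sum(values) / float(len(values))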
Based on this link here I have written the following:
def read_large_file(file_object):
    """A generator function to read a large file lazily."""
    bin_size = 5000
    start = 0
    end = start + bin_size
    # Read a block from the file: data
    while True:
        data = file_object.readlines(end)
        if not data:
            break
        start = start + bin_size
        end = end + bin_size
        yield data
def process_file(path):
    try:
        # Open a connection to the file
        with open(path) as file_handler:
            # Create a generator object for the file: gen_file
            for block in read_large_file(file_handler):
                print(block)
                # process block
    except (IOError, OSError):
        print("Error opening / processing file")
        return

if __name__ == '__main__':
    path = 'C:/path_to/input.txt'
    process_file(path)
Within 'process_file' I expected the returned 'block' object to be a list 10000 elements long, but it's not: the first list is 843 elements and the second is 2394 elements.
I want to get back N lines per block, but I am very confused by what is happening here.
This solution here seems like it could help, but again I don't understand how to modify it to read N lines at a time.
This one also looks like a really great solution, but again there isn't enough background explanation for me to understand it well enough to modify the code.
Any help would be really appreciated.
Instead of playing with offsets in the file, try to build and yield lists of 10000 elements from a loop:
def read_large_file(file_handler, block_size=10000):
    block = []
    for line in file_handler:
        block.append(line)
        if len(block) == block_size:
            yield block
            block = []
    # don't forget to yield the last block
    if block:
        yield block
with open(path) as file_handler:
    for block in read_large_file(file_handler):
        print(block)
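If you then want to run your calculation on the final column of each block, one way to wire that in is shown below (a minimal sketch; summing the third column is only a placeholder for whatever statistic you actually need):
with open(path) as file_handler:
    for block in read_large_file(file_handler):
        # last whitespace-delimited field of each line, as an integer
        values = [int(line.split()[-1]) for line in block]
        print(sum(values))  # replace with your real per-bin calculation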
Not a proper answer, but finding out the why of this behaviour takes approximately 27 seconds:
(blook)bruno@bigb:~/Work/blookup/src/project$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
pythonrc start
pythonrc done
>>> help(file.readlines)
Help on method_descriptor:
readlines(...)
    readlines([size]) -> list of strings, each a line from the file.

    Call readline() repeatedly and return a list of the lines so read.
    The optional size argument, if given, is an approximate bound on the
    total number of bytes in the lines returned.
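In other words, the size argument is an approximate bound in bytes, not a line count, which is why you get blocks of 843 and then 2394 lines. You can see it for yourself with something like this (a rough sketch; the exact line counts depend on how long your lines are):
with open(path) as file_handler:
    first = file_handler.readlines(5000)    # roughly 5000 bytes' worth of complete lines
    second = file_handler.readlines(10000)  # roughly 10000 bytes' worth of complete lines
    print(len(first))   # a few hundred lines, not 5000
    print(len(second))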
I understand that not everyone here is a professional programmer, and of course the documentation is not always enough to solve a problem (and I happily answer those kinds of questions), but really, the number of questions where the answer is written in plain letters at the start of the docs is becoming a bit annoying.