Read file up to a character

Tags:

python

I am writing a script to process X12 EDI files, which I would like to iterate over record by record. The files are composed of a sequence of distinct records, each terminated by a special character (e.g. ~, but see below). The files may be large (>100 MB), so I do not want to read the whole thing in and split it. The records are not newline-separated; reading in the first line would probably read the whole file. The files are all-ASCII.

Python clearly provides for reading a file up to a certain character, provided that that character is a newline. I would like to do the same thing with an arbitrary character. I presume that reading by line is implemented via buffering. I could implement my own buffered reader, but I would rather avoid the extra code and the overhead if there is a better solution.

Note: I've seen a few similar questions, but they all seemed to conclude that one should read the file in by the line, presuming that the lines would be a reasonable size. In this case, the whole file will probably be one line.

Edit: The segment terminator character is whatever the 106th byte of the file is. It is not known before the script is invoked.
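For illustration, a minimal sketch of looking up that byte without reading the rest of the file. The helper name detect_terminator is hypothetical, and the sketch assumes the file is at least 106 bytes long and all-ASCII, as described above:

```python
def detect_terminator(path):
    """Return the 106th byte of the file as a one-character string."""
    with open(path, "rb") as f:
        f.seek(105)        # zero-based offset of the 106th byte
        byte = f.read(1)   # read exactly one byte
    if not byte:
        raise ValueError("file is shorter than 106 bytes")
    return byte.decode("ascii")
```

Opening the file in binary mode keeps the offset a plain byte count, so no newline translation or decoding can shift it.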


asked Feb 17 '16 14:02

Thom Smith

2 Answers

If there aren't going to be newlines in the file to start with, transform the file before piping it into your Python script, e.g.:

tr '~' '\n' < source.txt | python my-script.py

Then read from sys.stdin with readline(), readlines(), or for line in sys.stdin: as appropriate.
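As a rough sketch of the receiving side, the script could wrap the input stream in a small generator. The function name records is hypothetical, and skipping blank lines is an assumption (it would hide empty records, but it also drops the blank lines produced if the source file already had a newline after each terminator):

```python
import sys

def records(stream):
    """Yield each non-empty record from a newline-delimited text stream."""
    for line in stream:
        record = line.rstrip("\n")  # drop the newline that tr substituted
        if record:                  # assumption: blank lines carry no data
            yield record

# Usage inside the script fed by the tr pipeline:
#     for record in records(sys.stdin):
#         print(record)
```

Because the stream is consumed lazily, only one record is held in memory at a time, regardless of the file size.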


answered Sep 28 '22 06:09

Wolf

This is still far from optimal, but it would be a pure-Python implementation of a very simple buffer:

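def my_open(filename, char):
    # Yield the records in `filename`, split on the one-character terminator `char`.
    with open(filename) as f:
        old_fb = ""  # incomplete record carried over from the previous chunk
        # Read 1024 characters at a time until read() returns '' at end of file.
        for file_buffer in iter(lambda: f.read(1024), ''):
            file_buffer = old_fb + file_buffer
            pos = file_buffer.find(char)
            while pos != -1:
                yield file_buffer[:pos]
                file_buffer = file_buffer[pos+1:]
                pos = file_buffer.find(char)
            old_fb = file_buffer
        # Yield trailing data only if the file does not end with the terminator;
        # otherwise a spurious empty record would be produced.
        if old_fb:
            yield old_fb

# Usage:
for line in my_open("weirdfile", "~"):
    print(line)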
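A possible variant of the same idea, sketched here as an assumption rather than a measured improvement: read larger chunks in binary mode and let bytes.split do the scanning, keeping only the last (possibly incomplete) piece for the next chunk. The name iter_records and the 64 KiB default chunk size are illustrative choices:

```python
def iter_records(path, terminator, chunk_size=64 * 1024):
    """Yield ASCII records from `path`, split on a one-character terminator."""
    delim = terminator.encode("ascii")
    tail = b""  # incomplete record carried over between chunks
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            parts = (tail + chunk).split(delim)
            tail = parts.pop()  # last piece may continue in the next chunk
            for part in parts:
                yield part.decode("ascii")
    if tail:  # data after the final terminator, if any
        yield tail.decode("ascii")
```

Splitting a whole chunk at once avoids re-slicing the buffer after every record, which is where the find-and-slice loop spends most of its time when records are short.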

answered Sep 28 '22 05:09

L3viathan
