
Read file up to a character

Tags:

python

I am writing a script to process X12 EDI files, which I would like to iterate over record by record. The files are composed of a sequence of distinct records, each terminated by a special character (e.g. ~, but see below). The files may be large (>100 MB), so I do not want to read the whole thing in and split it. The records are not newline-separated, so reading the first line would probably read the whole file. The files are all-ASCII.

Python clearly provides for reading a file up to a certain character, provided that that character is a newline. I would like to do the same thing with an arbitrary character. I presume that reading by line is implemented via buffering. I could implement my own buffered reader, but I would rather avoid the extra code and the overhead if there is a better solution.

Note: I've seen a few similar questions, but they all seemed to conclude that one should read the file in by the line, presuming that the lines would be a reasonable size. In this case, the whole file will probably be one line.

Edit: The segment terminator character is whatever the 106th byte of the file is. It is not known before the script is invoked.

Thom Smith asked Feb 17 '16 14:02




2 Answers

If there aren't going to be newlines in the file to start with, transform the file before piping it into your Python script, e.g.:

tr '~' '\n' < source.txt | my-script.py

Then use readline() or for line in file_object: as appropriate. Avoid readlines() on large files, since it loads every line into memory at once.
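As a rough sketch of what the receiving script could look like once tr has turned each terminator into a newline: the helper name iter_records and the sample data are hypothetical, and the sketch assumes the original data contains no literal newlines of its own. An in-memory stream stands in for sys.stdin here so the example runs by itself.

```python
import io


def iter_records(stream):
    """Yield each line of a text stream with its trailing newline removed."""
    for line in stream:
        yield line.rstrip("\n")


# In the real script the stream would be sys.stdin (fed by the tr pipeline);
# io.StringIO is used here only as a stand-in with made-up sample records.
sample = io.StringIO("ISA*00*\nGS*HC*\n")
records = list(iter_records(sample))
```

Because the stream is consumed lazily, only one record is held in memory at a time, which matters for files of the size mentioned in the question.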

Wolf answered Sep 28 '22 06:09



This is still far from optimal, but it would be a pure-Python implementation of a very simple buffer:

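def my_open(filename, char, chunk_size=1024):
    """Yield the pieces of a text file delimited by the single character char."""
    with open(filename) as f:
        old_fb = ""
        for file_buffer in iter(lambda: f.read(chunk_size), ''):
            # Prepend whatever was left over from the previous chunk.
            file_buffer = old_fb + file_buffer
            pos = file_buffer.find(char)
            while pos != -1:
                yield file_buffer[:pos]
                file_buffer = file_buffer[pos + 1:]
                pos = file_buffer.find(char)
            # Anything after the last terminator may continue in the next chunk.
            old_fb = file_buffer
        # Emit a final unterminated record, but not an empty one when the
        # file ends with the terminator.
        if old_fb:
            yield old_fb

# Usage:
for line in my_open("weirdfile", "~"):
    print(line)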
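The question's edit says the terminator is whatever the 106th byte of the file is, so it cannot be hard-coded. A minimal sketch of one way to handle that, assuming an all-ASCII file as stated in the question: read_terminator and iter_segments are hypothetical names, and the 64 KiB chunk size is an arbitrary choice, not a tuned value.

```python
def read_terminator(filename, position=106):
    """Return the character at a 1-based byte position of an ASCII file."""
    with open(filename, "rb") as f:
        f.seek(position - 1)  # byte 106 lives at zero-based offset 105
        byte = f.read(1)
    if not byte:
        raise ValueError("file is shorter than %d bytes" % position)
    return byte.decode("ascii")


def iter_segments(filename, chunk_size=64 * 1024):
    """Yield records split on the terminator found at byte 106."""
    terminator = read_terminator(filename)
    # newline="" disables newline translation so the data is passed through as-is.
    with open(filename, encoding="ascii", newline="") as f:
        leftover = ""
        for chunk in iter(lambda: f.read(chunk_size), ""):
            parts = (leftover + chunk).split(terminator)
            # The last piece may be an incomplete record; carry it forward.
            leftover = parts.pop()
            yield from parts
        if leftover:
            yield leftover
```

Using str.split on each chunk avoids repeatedly re-slicing the buffer; the last piece is held back because the record may continue in the next chunk.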
L3viathan answered Sep 28 '22 05:09
