I am writing a script to process X12 EDI files, which I would like to iterate line-by-line. The files are composed of a sequence of distinct records, each terminated by a special character (e.g. ~, but see below). The files may be large (>100 MB), so I do not want to read the whole thing in and split it. The records are not newline-separated; reading in the first line would probably read the whole file. The files are all-ASCII.
Python clearly provides for reading a file up to a certain character, provided that that character is a newline. I would like to do the same thing with an arbitrary character. I presume that reading by line is implemented via buffering. I could implement my own buffered reader, but I would rather avoid the extra code and the overhead if there is a better solution.
Note: I've seen a few similar questions, but they all seemed to conclude that one should read the file in by the line, presuming that the lines would be a reasonable size. In this case, the whole file will probably be one line.
Edit: The segment terminator character is whatever the 106th byte of the file is. It is not known before the script is invoked.
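Since the terminator is only known at run time, it can be read from the file itself before iterating. The following is a minimal sketch, assuming the file is opened in binary mode and that the 106th byte is at zero-based offset 105; the name read_segment_terminator is hypothetical.

```python
import io

TERMINATOR_OFFSET = 105  # zero-based index of the 106th byte


def read_segment_terminator(f):
    """Return the 106th byte of a binary file object as a length-1 bytes value."""
    f.seek(TERMINATOR_OFFSET)
    terminator = f.read(1)
    if not terminator:
        raise ValueError("file is shorter than 106 bytes")
    f.seek(0)  # rewind so the caller can iterate from the start
    return terminator


# Stand-in for open(path, "rb"): 105 filler bytes, then the terminator.
sample = io.BytesIO(b"x" * 105 + b"~" + b"GS*HC~")
print(read_segment_terminator(sample))  # b'~'
```

Because the files are all-ASCII, terminator.decode("ascii") gives the equivalent one-character string for use with a file opened in text mode.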
In C, fgetc() reads input from a file one character at a time. It returns the character at the current file position as an unsigned char converted to an int (its ASCII code, for ASCII data), or EOF at end of file or on error. After reading the character, the file position is advanced to the next character.
The getc() and putc() I/O functions read a character from a file and write a character to a file, respectively.
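The closest Python analogue to this character-at-a-time approach is calling read(1) on a file object. A rough sketch follows, assuming a text-mode file object and a one-character terminator; iter_records_by_char is a made-up name for illustration. File objects returned by open() are buffered, so this does not hit the disk once per character, but the per-call overhead in Python still makes it slow on large files.

```python
import io


def iter_records_by_char(f, terminator):
    """Yield records from a text file object, reading one character at a time."""
    parts = []
    # iter() with a sentinel calls f.read(1) until it returns '' at end of file.
    for ch in iter(lambda: f.read(1), ""):
        if ch == terminator:
            yield "".join(parts)
            parts = []
        else:
            parts.append(ch)
    if parts:  # trailing data with no final terminator
        yield "".join(parts)


# io.StringIO stands in for a real file opened with open(path).
for record in iter_records_by_char(io.StringIO("ISA*00~GS*HC~"), "~"):
    print(record)
```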
If there aren't going to be newlines in the file to start with, transform the file before piping it into your Python script, e.g.:
tr '~' '\n' < source.txt | my-script.py
Then use readline(), readlines(), or for line in file_object: as appropriate.
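On the Python side, the script then only has to iterate over standard input. A small sketch of what my-script.py might contain, assuming the tr step has already replaced every terminator with a newline (which means the terminator has to be known to the shell command); iter_lines is a hypothetical helper name.

```python
def iter_lines(stream):
    """Yield each line of a text stream with its trailing newline removed."""
    for line in stream:
        yield line.rstrip("\n")


# In my-script.py this would be driven by standard input:
#     import sys
#     for record in iter_lines(sys.stdin):
#         ...
```

Iterating over the stream reads it lazily, so the whole file is never held in memory at once.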
This is still far from optimal, but it would be a pure-Python implementation of a very simple buffer:
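def my_open(filename, char):
    """Yield records from filename, split on the single character char."""
    with open(filename) as f:
        old_fb = ""  # leftover text from the previous chunk (a partial record)
        # Read 1024 characters at a time until read() returns '' at end of file.
        for file_buffer in iter(lambda: f.read(1024), ''):
            file_buffer = old_fb + file_buffer
            pos = file_buffer.find(char)
            while pos != -1:
                yield file_buffer[:pos]
                file_buffer = file_buffer[pos+1:]
                pos = file_buffer.find(char)
            old_fb = file_buffer
        if old_fb:  # skip the empty remainder when the file ends with char
            yield old_fb

# Usage:
for line in my_open("weirdfile", "~"):
    print(line)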
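The same buffering idea can be written more compactly by letting str.split do the searching within each chunk. This is only a sketch under a few assumptions: it takes an already-open text file object rather than a filename, and the name iter_records and the 64 KiB default chunk size are arbitrary choices, not anything from the answer above.

```python
import io


def iter_records(f, terminator, chunk_size=64 * 1024):
    """Yield terminator-delimited records from a text file object."""
    tail = ""  # partial record carried over from the previous chunk
    for chunk in iter(lambda: f.read(chunk_size), ""):
        # Every piece before the last one is a complete record; the last
        # piece may be cut off mid-record, so keep it for the next chunk.
        *complete, tail = (tail + chunk).split(terminator)
        yield from complete
    if tail:  # data after the final terminator, if any
        yield tail


# io.StringIO stands in for a real file; a tiny chunk size forces
# records to straddle chunk boundaries.
for record in iter_records(io.StringIO("a~bb~ccc~d"), "~", chunk_size=3):
    print(record)
```

A larger chunk size means fewer read() calls, at the cost of holding more text in memory at once.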