I have a text file which contains a timestamp on each line. My goal is to find the time range. All the times are in order, so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in Python?
Note: These files are relatively large in length, about 1-2 million lines each and I have to do this for several hundred files.
By default, head shows you the first 10 lines of a file. You can change this by typing head -number filename, where number is the number of lines you want to see. To look at the last few lines of a file, use the tail command.
Here the .txt file is opened in read mode and f.readlines()[-1] is used to read the last line of the file. The [-1] index is used because readlines() returns all the lines as a list, and [-1] gives the last element of that list. Note that this reads the whole file into memory.
readline() is a method of file objects that returns one line from the file. Open the file with open(filename, mode) using mode "r" and call readline() on the resulting file object to get the first line of the file.
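As a minimal sketch of the two calls described above, the following writes a small sample file and then reads its first line with readline() and its last line with readlines()[-1]. The file name timestamps.txt and the timestamps in it are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample data: one timestamp per line, already in order.
sample = "2021-01-01 00:00:00\n2021-01-01 12:30:00\n2021-01-02 08:15:00\n"

# Write the sample to a temporary file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "timestamps.txt")
with open(path, "w") as f:
    f.write(sample)

with open(path, "r") as f:
    first = f.readline()      # Reads only the first line.
    # Reads ALL remaining lines into a list and keeps the final one.
    # Raises IndexError if the file has only a single line.
    last = f.readlines()[-1]

print(first.strip())
print(last.strip())
```

This is simple, but readlines() still has to read every line, so it is not a good fit for files with millions of lines.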
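To read both the first and final line of a file you could read the first line with readline(), then seek to the end of the file, step backwards until a line break is found, and read the last line from there:

    def readlastline(f):
        try:
            f.seek(-2, 2)              # Jump to the second last byte.
            while f.read(1) != b"\n":  # Until EOL is found ...
                f.seek(-2, 1)          # ... jump back, over the read byte plus one more.
        except OSError:                # Start of file reached (single line or empty file).
            f.seek(0)
        return f.read()                # Read all data from this point on.

    with open(file, "rb") as f:        # 'file' is the path to the file.
        first = f.readline()
        last = readlastline(f)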
Jump directly to the second last byte so that a trailing newline character does not cause an empty line to be returned*.
The current offset moves ahead by one every time a byte is read, so stepping backwards is done two bytes at a time: past the byte that was just read and the byte to read next.
The whence parameter passed to f.seek(offset, whence=0) indicates that seek should move to a position offset bytes relative to...
0 or os.SEEK_SET = The beginning of the file.
1 or os.SEEK_CUR = The current position.
2 or os.SEEK_END = The end of the file.

* As would be expected, since the default behavior of most applications, including print and echo, is to append a newline to every line written. This has no effect on lines missing a trailing newline character.
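The three whence values described above can be tried out on an in-memory binary file. This is only an illustrative sketch; the content of the buffer is made up.

```python
import io
import os

# An in-memory binary "file" with two lines (made-up content).
f = io.BytesIO(b"first\nlast\n")

f.seek(2, os.SEEK_SET)    # 2 bytes from the beginning of the file.
from_start = f.read(1)    # Reads the byte at offset 2.

f.seek(1, os.SEEK_CUR)    # Skip 1 byte forward from the current position.
from_current = f.read(1)  # Reads the byte at offset 4.

f.seek(-2, os.SEEK_END)   # 2 bytes before the end of the file.
from_end = f.read(1)      # Reads the last byte before the trailing newline.

print(from_start, from_current, from_end)
```

Files opened in text mode only accept an offset of 0 together with os.SEEK_CUR and os.SEEK_END, which is why the code above opens the file with "rb".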
The question states: "1-2 million lines each and I have to do this for several hundred files."
I timed this method and compared it against the top answer.
10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s.
Millions of lines would increase the difference a lot more.
Exact code used for timing:

    with open(file, "rb") as f:
        first = f.readline()  # Read and store the first line.
        for last in f:        # Read all lines, keep final value.
            pass
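A comparison of this kind could be reproduced roughly as follows. This is a sketch under assumptions, not the harness used for the numbers above: the function names last_by_seek and last_by_loop, the generated sample file, and the iteration count are all made up for illustration, and the timings will depend on the machine and the file.

```python
import os
import tempfile
import timeit

def last_by_seek(path):
    """Read the first line, then step backwards from the end to the last line break."""
    with open(path, "rb") as f:
        first = f.readline()
        try:
            f.seek(-2, os.SEEK_END)
            while f.read(1) != b"\n":
                f.seek(-2, os.SEEK_CUR)
        except OSError:  # Start of file reached: single line or empty file.
            f.seek(0)
        return first, f.read()

def last_by_loop(path):
    """Read the first line, then iterate over every remaining line."""
    with open(path, "rb") as f:
        first = f.readline()
        last = first     # Covers files with a single line.
        for last in f:
            pass
        return first, last

# Build a hypothetical sample file of 6000 short lines.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    for i in range(6000):
        f.write(b"2021-01-01 00:00:%05d\n" % i)

for func in (last_by_seek, last_by_loop):
    seconds = timeit.timeit(lambda: func(path), number=100)
    print(f"{func.__name__}: {seconds:.3f}s for 100 iterations")
```

Both functions should return the same pair of lines; only the amount of data read differs.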
A more complex, and harder to read, variation to address comments and issues raised since.
Also adds support for multibyte delimiters, readlast(b'X<br>Y', b'<br>', fixed=False).
Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify to your need, or do not use it at all, as you're probably better off using f.readlines()[-1] with files opened in text mode.
    #!/bin/python3

    from os import SEEK_END

    def readlast(f, sep, fixed=True):
        r"""Read the last segment from a file-like object.

        :param f: File to read last line from.
        :type  f: file-like object
        :param sep: Segment separator (delimiter).
        :type  sep: bytes, str
        :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
        :type  fixed: bool
        :returns: Last line of file.
        :rtype: bytes, str
        """
        bs = len(sep)
        step = bs if fixed else 1
        if not bs:
            raise ValueError("Zero-length separator.")
        try:
            o = f.seek(0, SEEK_END)
            o = f.seek(o-bs-step)      # - Ignore trailing delimiter 'sep'.
            while f.read(bs) != sep:   # - Until reaching 'sep': Read sep-sized block
                o = f.seek(o-step)     #   and then seek to the block to read next.
        except (OSError, ValueError):  # - Beginning of file reached.
            f.seek(0)
        return f.read()

    def test_readlast():
        from io import BytesIO, StringIO

        # Text mode.
        f = StringIO("first\nlast\n")
        assert readlast(f, "\n") == "last\n"

        # Bytes.
        f = BytesIO(b'first|last')
        assert readlast(f, b'|') == b'last'

        # Bytes, UTF-8.
        f = BytesIO("X\nY\n".encode("utf-8"))
        assert readlast(f, b'\n').decode() == "Y\n"

        # Bytes, UTF-16.
        f = BytesIO("X\nY\n".encode("utf-16"))
        assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"

        # Bytes, UTF-32.
        f = BytesIO("X\nY\n".encode("utf-32"))
        assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"

        # Multichar delimiter.
        f = StringIO("X<br>Y")
        assert readlast(f, "<br>", fixed=False) == "Y"

        # Make sure you use the correct delimiters.
        seps = { 'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00' }
        assert "\n".encode('utf8' )     == seps['utf8']
        assert "\n".encode('utf16')[2:] == seps['utf16']
        assert "\n".encode('utf32')[4:] == seps['utf32']

        # Edge cases.
        edges = (
            # Text , Match
            (""    , ""  ),  # Empty file, empty string.
            ("X"   , "X" ),  # No delimiter, full content.
            ("\n"  , "\n"),
            ("\n\n", "\n"),
            # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
            (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
        )
        for txt, match in edges:
            for enc, sep in seps.items():
                assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match

    if __name__ == "__main__":
        import sys
        for path in sys.argv[1:]:
            with open(path) as f:
                print(f.readline()     , end="")
                print(readlast(f, "\n"), end="")
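See the docs for the io module.

    with open(fname, 'rb') as fh:
        first = next(fh).decode()           # First line of the file.
        fh.seek(-1024, 2)                   # Jump to 1024 bytes before the end;
                                            # raises OSError if the file is shorter.
        last = fh.readlines()[-1].decode()  # Last line within that final block.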
The tunable value here is 1024: it stands for an estimate of the line length. I chose 1024 only as an example. If you have an estimate of the average line length, you could just use that value times 2.
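If the estimate turns out to be too small, the block read from the end may not contain a complete last line. One way to guard against that, sketched here under assumptions, is to double the window until it holds more than one line or covers the whole file. The function name first_and_last and the guess parameter are hypothetical, not part of the answer above.

```python
import os

def first_and_last(path, guess=1024):
    """Return (first, last) lines as bytes; `guess` is an estimated line length."""
    with open(path, "rb") as fh:
        first = fh.readline()
        size = fh.seek(0, os.SEEK_END)  # seek() returns the new position, i.e. the file size.
        window = max(1, guess * 2)      # Start with twice the estimated line length.
        while True:
            start = max(0, size - window)
            fh.seek(start)
            lines = fh.readlines()
            # More than one line in the block means the final one starts right
            # after a line break, so it is complete. A block that starts at
            # offset 0 covers the whole file, so it is complete as well.
            if len(lines) > 1 or start == 0:
                return first, (lines[-1] if lines else b"")
            window *= 2                 # Estimate was too small: widen and retry.
```

With a reasonable estimate the loop body runs once, so only a few kilobytes are read from the end of each file.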
Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:

    for line in fh:
        pass
    last = line

You don't need to bother with the binary flag here; you could just use open(fname).
ETA: Since you have many files to work on, you could create a sample of a couple of dozen files using random.sample and run this code on them, with an a priori large value for the position shift (let's say 1 MB), to determine the length of the last line. This will help you estimate the value for the full run.
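The sampling idea could look roughly like this. It is a sketch under assumptions: the helper names last_line_length and estimate_shift, the default sample size of 24 and the 1 MB shift are illustrative choices, and paths is assumed to be a list of file paths.

```python
import os
import random

def last_line_length(path, shift=1024 * 1024):
    """Length in bytes of the last line, reading at most `shift` bytes from the end."""
    with open(path, "rb") as fh:
        size = fh.seek(0, os.SEEK_END)
        fh.seek(max(0, size - shift))  # Never seek before the start of the file.
        lines = fh.readlines()
    # If the last line is longer than `shift`, this is only a lower bound.
    return len(lines[-1]) if lines else 0

def estimate_shift(paths, sample_size=24, shift=1024 * 1024):
    """Longest last line found in a random sample of `paths` (a list of paths)."""
    sample = random.sample(paths, min(sample_size, len(paths)))
    return max((last_line_length(p, shift) for p in sample), default=0)
```

The returned value, with some safety margin, can then be used as the position shift for the full run.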