What's the fastest way to find the byte position of a specific line in a file, from the command line?
e.g.
$ linepos myfile.txt 13
5283
I'm writing a parser for a CSV that's several GB in size, and in the event the parser is halted, I'd like to be able to resume from the last position. The parser is in Python, but even iterating over file.readlines() takes a long time, since there are millions of rows in the file. I'd like to simply do file.seek(int(command.getoutput("linepos myfile.txt %i" % lastrow))), but I can't find a shell command to efficiently do this.
Edit: Sorry for the confusion, but I'm looking for a non-Python solution. I already know how to do this from Python.
The byte offset is just the count of bytes, starting at 0. The big question is: how are the 16-bit offsets for the branch instructions calculated? The big answer is: count the number of bytes to the destination. The first branch is in instruction 7 in the IJVM code, and at offset 11 in the hex byte code.
A byte offset is the number of bytes counted from a starting point. For example, the text "what is byte offset" spans byte offsets 0 through 18, so whatever follows it starts at byte offset 19. Hadoop uses byte offsets as key values.
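To make the key idea concrete: as far as I know, Hadoop's TextInputFormat keys each line by the byte offset at which that line starts in the file. Here is a minimal Python sketch of that pairing. The function name offset_keyed_lines is made up for illustration and is not part of Hadoop.

```python
import io

def offset_keyed_lines(stream):
    """Yield (byte_offset, line) pairs; the offset is where the line starts.

    `stream` must be a binary stream so that lengths are counted in bytes.
    """
    offset = 0
    for raw in stream:
        yield offset, raw.rstrip(b"\r\n")
        offset += len(raw)  # includes the newline byte(s)

data = io.BytesIO(b"what is byte offset\nsecond line\n")
pairs = list(offset_keyed_lines(data))
```

The first line is 19 bytes long, its newline sits at offset 19, and so the second line is keyed by offset 20.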
A position is used to report errors in CSV data. All positions include the byte offset, line number and record index at which the error occurred. Byte offsets and record indices start at 0. Line numbers start at 1. A CSV reader will automatically assign the position of each record.
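A rough Python sketch of such a position record may help show how the three counters differ. Everything here (Position, records_with_positions) is a hypothetical name, not the API of the library that passage describes. It assumes one record per line, no quoted fields or embedded newlines, and that blank lines are skipped without counting as records.

```python
import io
from dataclasses import dataclass

@dataclass
class Position:
    byte: int    # byte offset where the record starts, 0-based
    line: int    # line number where the record starts, 1-based
    record: int  # record index, 0-based

def records_with_positions(stream):
    """Yield (Position, fields) for a binary stream with one record per line."""
    byte, record = 0, 0
    for line_no, raw in enumerate(stream, start=1):
        text = raw.decode("utf-8").rstrip("\r\n")
        if text:  # assumption: blank lines are not records
            yield Position(byte, line_no, record), text.split(",")
            record += 1
        byte += len(raw)  # advance past this line, newline included

sample = io.BytesIO(b"a,b\n\nc,d\n")
positions = [pos for pos, _ in records_with_positions(sample)]
```

With the blank line in the sample, the second record starts on line 3 but has record index 1, which is why an error report needs both numbers.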
From @chepner's comment on my other answer:
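position = 0  # or wherever you left off last time
try:
    # Binary mode keeps seek() and tell() in plain byte offsets.
    with open('myfile.txt', 'rb') as file:
        file.seek(position)  # zero in base case
        # Use readline() rather than "for line in file": plain iteration
        # reads ahead, which makes tell() unreliable.
        for line in iter(file.readline, b''):
            # process the line
            position = file.tell()  # start of the next unprocessed line
except Exception:
    print('exception occurred at position {}'.format(position))
    raise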
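If the offset of a given line is needed without a saved position, one option is to count newline bytes in large binary chunks rather than iterating line by line. This is only a sketch: line_offset is a hypothetical helper standing in for the question's linepos, and it assumes 1-based line numbers and "\n" line endings.

```python
import os
import tempfile

def line_offset(path, line_number, chunk_size=1 << 20):
    """Return the byte offset where 1-based line `line_number` starts.

    Returns None if the file holds fewer than line_number - 1 newlines.
    """
    if line_number < 1:
        raise ValueError("line_number must be >= 1")
    if line_number == 1:
        return 0
    remaining = line_number - 1  # newlines still to skip
    consumed = 0                 # bytes in chunks already scanned
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return None
            count = chunk.count(b"\n")
            if count < remaining:
                remaining -= count
                consumed += len(chunk)
                continue
            # The target newline is inside this chunk; walk to it.
            idx = -1
            for _ in range(remaining):
                idx = chunk.index(b"\n", idx + 1)
            return consumed + idx + 1

# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first\nsecond\nthird\n")
demo_offset = line_offset(tmp.name, 3)
os.remove(tmp.name)
```

The returned value can be passed straight to file.seek() on a file opened in binary mode.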