In Python, f.readline() returns the next line from the file f. That is, it starts at the current position of f, reads until it encounters a line break, returns everything in between, and updates the position of f.
Now I want to do exactly the same, but for whitespace-separated files (not only newlines). For example, consider a file f with the content
token1 token2
token3 token4
token5
So I'm looking for some function readtoken() such that after opening f, the first call of f.readtoken() returns token1, the second call returns token2, and so on.
For efficiency and to avoid problems with very long lines or very large files, there should be no buffering.
I was almost sure that this should be possible "out of the box" with the standard library. However, I didn't find any suitable function, or a way to redefine the delimiters for readline().
You'd need to create a wrapper function; this is easy enough:
def read_by_tokens(fileobj):
    for line in fileobj:
        for token in line.split():
            yield token
Note that .readline() doesn't just read a file character by character until a newline is encountered; the file is read in blocks (a buffer) to improve performance.
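If the real concern is a single line too large to hold in memory, a variant that reads the file in fixed-size chunks never materializes a whole line at once. This is a sketch, not part of the answer above; read_by_chunks and chunk_size are made-up names, and the only subtlety is carrying a possibly incomplete token across chunk boundaries:

```python
import io

def read_by_chunks(fileobj, chunk_size=8192):
    # Hypothetical helper: yields whitespace-separated tokens while holding
    # at most roughly chunk_size characters in memory at a time.
    partial = ''  # token fragment carried over from the previous chunk
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        tokens = (partial + chunk).split()
        if chunk[-1].isspace():
            # Chunk ended on whitespace, so every token is complete.
            partial = ''
        else:
            # The last token may continue in the next chunk; hold it back.
            partial = tokens.pop()
        yield from tokens
    if partial:
        yield partial

# Demo with an in-memory file and a deliberately tiny chunk size:
tokens = list(read_by_chunks(
    io.StringIO('token1 token2\ntoken3 token4\ntoken5'), chunk_size=4))
print(tokens)  # ['token1', 'token2', 'token3', 'token4', 'token5']
```

Splitting on the combined string and holding back the final fragment works because str.split() with no arguments already treats any run of whitespace, including newlines, as a single delimiter.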
The above method reads the file by lines but yields the result split on whitespace. Use it like:
with open('somefilename') as f:
    for token in read_by_tokens(f):
        print(token)
Because read_by_tokens() is a generator, you either need to loop directly over the function result, or use the next() function to get tokens one by one:
with open('somefilename') as f:
    tokenized = read_by_tokens(f)
    # read first two tokens separately
    first_token = next(tokenized)
    second_token = next(tokenized)
    for token in tokenized:
        # loops over all tokens *except the first two*
        print(token)