In Python, f.readline() returns the next line from the file f. That is, it starts at the current position of f, reads till it encounters a line break, returns everything in between and updates the position of f.
Now I want to do exactly the same, but for whitespace-separated files (not just newline-separated ones). For example, consider a file f with the content
token1 token2
token3                            token4
         token5
So I'm looking for some function readtoken() such that, after opening f, the first call to f.readtoken() returns token1, the second call returns token2, and so on.
For efficiency, and to avoid problems with very long lines or very large files, there should be no unbounded buffering (i.e. the whole line or file should never have to be held in memory).
I was almost sure this would be possible "out of the box" with the standard library. However, I couldn't find a suitable function, or any way to redefine the delimiter used by readline().
You'd need to create a wrapper function; this is easy enough:
def read_by_tokens(fileobj):
    """Yield whitespace-separated tokens from a file object, line by line."""
    for line in fileobj:
        # str.split() with no arguments splits on any run of whitespace
        for token in line.split():
            yield token
Note that .readline() doesn't just read a file character by character until a newline is encountered; the file is read in blocks (a buffer) to improve performance.
The above method reads the file line by line, but yields each line's content split on whitespace. Use it like:
with open('somefilename') as f:
    for token in read_by_tokens(f):
        print(token)
Because read_by_tokens() is a generator, you either need to loop directly over the function result, or use the next() function to get tokens one by one:
with open('somefilename') as f:
    tokenized = read_by_tokens(f)
    # read first two tokens separately
    first_token = next(tokenized)
    second_token = next(tokenized)
    for token in tokenized:
        # loops over all tokens *except the first two*
        print(token)