 

Efficiently read data in Python (only one line)

For an upcoming programming competition I solved a few tasks from former competitions. Each task looks like this: we get a bunch of in-files (each containing one line of numbers and strings, e.g. "2 15 test 23 ..."), and we have to build a program that returns some computed values.

These in-files can be quite large: for instance 10 MB. My code is the following:

with open(filename) as f:
    input_data = f.read().split()

This is quite slow, I guess mostly because of the split method. Is there a faster way?

asked Jan 29 '26 by Jakube

2 Answers

What you have already looks like the best way for plain text IO on a one-line file.

10 MB of plain text is fairly large. If you need more speed, you could consider pickling the data into a binary format instead of a plain-text one, or, if the data is very repetitive, storing it compressed.
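A minimal sketch of both ideas, assuming you can afford a one-time conversion pass (the file names input.txt and input.pkl.gz are made up for illustration; pickle and gzip are standard-library modules):

import gzip
import pickle

# One-time conversion: parse the text file once and store the token
# list as a compressed binary pickle.
with open('input.txt') as f:
    tokens = f.read().split()

with gzip.open('input.pkl.gz', 'wb') as f:
    pickle.dump(tokens, f, protocol=pickle.HIGHEST_PROTOCOL)

# Later runs load the binary form and skip the text parsing entirely.
with gzip.open('input.pkl.gz', 'rb') as f:
    tokens = pickle.load(f)

Whether this pays off depends on how often you reread the same file; for a single pass, the conversion itself costs at least one full text parse anyway.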

answered Jan 31 '26 by wim


If one of your input files contains independent tasks (that is, you can work on a couple of tokens of the line at a time, without knowing tokens further ahead), you can interleave reading and processing by simply not reading the whole file at once.

def read_groups(f):
    chunksize = 4096  # how many characters to read from the file at once
    buf = f.read(chunksize)
    while buf:
        if entire_group_inside(buf):  # checks whether buf holds a complete group
            i = next_group_index(buf)  # index where the next group starts
            group, buf = buf[:i], buf[i:]
            yield group
        else:
            more = f.read(chunksize)
            if not more:  # EOF: yield whatever is left and stop
                yield buf
                break
            buf += more

with open(filename) as f:
    for data in read_groups(f):
        pass  # do something with this group of tokens

This has some advantages:

  • You don't need to read the whole file into memory (which, for 10 MB on a desktop, probably doesn't matter much)
  • If you do a lot of processing on each group of tokens, it may lead to better performance, as you'll be alternating I/O-bound and CPU-bound work. Modern OSs use sequential prefetching to optimize linear file access, so, in practice, if you interleave I/O and CPU work, the OS will end up performing your I/O in parallel. Even if your OS has no such functionality, a modern disk will probably cache sequential access to blocks.

If you don't have much processing, though, your task is fundamentally I/O-bound, and, as wim said, there isn't much you can do to speed it up as it stands other than rethinking your input data format.

answered Jan 31 '26 by loopbackbee


