
efficient way of reading integers from file

Tags:

python

I'd like to read all integers from a file into a single list. The numbers are separated by one or more spaces or newline characters. What is the most efficient and/or elegant way of doing this? I have two solutions, but I don't know whether they are any good.

  1. Checking for digits:

    my_list = []
    for line in open("foo.txt", "r"):
        for i in line.strip().split(' '):
            if i.isdigit():
                my_list.append(int(i))
    
  2. Dealing with exceptions:

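    my_list = []
    for line in open("foo.txt", "r"):
        # split into tokens; iterating over the line itself would yield single characters
        for i in line.split():
            try:
                my_list.append(int(i))
            except ValueError:
                pass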
    

Sample data:

1   2     3
 4 56
    789         
9          91 56   

 10 
11 
Marcel asked Jul 31 '15 09:07



2 Answers

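An efficient way is your first method with two small changes: open the file in a with statement so that it is closed reliably, and call split() without an argument so that any run of whitespace acts as a single separator. For example: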

my_list = []
with open("foo.txt", "r") as f:
    for line in f:
        for i in line.split():
            if i.isdigit():
                my_list.append(int(i))

Timing comparisons against the other methods follow.

The functions:

import csv  # required by func3

def func1():
    my_list = []
    for line in open("foo.txt", "r"):
        for i in line.strip().split(' '):
            if i.isdigit():
                my_list.append(int(i))
    return my_list

def func1_1():
    return [int(i) for line in open("foo.txt", "r") for i in line.strip().split(' ') if i.isdigit()]

def func1_3():
    my_list = []
    with open("foo.txt", "r") as f:
        for line in f:
            for i in line.split():
                if i.isdigit():
                    my_list.append(int(i))
    return my_list

def func2():
    my_list = []
    for line in open("foo.txt", "r"):
        for i in line.split():
            try:
                my_list.append(int(i))
            except ValueError:
                pass
    return my_list

def func3():
    my_list = []
    with open("foo.txt","r") as f:
        cf = csv.reader(f, delimiter=' ')
        for row in cf:
            my_list.extend([int(i) for i in row if i.isdigit()])
    return my_list

Results of the timing tests:

In [25]: timeit func1()
The slowest run took 4.70 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 204 µs per loop

In [26]: timeit func1_1()
The slowest run took 4.39 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 207 µs per loop

In [27]: timeit func1_3()
The slowest run took 5.46 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 191 µs per loop

In [28]: timeit func2()
The slowest run took 4.09 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 212 µs per loop

In [34]: timeit func3()
The slowest run took 4.38 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 202 µs per loop

Among the methods that store the data in a list, func1_3() was the fastest in these runs (191 µs per loop), although the margin over the others is small compared with the run-to-run variation that timeit reports.


That said, if you are handling very large files, you may be better off using a generator than storing the complete list in memory.
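A minimal sketch of the generator idea, assuming the same whitespace-separated format as foo.txt. The name iter_ints and the throwaway demo file are illustrative only and are not part of the original question:

```python
import itertools
import os
import tempfile


def iter_ints(path):
    """Lazily yield each all-digit token in the file at `path` as an int."""
    with open(path, "r") as f:
        for line in f:                  # the file object yields one line at a time
            for token in line.split():  # split() with no argument handles runs of whitespace
                if token.isdigit():
                    yield int(token)


# Demo on a temporary file so the sketch runs without foo.txt (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1   2     3\n 4 56\n    789\n\n 10 \n11 \n")
    demo_path = tmp.name

total = sum(iter_ints(demo_path))                            # no list is ever built
first_three = list(itertools.islice(iter_ints(demo_path), 3))  # read only what is needed
os.remove(demo_path)
```

Callers that really do need a list can still write list(iter_ints(path)); the generator only defers that decision.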


UPDATE: Comments suggested that func2() is faster than func1_3() (on my system it was never faster, even for a file containing only integers), so I changed foo.txt to include non-numeric tokens and repeated the timing tests:

foo.txt

1 2 10 11
asd dd
 dds asda
22 44 32 11   23
dd dsa dds
21 12
12
33
45
dds
asdas
dasdasd dasd das d asda sda

Test:

In [13]: %timeit func1_3()
The slowest run took 6.17 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 210 µs per loop

In [14]: %timeit func2()
1000 loops, best of 3: 279 µs per loop

In [15]: %timeit func1_3()
1000 loops, best of 3: 213 µs per loop

In [16]: %timeit func2()
1000 loops, best of 3: 273 µs per loop
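
The figures above come from IPython's %timeit. To repeat the comparison in a plain script, something along these lines should work with the standard timeit module. This is only a sketch: the function names with_isdigit and with_try, the sample data and the loop counts are assumptions chosen for illustration, and the absolute numbers will differ from machine to machine.

```python
import os
import tempfile
import timeit


def with_isdigit(path):
    """Same approach as func1_3: filter tokens with str.isdigit()."""
    result = []
    with open(path, "r") as f:
        for line in f:
            for token in line.split():
                if token.isdigit():
                    result.append(int(token))
    return result


def with_try(path):
    """Same approach as func2: attempt int() and ignore ValueError."""
    result = []
    with open(path, "r") as f:
        for line in f:
            for token in line.split():
                try:
                    result.append(int(token))
                except ValueError:
                    pass
    return result


# Hypothetical mixed input, similar in spirit to the updated foo.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1 2 10 11\nasd dd\n22 44 32 11   23\ndds\n21 12\n")
    path = tmp.name

results = {func.__name__: func(path) for func in (with_isdigit, with_try)}

timings = {}
for func in (with_isdigit, with_try):
    # repeat() returns one total per run; the minimum is the least noisy figure.
    runs = timeit.repeat(lambda: func(path), number=200, repeat=5)
    timings[func.__name__] = min(runs) / 200  # seconds per call
    print(f"{func.__name__}: {timings[func.__name__] * 1e6:.1f} us per call")

os.remove(path)
```

Both functions return the same list for this input, so any difference in the timings reflects only the cost of the isdigit() check versus the cost of raising and catching ValueError for the non-numeric tokens.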
Anand S Kumar answered Nov 11 '22 21:11



It's pretty easy if you can read the whole file as a string (i.e. it's not too large to fit in memory).

with open('foo.txt') as f:
    tokens = f.read().split()  # list of whitespace-separated strings
integers = [int(x) for x in tokens if x.isdigit()]

read() returns the whole file as one long string, and split() breaks it into a list of strings on whitespace (spaces, tabs and newlines). You can then use a list comprehension to convert the pieces that consist only of digits to integers.

As Bakuriu noted, if the file is guaranteed to contain only whitespace and numbers, you don't have to check isdigit(). Using list(map(int, open('foo.txt').read().split())) is enough in that case. That version raises a ValueError if any token is not a valid integer, whereas the isdigit() version silently skips anything that is not made up entirely of digits.
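
One practical consequence of that difference, shown as a small sketch: isdigit() also rejects signed numbers such as "-5", while int() accepts them. The token list and the helper name parse_ints below are made up for illustration.

```python
tokens = "12 -5 +7 3.5 abc 0042".split()

# Filter with str.isdigit(): only tokens made purely of digit characters pass,
# so signed numbers such as "-5" and "+7" are skipped.
by_isdigit = [int(t) for t in tokens if t.isdigit()]


def parse_ints(tokens):
    """Let int() decide: it accepts an optional sign; unparsable tokens are skipped."""
    out = []
    for t in tokens:
        try:
            out.append(int(t))
        except ValueError:
            pass
    return out


by_int = parse_ints(tokens)

print(by_isdigit)  # [12, 42]
print(by_int)      # [12, -5, 7, 42]
```

If the data may contain negative numbers, the try/except form (or the strict map(int, ...) form on clean input) is therefore the safer choice.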

SuperBiasedMan answered Nov 11 '22 23:11
