Efficiently processing data in text file

Tags: python, file

Let's assume I have a text file with the following structure (name, score):

 a         0
 a         1
 b         0
 c         0
 d         3
 b         2

And so on. My aim is to sum the scores for every name and order the results from highest to lowest score. So in this case, I want the following output:

 d         3
 b         2
 a         1
 c         0

I do not know in advance which names will appear in the file.

I was wondering if there is an efficient way to do this. My text file can contain up to 50,000 entries.

The only way I can think of is to start at line 1, remember that name, then go over the whole file looking for that name and summing its scores. That looks horribly inefficient, so I was wondering if there is a better way.
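In rough Python, the approach I have in mind would look something like this (a sketch only; it re-scans all lines for every distinct name, so it is O(n²) for n lines):

with open('my_file.txt') as f:
    lines = [line.split() for line in f]

totals = {}
for name, _ in lines:
    if name not in totals:
        # Re-scan every line to sum this name's scores.
        totals[name] = sum(int(score) for n, score in lines if n == name)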

Nigel Overmars asked Dec 04 '15

2 Answers

Read all data into a dictionary:

from collections import defaultdict
from operator import itemgetter

# Sum the scores per name in a single pass over the file.
scores = defaultdict(int)
with open('my_file.txt') as fobj:
    for line in fobj:
        name, score = line.split()
        scores[name] += int(score)

and the sorting:

# Sort by score (the second item), highest first.
for name, score in sorted(scores.items(), key=itemgetter(1), reverse=True):
    print(name, score)

prints:

d 3
b 2
a 1
c 0
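Note that Python's sort is stable, so names with equal scores keep the order in which they were first seen. If you want ties broken alphabetically instead, sort on a combined key:

# Descending score first, then ascending name for equal scores.
for name, score in sorted(scores.items(), key=lambda item: (-item[1], item[0])):
    print(name, score)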

Performance

To compare the performance of this answer with the one from @SvenMarnach, I put both approaches into a function. Here fobj is a file object opened for reading. I use io.StringIO so that IO delays are, hopefully, not part of the measurement:

from collections import Counter

def counter(fobj):
    # Accumulate with Counter.update(), sort via most_common().
    scores = Counter()
    fobj.seek(0)
    for line in fobj:
        key, score = line.split()
        scores.update({key: int(score)})
    return scores.most_common()

from collections import defaultdict
from operator import itemgetter

def default(fobj):
    # Accumulate with defaultdict and +=, sort explicitly.
    scores = defaultdict(int)
    fobj.seek(0)
    for line in fobj:
        name, score = line.split()
        scores[name] += int(score)
    return sorted(scores.items(), key=itemgetter(1), reverse=True)
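For reference, a file object for such a test can be created along these lines (an illustrative setup; the exact data behind the timings below is not shown):

import io

# In-memory file object, so disk IO does not skew the timings.
# The data here is just the question's sample, repeated.
data = 'a 0\na 1\nb 0\nc 0\nd 3\nb 2\n' * 100
fobj = io.StringIO(data)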

Results for collections.Counter:

%timeit counter(fobj)
10000 loops, best of 3: 59.1 µs per loop

Results for collections.defaultdict:

%timeit default(fobj)
10000 loops, best of 3: 15.8 µs per loop

It looks like defaultdict is about four times faster. I would not have guessed that, but when it comes to performance you need to measure.
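A likely reason, presumably, is that every scores.update({key: int(score)}) call builds a throwaway one-entry dict and goes through Counter's generic update machinery, while defaultdict's += is a plain dict operation. Counter also supports direct item assignment, so a hybrid along these lines should close most of the gap:

def counter_direct(fobj):
    # Use Counter like a plain dict: += avoids the per-line
    # temporary dict that Counter.update() needs.
    scores = Counter()
    fobj.seek(0)
    for line in fobj:
        key, score = line.split()
        scores[key] += int(score)
    return scores.most_common()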

Mike Müller answered Oct 19 '22

This is a good use case for collections.Counter:

from collections import Counter

# Counter sums the per-name scores; most_common() later
# returns them ordered from highest to lowest total.
scores = Counter()
with open('my_file') as f:
    for line in f:
        key, score = line.split()
        scores.update({key: int(score)})

for key, score in scores.most_common():
    print(key, score)
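As an aside, most_common() also accepts an optional count if you only need the top few names:

# Print only the two highest-scoring names.
for key, score in scores.most_common(2):
    print(key, score)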
Sven Marnach answered Oct 19 '22