Efficiently processing data in text file

Tags: python, file

Let's assume I have a text file with the following structure (name, score):

 a         0
 a         1
 b         0
 c         0
 d         3
 b         2

And so on. My aim is to sum the scores for every name and order the results from highest to lowest score. So in this case, I want the following output:

 d         3
 b         2
 a         1
 c         0

I do not know in advance which names will appear in the file.

I was wondering if there is an efficient way to do this. My text file can contain up to 50,000 entries.

The only way I can think of is to start at line 1, remember that name, then go over the whole file looking for that name and summing its scores. That looks horribly inefficient, so I was wondering if there is a better way.
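In rough Python, the approach I have in mind would look something like this (a sketch only; it re-scans all lines for every distinct name, so it is O(n²) for n lines):

with open('my_file.txt') as f:
    lines = [line.split() for line in f]

totals = {}
for name, _ in lines:
    if name not in totals:
        # Re-scan every line to sum this name's scores.
        totals[name] = sum(int(score) for n, score in lines if n == name)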

Nigel Overmars asked Dec 04 '15

2 Answers

Read all data into a dictionary:

from collections import defaultdict
from operator import itemgetter

# Sum the scores per name in a single pass over the file.
scores = defaultdict(int)
with open('my_file.txt') as fobj:
    for line in fobj:
        name, score = line.split()
        scores[name] += int(score)

and the sorting:

# Sort by score (the second item), highest first.
for name, score in sorted(scores.items(), key=itemgetter(1), reverse=True):
    print(name, score)

prints:

d 3
b 2
a 1
c 0
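Note that Python's sort is stable, so names with equal scores keep the order in which they were first seen. If you want ties broken alphabetically instead, sort on a combined key:

# Descending score first, then ascending name for equal scores.
for name, score in sorted(scores.items(), key=lambda item: (-item[1], item[0])):
    print(name, score)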

Performance

To compare the performance of this answer with the one from @SvenMarnach, I put both approaches into a function. Here fobj is a file object opened for reading. I use io.StringIO so that IO delays are, hopefully, not part of the measurement:

from collections import Counter

def counter(fobj):
    # Accumulate with Counter.update(), sort via most_common().
    scores = Counter()
    fobj.seek(0)
    for line in fobj:
        key, score = line.split()
        scores.update({key: int(score)})
    return scores.most_common()

from collections import defaultdict
from operator import itemgetter

def default(fobj):
    # Accumulate with defaultdict and +=, sort explicitly.
    scores = defaultdict(int)
    fobj.seek(0)
    for line in fobj:
        name, score = line.split()
        scores[name] += int(score)
    return sorted(scores.items(), key=itemgetter(1), reverse=True)
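For reference, a file object for such a test can be created along these lines (an illustrative setup; the exact data behind the timings below is not shown):

import io

# In-memory file object, so disk IO does not skew the timings.
# The data here is just the question's sample, repeated.
data = 'a 0\na 1\nb 0\nc 0\nd 3\nb 2\n' * 100
fobj = io.StringIO(data)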

Results for collections.Counter:

%timeit counter(fobj)
10000 loops, best of 3: 59.1 µs per loop

Results for collections.defaultdict:

%timeit default(fobj)
10000 loops, best of 3: 15.8 µs per loop

It looks like defaultdict is about four times faster. I would not have guessed that, but when it comes to performance you need to measure.
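A likely reason, presumably, is that every scores.update({key: int(score)}) call builds a throwaway one-entry dict and goes through Counter's generic update machinery, while defaultdict's += is a plain dict operation. Counter also supports direct item assignment, so a hybrid along these lines should close most of the gap:

def counter_direct(fobj):
    # Use Counter like a plain dict: += avoids the per-line
    # temporary dict that Counter.update() needs.
    scores = Counter()
    fobj.seek(0)
    for line in fobj:
        key, score = line.split()
        scores[key] += int(score)
    return scores.most_common()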

Mike Müller answered Oct 19 '22

This is a good use case for collections.Counter:

from collections import Counter

# Counter sums the per-name scores; most_common() later
# returns them ordered from highest to lowest total.
scores = Counter()
with open('my_file') as f:
    for line in f:
        key, score = line.split()
        scores.update({key: int(score)})

for key, score in scores.most_common():
    print(key, score)
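As an aside, most_common() also accepts an optional count if you only need the top few names:

# Print only the two highest-scoring names.
for key, score in scores.most_common(2):
    print(key, score)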
Sven Marnach answered Oct 19 '22