I have a multi-column (13 columns), space-separated file (some 5 million+ lines) that looks like this:
W5 403 407 P Y 2 2 PR 22 PNIYR 22222 12.753 13.247
W5 404 408 N V 2 2 PR 22 PIYYR 22222 13.216 13.247
W3 274 276 E G 1 1 EG 11 EPG 121 6.492 6.492
W3 275 277 P R 2 1 PR 21 PGR 211 6.365 7.503
W3 276 278 G Y 1 1 GY 11 GRY 111 5.479 5.479
W3 46 49 G L 1 1 GY 11 GRY 111 5.176 5.176
W4 47 50 D K 1 1 DK 11 DILK 1111 4.893 5.278
W4 48 51 I K 1 1 IK 11 ILKK 1111 4.985 5.552
etc., etc.,
I'm interested in two of these columns (columns 8 and 11) and want to count the number of occurrences of each pair (column 8) together with the string that follows it (column 11).
For example (reference):
key 'GY' : # of occurrences of '111' : 2
key 'PR' : # of occurrences of '22222' : 2
key 'DK' : # of occurrences of '1111' : 1
key 'EG' : # of occurrences of '121' : 1
I have a basic dict-based implementation of it:
countshash = {}
for l in bigtable:  # bigtable is an iterable of lines, e.g. an open file handle
    cont = l.split()
    if cont[7] not in countshash:
        countshash[cont[7]] = {}
    if cont[10] not in countshash[cont[7]]:
        countshash[cont[7]][cont[10]] = 0
    countshash[cont[7]][cont[10]] += 1
I also have a simple awk-based count (which is super fast), but I was wondering whether there is an efficient and faster way to do this in Python. Thanks for your input.
I'm not sure if this will help with speed, but you are creating a ton of defaultdict-like objects, which I think you can make a bit more readable:
from collections import defaultdict

countshash = defaultdict(lambda: defaultdict(int))
for l in bigtable:
    cont = l.split()
    countshash[cont[7]][cont[10]] += 1
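After the loop, the result reads like an ordinary nested dict, so for the sample lines above you would expect, for instance:
print(countshash['GY']['111'])    # 2
print(dict(countshash['PR']))     # {'22222': 2, '211': 1}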
from collections import Counter

# [7:11:3] picks out indexes 7 and 10, i.e. columns 8 and 11; tuple() makes the pair hashable
Counter(tuple(row.split()[7:11:3]) for row in bigtable)
Using itemgetter is more flexible and may be more efficient than slicing:
from operator import itemgetter

ig = itemgetter(7, 10)
Counter(ig(row.split()) for row in bigtable)
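Note that these Counter versions key on (column 8, column 11) tuples rather than nesting dicts. A minimal self-contained sketch of the lookup, using two of the sample lines above:

from collections import Counter
from operator import itemgetter

bigtable = [
    "W3 276 278 G Y 1 1 GY 11 GRY 111 5.479 5.479",
    "W3 46 49 G L 1 1 GY 11 GRY 111 5.176 5.176",
]
ig = itemgetter(7, 10)                         # indexes 7 and 10 = columns 8 and 11
counts = Counter(ig(row.split()) for row in bigtable)
print(counts[('GY', '111')])                   # 2
print(counts.most_common(1))                   # [(('GY', '111'), 2)]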
Using imap can make things a tiny bit faster too:
from itertools import imap
Counter(imap(ig, imap(str.split, bigtable)))
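For what it's worth, itertools.imap only exists on Python 2; on Python 3 the built-in map is already lazy, so a sketch of the equivalent (with a hypothetical filename) would be:

from collections import Counter
from operator import itemgetter

ig = itemgetter(7, 10)
with open("bigtable.txt") as bigtable:  # hypothetical filename standing in for the 5M+ line file
    counts = Counter(map(ig, map(str.split, bigtable)))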