I'm an inexperienced programmer working through a number of bioinformatics exercises in Python.
One problem asks me to count the elements in the set intersection of each pair of name groups and store those counts in a dictionary. There are two lists of 2000 name groups each; the names in the name groups are Latin names of species. For example:
list_of_name_groups_1 = [
    ['Canis Lupus', 'Canis Latrans'],
    ['Euarctos Americanus', 'Lynx Rufus'],
    ...
]
list_of_name_groups_2 = [
    ['Nasua Narica', 'Odocoileus Hemionus'],
    ['Felis Concolor', 'Peromyscus Eremicus'],
    ['Canis Latrans', 'Cervus Canadensis'],
    ...
]
And I need a dictionary that contains all intersection sizes between the name groups, e.g.
>>> intersections
{(0, 0): 0, (0, 1): 0, (0, 2): 1, (1, 0): 0, (1, 1): 0, (1, 2): 0, ...}
('Canis Latrans' occurs in element 0 of the first list and in element 2 of the second list, so intersections[(0, 2)] is 1.)
I've got an implementation of an algorithm that works, but it runs too slowly.
overlap = {}
for i in list_of_name_groups_1:
    for j in list_of_name_groups_2:
        overlap[(i, j)] = len(set(i) & set(j))
Is there a faster way to count the number of elements in set intersections?
First, Python sets are good at finding intersections (they use hashing), but your code constructs the same sets over and over again. E.g. if the two lists contain 2000 elements each [Did you mean the outer or inner lists are that long?], there are only 4000 different sets to compute, but your code computes 2000 x 2000 x 2 = 8 million sets.
So compute those 4000 sets once:
list_of_name_tuples_1 = [("a", "aa"), ("b", "bbb"), ("c", "cc", "ccc")]
list_of_name_tuples_2 = [("a", "AA"), ("b", "BBB"), ("c", "cc", "CCC")]
name_sets_1 = [set(i) for i in list_of_name_tuples_1]
name_sets_2 = [set(i) for i in list_of_name_tuples_2]
overlap = {}
for l1, s1 in zip(list_of_name_tuples_1, name_sets_1):
    for l2, s2 in zip(list_of_name_tuples_2, name_sets_2):
        overlap[(l1, l2)] = len(s1 & s2)
Python lists are unhashable, thus they can't be used for dict keys, so I changed the lists-of-lists-of-names into lists-of-tuples-of-names.
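A quick REPL check shows the difference:
>>> {['Canis Lupus', 'Canis Latrans']: 0}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> {('Canis Lupus', 'Canis Latrans'): 0}
{('Canis Lupus', 'Canis Latrans'): 0}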
(This code assumes you're using Python 3, where zip() returns an iterator. If you're using Python 2, then call itertools.izip() to get an iterator over the paired elements.)
Second, consider restructuring overlap as a dict of dicts rather than a dict indexed by tuples.
list_of_name_tuples_1 = [("a", "aa"), ("b", "bbb"), ("c", "cc", "ccc")]
list_of_name_tuples_2 = [("a", "AA"), ("b", "BBB"), ("c", "cc", "CCC")]
name_sets_1 = [set(i) for i in list_of_name_tuples_1]
name_sets_2 = [set(i) for i in list_of_name_tuples_2]
overlap = {}
for l1, s1 in zip(list_of_name_tuples_1, name_sets_1):
    d = overlap.setdefault(l1, {})
    for l2, s2 in zip(list_of_name_tuples_2, name_sets_2):
        d[l2] = len(s1 & s2)
This could save a lot of work in the follow-on code, which would access counts via overlap[l1][l2] instead of overlap[(l1, l2)] (no tuple construction or hash generation), and nested loops could fetch d = overlap[l1] in an outer loop, then access d[l2] in an inner loop.
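For instance, follow-on code over the dict-of-dicts version might look like this (my sketch, reusing the variables from the block above; the print stands in for whatever per-pair work you actually need):
for l1 in list_of_name_tuples_1:
    d = overlap[l1]  # one outer lookup; no (l1, l2) tuple built or hashed
    for l2 in list_of_name_tuples_2:
        if d[l2] > 0:  # plain key lookup in the inner loop
            print(l1, l2, d[l2])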
This depends on the nature of your data, but may give you a big saving if there are relatively few Latin names common to both super-lists. The method is:
1. Find the names that occur in both super-lists.
2. Record which name groups in each super-list contain any of those common names; these are the only candidate pairs.
3. Compute set intersections for candidate pairs only, storing the counts in a defaultdict so that every other pair reads as 0.
Your own solution is slow because of the number of set operations you perform: 2,000 x 2,000 == 4,000,000. No matter how quickly Python performs each one, it's going to take time. My method reduces the number of set intersections computed by a factor of 1000, at the expense of some other, smaller calculations.
My back-of-envelope calculation is that you might improve performance by a factor of 4 or better if there are relatively few common names; the fewer common names there are, the greater the improvement.
I've used a few things here that may be new to you: generator expressions, enumerate(), defaultdict, list membership testing with in, and the itertools methods. That chucks you in at the deep end. Happy researching, and let me know if you'd like some explanations.
from collections import defaultdict
import itertools
list_of_name_groups_1 = [
    ['Canis Lupus', 'Canis Latrans'],
    ['Euarctos Americanus', 'Lynx Rufus'],
]
list_of_name_groups_2 = [
    ['Nasua Narica', 'Odocoileus Hemionus'],
    ['Felis Concolor', 'Peromyscus Eremicus'],
    ['Canis Latrans', 'Cervus Canadensis'],
]
def flatten(list_of_lists):
    return itertools.chain.from_iterable(list_of_lists)

def unique_names(list_of_name_groups):
    return set(flatten(list_of_name_groups))

def get_matching_name_groups(name, list_of_name_groups):
    return (list_index for list_index, name_group
            in enumerate(list_of_name_groups)
            if name in name_group)
list1_candidates = set()
list2_candidates = set()
common_names = unique_names(list_of_name_groups_1) & unique_names(list_of_name_groups_2)
for name in common_names:
    list1_candidates.update(get_matching_name_groups(name, list_of_name_groups_1))
    list2_candidates.update(get_matching_name_groups(name, list_of_name_groups_2))
intersections = defaultdict(lambda: 0)
for i, j in itertools.product(list1_candidates, list2_candidates):
    intersections[(i, j)] = len(set(list_of_name_groups_1[i]) & set(list_of_name_groups_2[j]))
print(intersections)
$ python intersections.py
defaultdict(<function <lambda> at 0x0000000000DC7620>, {(0, 2): 1})
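If you want to see how much work the candidate filtering saves on your data, and to confirm it agrees with the brute-force result, here's a small check you could append after print(intersections) (my addition, not part of the answer above; it assumes the variables defined there are still in scope):
# How many intersections does each approach compute?
num_brute_force = len(list_of_name_groups_1) * len(list_of_name_groups_2)
num_filtered = len(list1_candidates) * len(list2_candidates)
print(f"brute force: {num_brute_force} intersections; "
      f"candidate pairs only: {num_filtered}")

# Sanity check: the filtered result must agree with brute force on every pair.
# .get() is used so the lookups don't grow the defaultdict.
for i, group_1 in enumerate(list_of_name_groups_1):
    for j, group_2 in enumerate(list_of_name_groups_2):
        assert intersections.get((i, j), 0) == len(set(group_1) & set(group_2))
print("results verified against brute force")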