How to speed up 4 million set intersections?

Question

I'm an inexperienced programmer working through a number of bioinformatics exercises in Python.

One problem area counts elements in the set intersection between name groups, and stores that count in a dictionary. There are two lists of 2000 name groups each; names in the name groups are Latin names of species. For example:

list__of_name_groups_1 = [
    ['Canis Lupus', 'Canis Latrans'],
    ['Euarctos Americanus', 'Lynx Rufus'],
    ...
]
list__of_name_groups_2 = [
    ['Nasua Narica', 'Odocoileus Hemionus'],
    ['Felis Concolor', 'Peromyscus Eremicus'],
    ['Canis Latrans', 'Cervus Canadensis']
    ...
]

And I need a dictionary that contains all intersection sizes between the name groups, e.g.

>>> intersections
{ (0, 0): 0, (0, 1): 0, (0, 2): 1, (1, 0): 0, (1, 1): 0, (2, 1): 0,
  (2, 0): 1, (2, 1): 0, (2, 2): 0 }

('Canis Latrans' occurs in element 0 in the first list, element 2 in the second list.)

I've got an implementation of an algorithm that works, but it runs too slowly.

overlap = {}
    for i in list_of_lists_of_names_1:            
        for j in list_of_lists_of_names_2:
            overlap[(i,j)] = len(set(i) & set(j))

Is there a faster way to count the number of elements in set intersections?

(Hello moderators... Nick, this revised post is actually asking a slightly different question than the one I'm working on. While your answer is a very good one for addressing that question, I'm afraid that the method you suggest is actually not useful for what I'm trying to do. I very much appreciate the time and effort you put into your answer, and into editing this post, but I would request that the post be reverted to the original.)

Jerry101 · Accepted Answer

First, Python sets are good at finding intersections (they use hashing), but your code constructs the same sets over and over again. E.g. if the two lists contain 2000 elements each [Did you mean the outer or inner lists are that long?], there are only 4000 different sets to compute but your code computes 2000 x 2000 x 2 = 8 million sets.

So compute those 4000 sets once:

list_of_name_tuples_1 = [("a", "aa"), ("b", "bbb"), ("c", "cc", "ccc")]
list_of_name_tuples_2 = [("a", "AA"), ("b", "BBB"), ("c", "cc", "CCC")]
name_sets_1 = [set(i) for i in list_of_name_tuples_1]
name_sets_2 = [set(i) for i in list_of_name_tuples_2]

overlap = {}
for l1, s1 in zip(list_of_name_tuples_1, name_sets_1):
    for l2, s2 in zip(list_of_name_tuples_2, name_sets_2):
        overlap[(l1, l2)] = len(s1 & s2)

Python lists are unhashable, thus they can't be used for dict keys, so I changed the lists-of-lists-of-names into lists-of-tuples-of-names.

(This code assumes you're using Python 3, where zip() returns an iterator. If you're using Python 2, then call itertools.izip() to get an iterator over the paired elements.)

Second, consider restructuring overlap as a dict of dicts rather than a dict indexed by tuples.

list_of_name_tuples_1 = [("a", "aa"), ("b", "bbb"), ("c", "cc", "ccc")]
list_of_name_tuples_2 = [("a", "AA"), ("b", "BBB"), ("c", "cc", "CCC")]
name_sets_1 = [set(i) for i in list_of_name_tuples_1]
name_sets_2 = [set(i) for i in list_of_name_tuples_2]

overlap = {}
for l1, s1 in zip(list_of_name_tuples_1, name_sets_1):
    d = overlap.setdefault(l1, {})
    for l2, s2 in zip(list_of_name_tuples_2, name_sets_2):
        d[l2] = len(s1 & s2)

This could save a lot of work in the follow-on code which would access it via overlap[l1][l2] instead of overlap[(l1, l2)] (without tuple construction or hash generation), and nested loops could fetch d = overlap[l1] in an outer loop then access d[l2] in an inner loop.

Nick · Answer

This depends on the nature of your data, but may give you a big saving if there are relatively few Latin names common to both super-lists. The method is:

Find common names between the lists
compute set intersections only between name groups that contain one of those common names
The rest of the set intersection counts will be 0.

Your own solution is slow because of the number of set operations you perform: 2,000 x 2,000 == 4,000,000. No matter how quickly Python performs each one, it's going to take time. My method reduces the number of set intersections computed by a factor of 1000, at the expense of some other, smaller calculations.

My back-of-envelope calculation is that you might improve performance by a factor of 4 or better if there are relatively few common names. The improvement will be more the fewer common names there are.

I've used a few things here that may be new to you: list comprehensions and enumerate(), defaultdict, list membership using in and the itertools methods. That chucks you in at the deep end. Happy researching, and let me know if you'd like some explanations.

from collections import defaultdict
import itertools

list_of_name_groups_1 = [
    ['Canis Lupus', 'Canis Latrans'],
    ['Euarctos Americanus', 'Lynx Rufus'],
]
list_of_name_groups_2 = [
    ['Nasua Narica', 'Odocoileus Hemionus'],
    ['Felis Concolor', 'Peromyscus Eremicus'],
    ['Canis Latrans', 'Cervus Canadensis']
]

def flatten(list_of_lists):
    return itertools.chain.from_iterable(list_of_lists)


def unique_names(list_of_name_groups):
    return set(flatten(list_of_name_groups))


def get_matching_name_groups(name, list_of_name_groups):
    return (list_index for list_index, name_group
            in enumerate(list_of_name_groups)
            if name in name_group)

list1_candidates = set()
list2_candidates = set()
common_names = unique_names(list_of_name_groups_1) & unique_names(list_of_name_groups_2)
for name in common_names:
    list1_candidates.update(tuple(get_matching_name_groups(name, list_of_name_groups_1)))
    list2_candidates.update(tuple(get_matching_name_groups(name, list_of_name_groups_2)))
intersections = defaultdict(lambda: 0)
for i, j in itertools.product(list1_candidates, list2_candidates):
        intersections[(i, j)] = len(set(list_of_name_groups_1[i]) & set(list_of_name_groups_2[j]))
print(intersections)

>>> python intersections.py
defaultdict(<function <lambda> at 0x0000000000DC7620>, {(0, 2): 1})

How to speed up 4 million set intersections?

Tags:

python

set

bioinformatics

Brian

2 Answers

Jerry101

Nick

Recent Activity

Donate For Us

How to speed up 4 million set intersections?

Tags:

python

set

bioinformatics

Brian

2 Answers

Jerry101

Nick

Related questions

Recent Activity

Donate For Us