 

Fastest algorithm for finding overlap between two very large lists?

I'm trying to build an algorithm in Python to filter a large block of RDF data.

I have one list consisting of about 70 thousand items formatted like <"datum">.

I then have about 6GB worth of items (triples) formatted like <"A"> <"B"> <"C">.

I want to extract all the triples that contain any item from the first list, and then extract any triples that contain any individual item from that first extraction (the net effect is to carve out the partition of the graph that is connected by one step to the seeds in the first list).
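For concreteness, here is a minimal sketch of that two-pass idea in plain Python (the names seeds, triples, and two_pass_extract are mine, not from any library; triples is assumed to be a re-iterable sequence of three-item string tuples, which glosses over the fact that 6GB won't fit comfortably in memory):

def two_pass_extract(seeds, triples):
    seed_set = set(seeds)
    # Pass 1: keep triples that contain any seed item
    first_pass = [t for t in triples if any(item in seed_set for item in t)]
    # Collect every item that appears in the first extraction
    expanded = {item for t in first_pass for item in t}
    # Pass 2: keep triples that contain any item from the first extraction
    return [t for t in triples if any(item in expanded for item in t)]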

I haven't been able to come up with a great algorithm for this (not helped by the fact that I have no formal CS training).

The best I've come up with so far is to split the triples in the big list into a list of three-item lists [<"A">, <"B">, <"C">]. I then split that into chunks and use multiprocessing to create processes that each take the full small list and a chunk of the big list and run:

for line in big_list:
    for item in small_list:
        if item in line:
            bucket.append(line)
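In full, the multiprocessing wrapper looks roughly like this (the chunk size and the name scan_chunk are arbitrary; this is just a sketch of what I'm doing now, not something I'm recommending):

from multiprocessing import Pool

def scan_chunk(args):
    small_list, chunk = args
    bucket = []
    for line in chunk:
        for item in small_list:
            if item in line:
                bucket.append(line)
                break  # one match is enough for this line
    return bucket

# Split big_list into chunks and scan them in parallel
chunks = [big_list[i:i + 10000] for i in range(0, len(big_list), 10000)]
with Pool() as pool:
    parts = pool.map(scan_chunk, [(small_list, c) for c in chunks])
bucket = [line for part in parts for line in part]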

This algorithm takes quite a while, since it compares every line against every item in the small list.

Is there any faster way to do this? If there's a specific algorithm, you can just give me the name and I'll figure out how to implement it.

Thanks!

Clarifications per comments:

  1. All the data items are strings. So the small list might contain ["Mickey", "Mouse", "Minny", "Cat"] and the big list might be [["Mickey", "Pluto", "Bluto"], ["John", "Jane", "Jim"], ...]

  2. Only one item in each big-list triple needs to match an item from the small list for that triple to count.

  3. All of the items in the small list are actually unique, so I didn't think converting them to a set would help. But I will try that.

  4. I can create whatever intermediate structures I want. I'm experimenting with an inverted index constructed using a shelve right now (a rough sketch of what I mean follows below).
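For reference, here is a rough sketch of the shelve-backed index I'm experimenting with (the filename "index.db" is a placeholder, and triples is again assumed to be an iterable of three-item string tuples):

import shelve

# Build an on-disk index mapping each item to the positions of the
# triples that contain it, so lookups never need to scan everything
with shelve.open("index.db") as index:
    for pos, triple in enumerate(triples):
        for item in triple:
            postings = index.get(item, [])
            postings.append(pos)
            index[item] = postings  # reassign so shelve persists the change

Looking up a seed is then just index.get(seed, []), which returns the positions of candidate triples directly.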

asked May 03 '12 by rogueleaderr


1 Answer

You should probably first store the small list in a set, so membership lookups are faster. This avoids looping over all 70,000 small-list items for every line in big_list.

small_list_set = set(small_list)  # set membership is O(1) on average
for line in big_list:
    for item in line:
        if item in small_list_set:
            bucket.append(line)
            break  # one match is enough; avoid appending the same line twice
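With a set, each membership test is O(1) on average, so the scan costs roughly one lookup per item in big_list instead of len(big_list) * len(small_list) comparisons; with 70,000 items in the small list, that is an enormous saving.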
answered Sep 28 '22 by happydave