Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast String within List Searching

Using Python 3, I have a list containing over 100,000 strings (list1), each 300 characters long at most. I also have a list of over 9 million substrings (list2)--I want to count how many elements a substring in list2 appears in. For instance,

list1 = ['cat', 'caa', 'doa', 'oat']
list2 = ['at', 'ca', 'do']

I want the function to return (mapped to list2):

[2, 2, 1]

Normally, this is very simple and requires very little. However, due to the massive size of the lists, I have efficiency issues. I want to find the fastest way to return that counter list.

I've tried list comprehensions, generators, maps, loops of all kinds and I have yet to find an fast way to do this easy task. What is theoretically the fastest way to complete this objective, preferably taking O(len(list2)) steps very quickly?

like image 531
user1104160 Avatar asked Dec 18 '11 04:12

user1104160


1 Answers

set M = len(list1) and N = len(list2)

For each of the N entries in list2 you're going to have to do M comparisons against the entries in list1. That's a worst case running time of O(M x N). If you take it further, lets take each entry in list2 to be of length 1 and each entry in list1 to be of length 300, then you got a running time of O(300M x N).

If performance is really an issue, try dynamic programming. Here is a start:

1) sort list2 in ascending order of length like so:

['scorch', 'scorching', 'dump', 'dumpster', 'dumpsters']

2) sort them into sublists such that each preceeding entry is a subset of the proceeding entry like so:

[['scorch', 'scorching'] , ['dump', 'dumpster', 'dumpsters']]

3) Now if you compare against list1 and 'scorch' is not in there, then you don't have to search for 'scorching' either. Likewise if 'dump' is not in there, neither is 'dumpster' or 'dumpsters'

note the worst case run time is still the same

like image 84
puk Avatar answered Sep 21 '22 09:09

puk