
Search a list of strings for any sub-string from another list

Tags:

python

Given these 3 lists of data and a list of keywords:

good_data1 = ['hello, world', 'hey, world']
good_data2 = ['hey, man', 'whats up']
bad_data = ['hi, earth', 'sup, planet']
keywords = ['world', 'he']

I'm trying to write a simple function to check if any of the keywords exist as a substring of any word in the data lists. It should return True for the good_data lists and False for bad_data.

I know how to do this in what seems to be an inefficient way:

def checkData(data):
    for s in data:
        for k in keywords:
            if k in s:
                return True
    return False
Asked Apr 14 '09 by Jason Coon

4 Answers

Are you looking for

any(k in s for k in keywords)

It's more compact, but might be less efficient.
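For example, both of the question's loops can be folded into a single generator expression (a minimal sketch using the question's sample data):

```python
keywords = ['world', 'he']
good_data1 = ['hello, world', 'hey, world']
bad_data = ['hi, earth', 'sup, planet']

def check_data(data):
    # any() short-circuits: it returns True at the first keyword hit.
    return any(k in s for s in data for k in keywords)

print(check_data(good_data1))  # True
print(check_data(bad_data))    # False
```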

Answered by S.Lott


In your example, with so few items, it doesn't really matter. But if you have a list of several thousand items, this might help.

Since you don't care which element in the list contains the keyword, you can scan the whole list once (as one string) instead of one item at a time. For that you need a join character that you know won't occur in any keyword, in order to avoid false positives. I use the newline in this example.

def check_data(data):
    s = "\n".join(data)
    for k in keywords:
        if k in s:
            return True

    return False

In my completely unscientific test, my version checked a list of 5000 items 100000 times in about 30 seconds. I stopped your version after 3 minutes -- got tired of waiting to post =)
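A comparison like that can be reproduced with `timeit` (the list size and iteration count below are illustrative, not the original benchmark, and absolute numbers will vary by machine):

```python
import timeit

keywords = ['world', 'he']
bad_data = ['hi, earth', 'sup, planet'] * 2500   # 5000 items, no matches

def check_original(data):
    # The question's version: one "in" check per item per keyword.
    for s in data:
        for k in keywords:
            if k in s:
                return True
    return False

def check_joined(data):
    # The join version: one "in" scan per keyword over the whole string.
    s = "\n".join(data)
    return any(k in s for k in keywords)

print(timeit.timeit(lambda: check_original(bad_data), number=100))
print(timeit.timeit(lambda: check_joined(bad_data), number=100))
```

Non-matching data is the worst case for the per-item version, since it can never return early.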

Answered by gnud


If you have many keywords, you might want to try a suffix tree [1]. Insert all the words from the three data lists, storing which list each word comes from in its terminating node. Then you can perform queries on the tree for each keyword really, really fast.

Warning: suffix trees are very complicated to implement!

[1] http://en.wikipedia.org/wiki/Suffix_tree
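A real suffix tree takes serious effort, but the core idea can be sketched with a naive, uncompressed suffix trie (the function names here are illustrative; this stores every suffix explicitly, so it is quadratic in space and only suitable for small data):

```python
def build_suffix_trie(words):
    # Nested dicts are trie nodes; each key is one character.
    root = {}
    for w in words:
        for i in range(len(w)):        # insert every suffix of every word
            node = root
            for ch in w[i:]:
                node = node.setdefault(ch, {})
    return root

def contains_keyword(trie, keyword):
    # A keyword is a substring of some word iff it is a path from the root.
    node = trie
    for ch in keyword:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie(['hello, world', 'hey, world'])
print(contains_keyword(trie, 'world'))  # True
print(contains_keyword(trie, 'earth'))  # False
```

Building is slow, but each keyword query then costs only O(len(keyword)), independent of the data size.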

Answered by moinudin


You may be able to improve matters by building your list of keywords as a regular expression.

This may allow them to be tested in parallel, but will very much depend on what the keywords are (e.g. some work may be reused testing for "hello" and "hell", rather than searching every phrase from the start for each word).

You could do this by executing:

import re
keyword_re = re.compile("|".join(map(re.escape, keywords)))

Then:

>>> bool(keyword_re.search('hello, world'))
True
>>> bool(keyword_re.search('hi, earth'))
False

(It will actually return a match object if found, and None if not found - this might be useful if you need to know which keyword matched.)

However, how much (if anything) this gains you will depend on the keywords. If you only have one or two, keep your current approach. If you have a large list, it may be worth trying and profiling to see which performs better.

[Edit] For reference, here's how the approaches do for your example:

               good1   good2  good3   bad1   bad2
original     : 0.206   0.233  0.229   0.390   63.879
gnud (join)  : 0.257   0.347  4.600   0.281    6.706
regex        : 0.766   1.018  0.397   0.764  124.351
regex (join) : 0.345   0.337  3.305   0.481   48.666

Obviously for this case, your approach performs far better than the regex one. Whether this will always be the case depends a lot on the number and complexity of keywords, and the input data that will be checked. For large numbers of keywords, and lengthy lists or rarely matching phrases, regexes may work better, but do get timing information, and perhaps try even simpler optimisations (like moving the most common words to the front of your keyword list) first. Sometimes the simplest approach really is the best.
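The "regex (join)" row in the table combines both ideas, running a single regex search over the joined list; a minimal sketch of that combination (the function name is illustrative):

```python
import re

keywords = ['world', 'he']
good_data1 = ['hello, world', 'hey, world']
bad_data = ['hi, earth', 'sup, planet']

# Alternation of escaped keywords, as in the answer above.
keyword_re = re.compile("|".join(map(re.escape, keywords)))

def check_regex_join(data):
    # One regex search over the whole joined list instead of one per item.
    return bool(keyword_re.search("\n".join(data)))

print(check_regex_join(good_data1))  # True
print(check_regex_join(bad_data))    # False
```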

[Edit2] Updated the table with gnud's solution, and a similar approach before applying the regexes. I also added 2 new tests:

good_data3 = good_data2 * 500  # 1000 items, the first of which matches.
bad_data2 = bad_data * 500     # 1000 items, none of which matches.

These show the various strengths and weaknesses. Joining does worse when a match would be found immediately (since there is an always-paid, up-front cost in joining the list, and this is the best possible case for the linear search method); however, for non-matching lists it performs better - much better when there are a large number of items in the list.

Answered by Brian