Problem Given a list of string find the strings from the list that appear in the given text. Example <pre class="prettyprint"><code>list = ['red', 'hello', 'how are you', 'hey', 'deployed'] text = 'hello, This is shared right? how are you doing tonight' result = ['red', 'how are you', 'hello'] </code></pre> 'red' because it has 'shared' has 'red' as a substring <ul> <li>This is very similar to this question except that the word we need to look for can be substring as well. </li> <li>The list is pretty large and increases with increase in the users as opposed to text which is pretty much the same length throughout.</li> <li>I was thinking of having a solution where the time complexity depends on from the length of the text rather that list of words so that it can be scalable even when lots of users are added. </li> </ul> Solution <ul> <li>I build a trie from the give list of words </li> <li>Run dfs on the text and check the current word against the trie</li> </ul> Psuedo-Code <pre class="prettyprint"><code>def FindWord (trie, text, word_so_far, index): index > len(text) return //Check if the word_so_far is a prefix of a key; if not return if trie.has_subtrie(word) == false: return //Check if the word_so_far is a key; if ye add to result and look further if trie.has_key(word) == false: // Add to result and continue //extend the current word we are searching FindWord (trie, text, word_so_far + text[index], index + 1) //start new from the next index FindWord (trie, text, "", index + 1) </code></pre> The problem with this is although the runtime now depends on the <code>len(text)</code> it runs with a time complexity <code>O(2^n)</code> after building the trie which is a one time thing for multiple texts so it's fine. I do not see any overlapping subproblems either to memoize and improve the runtime. Can you suggest any way I can achieve a runtime that depends on given text as opposed to the list of words which can be per-processed and cached and is also faster that this.

The theoretically sound version of what you're trying to do is called Aho--Corasick. Implementing the suffix links is somewhat complicated IIRC, so here's an algorithm that just uses the trie. We consume the text letter by letter. At all times, we maintain a set of nodes in the trie where the traversal can be. Initially this set consists of just the root node. For each letter, we loop through the nodes in the set, descending via the new letter if possible. If the resulting node is a match, great, report it. Regardless, put it in the next set. The next set also contains the root node, since we can start a new match at any time. Here's my attempt at a quick implementation in Python (untested, no warranty, etc.). <pre class="prettyprint"><code>class Trie: def __init__(self): self.is_needle = False self._children = {} def find(self, text): node = self for c in text: node = node._children.get(c) if node is None: break return node def insert(self, needle): node = self for c in needle: node = node._children.setdefault(c, Trie()) node.is_needle = True def count_matches(needles, text): root = Trie() for needle in needles: root.insert(needle) nodes = [root] count = 0 for c in text: next_nodes = [root] for node in nodes: next_node = node.find(c) if next_node is not None: count += next_node.is_needle next_nodes.append(next_node) nodes = next_nodes return count print( count_matches(['red', 'hello', 'how are you', 'hey', 'deployed'], 'hello, This is shared right? how are you doing tonight')) </code></pre>

Given a list of words and a sentence find all words that appear in the sentence either in whole or as a substring

Tags:

python

algorithm

search

trie

Problem

Given a list of string find the strings from the list that appear in the given text.

Example

list = ['red', 'hello', 'how are you', 'hey', 'deployed']
text = 'hello, This is shared right? how are you doing tonight'
result = ['red', 'how are you', 'hello']

'red' because it has 'shared' has 'red' as a substring

This is very similar to this question except that the word we need to look for can be substring as well.
The list is pretty large and increases with increase in the users as opposed to text which is pretty much the same length throughout.
I was thinking of having a solution where the time complexity depends on from the length of the text rather that list of words so that it can be scalable even when lots of users are added.

Solution

I build a trie from the give list of words
Run dfs on the text and check the current word against the trie

Psuedo-Code

def FindWord (trie, text, word_so_far, index):
    index > len(text)
        return
    //Check if the word_so_far is a prefix of a key; if not return
    if trie.has_subtrie(word) == false:
       return 
    //Check if the word_so_far is a key; if ye add to result and look further 
    if trie.has_key(word) == false:
        // Add to result and continue
    //extend the current word we are searching
    FindWord (trie, text, word_so_far + text[index], index + 1)
    //start new from the next index 
    FindWord (trie, text, "", index + 1)

The problem with this is although the runtime now depends on the len(text) it runs with a time complexity O(2^n) after building the trie which is a one time thing for multiple texts so it's fine.

I do not see any overlapping subproblems either to memoize and improve the runtime.

Can you suggest any way I can achieve a runtime that depends on given text as opposed to the list of words which can be per-processed and cached and is also faster that this.

527

asked Jan 15 '19 16:01

thebenman

1 Answers

The theoretically sound version of what you're trying to do is called Aho--Corasick. Implementing the suffix links is somewhat complicated IIRC, so here's an algorithm that just uses the trie.

We consume the text letter by letter. At all times, we maintain a set of nodes in the trie where the traversal can be. Initially this set consists of just the root node. For each letter, we loop through the nodes in the set, descending via the new letter if possible. If the resulting node is a match, great, report it. Regardless, put it in the next set. The next set also contains the root node, since we can start a new match at any time.

Here's my attempt at a quick implementation in Python (untested, no warranty, etc.).

class Trie:
    def __init__(self):
        self.is_needle = False
        self._children = {}

    def find(self, text):
        node = self
        for c in text:
            node = node._children.get(c)
            if node is None:
                break
        return node

    def insert(self, needle):
        node = self
        for c in needle:
            node = node._children.setdefault(c, Trie())
        node.is_needle = True


def count_matches(needles, text):
    root = Trie()
    for needle in needles:
        root.insert(needle)
    nodes = [root]
    count = 0
    for c in text:
        next_nodes = [root]
        for node in nodes:
            next_node = node.find(c)
            if next_node is not None:
                count += next_node.is_needle
                next_nodes.append(next_node)
        nodes = next_nodes
    return count


print(
    count_matches(['red', 'hello', 'how are you', 'hey', 'deployed'],
                  'hello, This is shared right? how are you doing tonight'))

answered Oct 17 '22 18:10

David Eisenstat

Related questions
                            
                                Python 3.5 vs. 3.6 what made "map" slower compared to comprehensions
                            
                                python-ldap add_s fails to add attribute for AD user with OBJECT_CLASS_VIOLATION
                            
                                'io.h': No such file or directory during "pip install netifaces"
                            
                                What's the correct way to compute a confusion matrix for object detection?
                            
                                Python - "xor"ing each byte in "bytes" in the most efficient way
                            
                                VS Code run ipython in debug console
                            
                                How to use tensorflow debugging tool tfdbg on tf.estimator in Tensorflow?
                            
                                Python install error "A newer version of the Python launcher is already installed"
                            
                                Detecting paste in python
                            
                                Bundling Python3 packages for PySpark results in missing imports
                            
                                Training a RNN to output word2vec embedding instead of logits
                            
                                How to unit test Neo4j in python ?
                            
                                How to pass the arguments to the new_callable from mock.patch?
                            
                                Remove error from mypy for attributes set dynamically in a Python class
                            
                                combine several GIF horizontally - python
                            
                                Error while installing debian packages programmitically using apt_pkg
                            
                                How to create Keras model with optional inputs
                            
                                How can I wait until I receive data using a Python socket?
                            
                                vscode python remote interpreter
                            
                                Pip install - do downloaded whl files persist & take disk space?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With