Problem
Given a list of string find the strings from the list that appear in the given text.
Example
list = ['red', 'hello', 'how are you', 'hey', 'deployed']
text = 'hello, This is shared right? how are you doing tonight'
result = ['red', 'how are you', 'hello']
'red' because it has 'shared' has 'red' as a substring
Solution
Psuedo-Code
def FindWord (trie, text, word_so_far, index):
index > len(text)
return
//Check if the word_so_far is a prefix of a key; if not return
if trie.has_subtrie(word) == false:
return
//Check if the word_so_far is a key; if ye add to result and look further
if trie.has_key(word) == false:
// Add to result and continue
//extend the current word we are searching
FindWord (trie, text, word_so_far + text[index], index + 1)
//start new from the next index
FindWord (trie, text, "", index + 1)
The problem with this is although the runtime now depends on the len(text)
it runs with a time complexity O(2^n)
after building the trie which is a one time thing for multiple texts so it's fine.
I do not see any overlapping subproblems either to memoize and improve the runtime.
Can you suggest any way I can achieve a runtime that depends on given text as opposed to the list of words which can be per-processed and cached and is also faster that this.
Using any() to check if string contains element from list. Using any function is the most classical way in which you can perform this task and also efficiently. This function checks for match in string with match of each element of list.
Hold the Ctrl keyboard key and press the F keyboard key (Ctrl+F) or right-click (click the right mouse button) somewhere on the article and select Find (in this article).
Python String split() MethodThe split() method splits a string into a list. You can specify the separator, default separator is any whitespace. Note: When maxsplit is specified, the list will contain the specified number of elements plus one.
The simplest approach provided by Python to convert the given list of Sentences into words with separate indices is to use split() method. This method split a string into a list where each word is a list item.
The theoretically sound version of what you're trying to do is called Aho--Corasick. Implementing the suffix links is somewhat complicated IIRC, so here's an algorithm that just uses the trie.
We consume the text letter by letter. At all times, we maintain a set of nodes in the trie where the traversal can be. Initially this set consists of just the root node. For each letter, we loop through the nodes in the set, descending via the new letter if possible. If the resulting node is a match, great, report it. Regardless, put it in the next set. The next set also contains the root node, since we can start a new match at any time.
Here's my attempt at a quick implementation in Python (untested, no warranty, etc.).
class Trie:
def __init__(self):
self.is_needle = False
self._children = {}
def find(self, text):
node = self
for c in text:
node = node._children.get(c)
if node is None:
break
return node
def insert(self, needle):
node = self
for c in needle:
node = node._children.setdefault(c, Trie())
node.is_needle = True
def count_matches(needles, text):
root = Trie()
for needle in needles:
root.insert(needle)
nodes = [root]
count = 0
for c in text:
next_nodes = [root]
for node in nodes:
next_node = node.find(c)
if next_node is not None:
count += next_node.is_needle
next_nodes.append(next_node)
nodes = next_nodes
return count
print(
count_matches(['red', 'hello', 'how are you', 'hey', 'deployed'],
'hello, This is shared right? how are you doing tonight'))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With