I'm searching for the best algorithm to solve this problem: given a list (or a dict, or a set) of short sentences, find all occurrences of these sentences in a bigger text. The sentences in the collection number about 600k and are formed, on average, of 3 words. The text is, on average, 25 words long. I've already normalized the text (removed punctuation, lowercased everything, and so on).
Here is what I have tried out (Python):
to_find_sentences = [
    'bla bla',
    'have a tea',
    'hy i m luca',
    'i love android',
    'i love ios',
    .....
]

text = 'i love android and i think i will have a tea with john'

def find_sentence(to_find_sentences, text):
    words = text.split()
    res = []
    w = len(words)
    # Try every contiguous span of words and test it against the collection.
    for i in range(w):
        for j in range(i + 1, w + 1):
            tmp = ' '.join(words[i:j])
            if tmp in to_find_sentences:
                res.append(tmp)
    return res

print(find_sentence(to_find_sentences, text))
Out:
['i love android', 'have a tea']
In my case I've used a set to speed up the in operation.
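For completeness, a minimal sketch of that conversion (assuming the list above is already loaded):

# Membership tests on a set are O(1) on average, versus O(n) on a 600k-item list.
to_find_sentences = set(to_find_sentences)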
A fast solution would be to build a Trie out of your sentences and convert this trie to a regex. For your example, the pattern would look like this:
(?:bla\ bla|h(?:ave\ a\ tea|y\ i\ m\ luca)|i\ love\ (?:android|ios))
You can see a visualization of this pattern on Debuggex.
It might be a good idea to add '\b' as word boundaries, to avoid matching "have a team".
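As a quick illustrative check of why the boundaries matter (a standalone sketch, independent of the trie code):

import re

# Without a trailing \b, 'have a tea' also matches inside 'have a team'.
print(re.findall(r"have\ a\ tea", "we have a team"))         # ['have a tea']
# With word boundaries, the partial-word match is rejected...
print(re.findall(r"\bhave\ a\ tea\b", "we have a team"))     # []
# ...while a real occurrence is still found.
print(re.findall(r"\bhave\ a\ tea\b", "i will have a tea"))  # ['have a tea']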
You'll need a small Trie script. It's not an official package yet, but you can simply download it and save it as trie.py in your current directory.
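If you can't fetch the script, here is a minimal, illustrative sketch of a trie with the same add()/pattern() interface. It is not the original trie.py (that one is more optimized, e.g. it collapses single characters into character classes), and the pattern it prints may differ slightly in formatting, but it produces an equivalent regex:

import re

class Trie:
    """Stores sentences in a nested-dict trie and serializes the trie
    into a single alternation regex."""

    def __init__(self):
        self.data = {}

    def add(self, sentence):
        # Walk/extend the nested dicts one character at a time.
        node = self.data
        for char in sentence:
            node = node.setdefault(char, {})
        node[''] = True  # the empty key marks the end of a complete sentence

    def pattern(self):
        return self._pattern(self.data)

    def _pattern(self, node):
        # Leaf: nothing left to match after this point.
        if '' in node and len(node) == 1:
            return ''
        alternatives = []
        optional = False
        for char in sorted(node):
            if char == '':
                optional = True  # a sentence ends here, longer ones continue
                continue
            alternatives.append(re.escape(char) + self._pattern(node[char]))
        result = alternatives[0] if len(alternatives) == 1 \
            else '(?:' + '|'.join(alternatives) + ')'
        if optional:
            result = '(?:' + result + ')?'
        return result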
You can then use this code to generate the trie/regex:
import re
from trie import Trie

to_find_sentences = [
    'bla bla',
    'have a tea',
    'hy i m luca',
    'i love android',
    'i love ios',
]

trie = Trie()
for sentence in to_find_sentences:
    trie.add(sentence)

print(trie.pattern())
# (?:bla\ bla|h(?:ave\ a\ tea|y\ i\ m\ luca)|i\ love\ (?:android|ios))

pattern = re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)
text = 'i love android and i think i will have a tea with john'

print(re.findall(pattern, text))
# ['i love android', 'have a tea']
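Because the trie and the compiled pattern are built only once, you can reuse them across all your short texts. A quick sketch with made-up inputs:

more_texts = [
    'i love ios more than i love android',
    'hy i m luca and i will have a tea',
]
for t in more_texts:
    print(re.findall(pattern, t))
# ['i love ios', 'i love android']
# ['hy i m luca', 'have a tea']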
You invest some time to create the Trie and the regex, but the processing should be extremely fast.
Here's a related answer (Speed up millions of regex replacements in Python 3) if you want more information.
Note that it wouldn't find overlapping sentences:
to_find_sentences = [
    'i love android',
    'android Marshmallow'
]
# ...

print(re.findall(pattern, "I love android Marshmallow"))
# ['I love android']
You'd have to modify the regex with positive lookaheads to find overlapping sentences.
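For example, wrapping the whole alternation in a zero-width lookahead with a capturing group makes the engine re-test every position, so overlapping matches are reported through the group (a sketch; with the trie you would wrap trie.pattern() the same way):

import re

# The lookahead consumes nothing, so scanning continues after each hit and
# overlapping sentences are captured by the group.
overlapping = re.compile(r"(?=\b(i\ love\ android|android\ Marshmallow)\b)",
                         re.IGNORECASE)
print(overlapping.findall("I love android Marshmallow"))
# ['I love android', 'android Marshmallow']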