Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find lots of string in text - Python

Tags:

python

string

I'm searching for the best algorithm to resolve this problem: having a list (or a dict, a set) of small sentences, find the all occurrences of this sentences in a bigger text. The sentences in the list (or dict, or set) are about 600k but formed, on average, by 3 words. The text is, on average, 25 words long. I've just formatted the text (deleting punctuation, all lowercase and go on like this).

Here is what I have tried out (Python):

to_find_sentences = [
    'bla bla',
    'have a tea',
    'hy i m luca',
    'i love android',
    'i love ios',
    .....
]

text = 'i love android and i think i will have a tea with john'

def find_sentence(to_find_sentences, text):
    text = text.split()
    res = []
    w = len(text)
    for i in range(w):
        for j in range(i+1,w+1):
            tmp = ' '.join(descr[i:j])
            if tmp in to_find_sentences:
                res.add(tmp)
    return res


print find_sentence(to_find_sentence, text)

Out:

['i love android', 'have a tea']

In my case I've used a set to speed up the in operation

like image 932
Luca Di Liello Avatar asked Apr 26 '17 08:04

Luca Di Liello


People also ask

How do you search for multiple strings in a string in Python?

You can use any : a_string = "A string is more than its parts!" matches = ["more", "wholesome", "milk"] if any(x in a_string for x in matches): Similarly to check if all the strings from the list are found, use all instead of any .

How do I find a piece of text in a string in Python?

String find() in Python Just call the method on the string object to search for a string, like so: obj. find(“search”). The find() method searches for a query string and returns the character position if found. If the string is not found, it returns -1.

How do you find all occurrences of string in a string Python?

Use the string. count() Function to Find All Occurrences of a Substring in a String in Python. The string. count() is an in-built function in Python that returns the quantity or number of occurrences of a substring in a given particular string.


1 Answers

A fast solution would be to build a Trie out of your sentences and convert this trie to a regex. For your example, the pattern would look like this:

(?:bla\ bla|h(?:ave\ a\ tea|y\ i\ m\ luca)|i\ love\ (?:android|ios))

Here's an example on debuggex:

enter image description here

It might be a good idea to add '\b' as word boundaries, to avoid matching "have a team".

You'll need a small Trie script. It's not an official package yet, but you can simply download it here as trie.py in your current directory.

You can then use this code to generate the trie/regex:

import re
from trie import Trie

to_find_sentences = [
    'bla bla',
    'have a tea',
    'hy i m luca',
    'i love android',
    'i love ios',
]

trie = Trie()
for sentence in to_find_sentences:
    trie.add(sentence)

print(trie.pattern())
# (?:bla\ bla|h(?:ave\ a\ tea|y\ i\ m\ luca)|i\ love\ (?:android|ios))

pattern = re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)
text = 'i love android and i think i will have a tea with john'

print(re.findall(pattern, text))
# ['i love android', 'have a tea']

You invest some time to create the Trie and the regex, but the processing should be extremely fast.

Here's a related answer (Speed up millions of regex replacements in Python 3) if you want more information.

Note that it wouldn't find overlapping sentences:

to_find_sentences = [
    'i love android',
    'android Marshmallow'
]
# ...
print(re.findall(pattern, "I love android Marshmallow"))
# ['I love android']

You'd have to modifiy the regex with positive lookaheads to find overlapping sentences.

like image 78
Eric Duminil Avatar answered Oct 17 '22 00:10

Eric Duminil