Find lots of strings in text - Python

Tags: python, string

I'm looking for the best algorithm to solve this problem: given a list (or a dict, or a set) of short sentences, find all occurrences of these sentences in a longer text. The list (or dict, or set) holds about 600k sentences, but each one is only about 3 words long on average. The text is, on average, 25 words long. I've already normalized the text (removed punctuation, converted everything to lowercase, and so on).

Here is what I have tried out (Python):

to_find_sentences = [
    'bla bla',
    'have a tea',
    'hy i m luca',
    'i love android',
    'i love ios',
    # ...
]

text = 'i love android and i think i will have a tea with john'

def find_sentence(to_find_sentences, text):
    words = text.split()
    res = []
    w = len(words)
    # Try every contiguous run of words as a candidate sentence
    for i in range(w):
        for j in range(i+1, w+1):
            tmp = ' '.join(words[i:j])
            if tmp in to_find_sentences:
                res.append(tmp)
    return res


print(find_sentence(to_find_sentences, text))

Out:

['i love android', 'have a tea']

In my case I've used a set instead of a list for to_find_sentences, to speed up the "in" operation.

asked Apr 26 '17 08:04

Luca Di Liello

1 Answer

A fast solution would be to build a Trie out of your sentences and convert this trie to a regex. For your example, the pattern would look like this:

(?:bla\ bla|h(?:ave\ a\ tea|y\ i\ m\ luca)|i\ love\ (?:android|ios))

Here's an example on debuggex: (image: debuggex visualization of this pattern)

It might be a good idea to add '\b' as word boundaries, to avoid matching "have a team".
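To see why the boundaries matter, here is a small sketch using a hand-written two-sentence alternation (the variable names and the sample text are just illustrative assumptions):

```python
import re

# Hand-written alternation standing in for the generated trie pattern.
alternation = r"(?:have\ a\ tea|i\ love\ android)"
sample = "i will have a team meeting"

# Without boundaries, the sentence is found inside "have a team".
without_boundaries = re.findall(alternation, sample)
print(without_boundaries)
# ['have a tea']

# With \b on both sides, the partial word no longer matches.
with_boundaries = re.findall(r"\b" + alternation + r"\b", sample)
print(with_boundaries)
# []
```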

You'll need a small Trie script. It's not an official package yet, but you can simply download it here as trie.py in your current directory.
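If you don't have the script at hand, a minimal stand-in could look like the sketch below. It is not the script linked above: the internal layout and the helper _to_regex are assumptions, and only the names Trie, add() and pattern() are taken from the code in this answer. Saved as trie.py, it should be enough for the example that follows.

```python
import re


class Trie:
    """Character trie that can be serialized into a single regex."""

    def __init__(self):
        self.root = {}

    def add(self, sentence):
        # Walk down the trie, creating one nested dict per character.
        node = self.root
        for char in sentence:
            node = node.setdefault(char, {})
        node[''] = {}  # the empty key marks the end of a stored sentence

    def _to_regex(self, node):
        # Hypothetical helper: turns one trie node into a regex fragment.
        ends_here = '' in node
        branches = [
            re.escape(char) + self._to_regex(node[char])
            for char in sorted(key for key in node if key)
        ]
        if not branches:
            return ''
        if len(branches) == 1 and not ends_here:
            return branches[0]
        group = '(?:' + '|'.join(branches) + ')'
        # If a sentence ends at this node, the continuation is optional.
        return group + '?' if ends_here else group

    def pattern(self):
        return self._to_regex(self.root)


trie = Trie()
for sentence in ['bla bla', 'have a tea', 'hy i m luca',
                 'i love android', 'i love ios']:
    trie.add(sentence)

print(trie.pattern())
# (?:bla\ bla|h(?:ave\ a\ tea|y\ i\ m\ luca)|i\ love\ (?:android|ios))
```

Shared prefixes such as "i love " appear only once in the pattern, so the regex engine doesn't have to retry them for every sentence.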

You can then use this code to generate the trie/regex:

import re
from trie import Trie

to_find_sentences = [
    'bla bla',
    'have a tea',
    'hy i m luca',
    'i love android',
    'i love ios',
]

trie = Trie()
for sentence in to_find_sentences:
    trie.add(sentence)

print(trie.pattern())
# (?:bla\ bla|h(?:ave\ a\ tea|y\ i\ m\ luca)|i\ love\ (?:android|ios))

pattern = re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)
text = 'i love android and i think i will have a tea with john'

print(re.findall(pattern, text))
# ['i love android', 'have a tea']

You invest some time up front to build the Trie and compile the regex, but matching each text afterwards should be extremely fast.

Here's a related answer (Speed up millions of regex replacements in Python 3) if you want more information.

Note that it wouldn't find overlapping sentences:

to_find_sentences = [
    'i love android',
    'android Marshmallow'
]
# ...
print(re.findall(pattern, "I love android Marshmallow"))
# ['I love android']

You'd have to modify the regex with positive lookaheads to find overlapping sentences.
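A possible way to do that is to wrap the whole pattern in a lookahead with a capturing group, as in this sketch. To keep it self-contained, a plain escaped alternation stands in for trie.pattern(); that substitution and the variable names are assumptions.

```python
import re

to_find_sentences = [
    'i love android',
    'android Marshmallow',
]

# Assumption: a plain escaped alternation stands in for trie.pattern() here.
alternation = "|".join(re.escape(s) for s in to_find_sentences)

# The lookahead itself consumes no text, so after each hit the search
# resumes right after the starting position instead of after the sentence.
overlapping = re.compile(r"\b(?=((?:" + alternation + r")\b))", re.IGNORECASE)

matches = overlapping.findall("I love android Marshmallow")
print(matches)
# ['I love android', 'android Marshmallow']
```

Note that this still reports at most one sentence per starting position, so if one sentence is a prefix of another, only one of the two is returned there.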

answered Oct 17 '22 00:10

Eric Duminil
