
Double list comprehension for occurrences of a string in a list of strings

I have two lists of lists:

text = [['hello this is me'], ['oh you know u']]
phrases = [['this is', 'u'], ['oh you', 'me']]

I need to split the text making word combinations present in phrases a single string:

result = [['hello', 'this is', 'me'], ['oh you', 'know', 'u']]

I tried using zip(), but it pairs the lists up element by element, while I need to check every phrase against every text. I also tried find(), but in this example it would also match every letter 'u' inside other words (for instance, it splits 'you' into 'yo', 'u'). I wish replace() could replace a string with a list, because then I could do something like:

for line in text:
        line = line.replace('this is', ['this is'])

After trying all of this, I still haven't found anything that works in this situation. Can you help me with that?

kat678 asked Jan 24 '21 02:01


1 Answer

Clarified with original poster:

Given the text 'pack my box with five dozen liquor jugs' and the phrase 'five dozen'

the result should be:

['pack', 'my', 'box', 'with', 'five dozen', 'liquor', 'jugs']

not:

['pack my box with', 'five dozen', 'liquor jugs']

Each text and phrase is converted to a Python list of words, such as ['this', 'is', 'an', 'example'], which prevents 'u' from being matched inside a longer word.
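A quick illustration of why word lists help. This is only a sketch, and the variable names are chosen for illustration rather than taken from the script below:

```python
text = 'oh you know u'

# Substring search on the raw string hits the 'u' inside 'you' first.
char_position = text.find('u')

# Searching a list of words only matches whole words.
words = text.split()
word_position = words.index('u')

print(char_position)  # 5, inside 'you'
print(word_position)  # 3, the standalone word 'u'
```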

All possible subphrases of the text are generated by compile_subphrases(). Longer phrases (more words) are generated first so they are matched before shorter ones. 'five dozen jugs' would always be matched in preference to 'five dozen' or 'five'.
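To make the ordering concrete, here is a minimal sketch of longest-first generation. The function name subphrases_longest_first is hypothetical, a simplified stand-in for compile_subphrases() without its optional parameters:

```python
def subphrases_longest_first(text):
    """Yield every contiguous run of words in text, longest runs first."""
    words = text.split()
    for length in range(len(words), 0, -1):
        for start in range(len(words) - length + 1):
            yield ' '.join(words[start:start + length])

ordered = list(subphrases_longest_first('five dozen jugs'))
print(ordered)
# ['five dozen jugs', 'five dozen', 'dozen jugs', 'five', 'dozen', 'jugs']
```

Because the three-word phrase comes out before any of its parts, a matcher that takes the first hit will prefer it over 'five dozen' or 'five'.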

Phrase and subphrase are compared using list slices, roughly like this:

    text = ['five', 'dozen', 'liquor', 'jugs']
    phrase = ['liquor', 'jugs']
    if text[2:4] == phrase:  # slice end is exclusive, so 2:4 covers two words
        print('matched')

Using this method for comparing phrases, the script walks through the original text, rewriting it with the phrases picked out.

from itertools import chain

texts = [['hello this is me'], ['oh you know u']]
phrases_to_match = [['this is', 'u'], ['oh you', 'me']]

def flatten(list_of_lists):
    return list(chain(*list_of_lists))

def compile_subphrases(text, minwords=1, include_self=True):
    words = text.split()
    text_length = len(words)
    max_phrase_length = text_length if include_self else text_length - 1
    # NOTE: longest phrases first
    for phrase_length in range(max_phrase_length, minwords - 1, -1):
        n_length_phrases = (' '.join(words[r:r + phrase_length])
                            for r in range(text_length - phrase_length + 1))
        yield from n_length_phrases
        
def match_sublist(mainlist, sublist, i):
    if i + len(sublist) > len(mainlist):
        return False
    return sublist == mainlist[i:i + len(sublist)]

# flatten() already returns a list, so no extra list() call is needed
phrases_to_match = flatten(phrases_to_match)
texts = flatten(texts)
results = []
for raw_text in texts:
    print(f"Raw text: '{raw_text}'")
    matched_phrases = [
        subphrase.split()
        for subphrase
        in compile_subphrases(raw_text)
        if subphrase in phrases_to_match
    ]
    phrasal_text = []
    index = 0
    text_words = raw_text.split()
    while index < len(text_words):
        for matched_phrase in matched_phrases:
            if match_sublist(text_words, matched_phrase, index):
                phrasal_text.append(' '.join(matched_phrase))
                index += len(matched_phrase)
                break
        else:
            phrasal_text.append(text_words[index])
            index += 1
    results.append(phrasal_text)
print(f'Phrases to match: {phrases_to_match}')
print(f"Results: {results}")

Results:

$python3 main.py
Raw text: 'hello this is me'
Raw text: 'oh you know u'
Phrases to match: ['this is', 'u', 'oh you', 'me']
Results: [['hello', 'this is', 'me'], ['oh you', 'know', 'u']]

For testing this and other answers with larger datasets, use this in place of the texts and phrases_to_match definitions at the start of the code. It builds 5-word variations on a single long sentence and samples 500 of them to simulate hundreds of texts.

from itertools import chain, combinations
import random

#texts = [['hello this is me'], ['oh you know u']]
theme = ' '.join([
    'pack my box with five dozen liquor jugs said',
    'the quick brown fox as he jumped over the lazy dog'
])
variations = list([
    ' '.join(combination)
    for combination
    in combinations(theme.split(), 5)
])
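# Wrap each text in its own list so flatten() yields whole strings, not characters
texts = [[text] for text in random.choices(variations, k=500)]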
#phrases_to_match = [['this is', 'u'], ['oh you', 'me']]
phrases_to_match = [
    ['pack my box', 'quick brown', 'the quick', 'brown fox'],
    ['jumped over', 'lazy dog'],
    ['five dozen', 'liquor', 'jugs']
]
Nick answered Sep 28 '22 08:09