I have two lists of lists:
text = [['hello this is me'], ['oh you know u']]
phrases = [['this is', 'u'], ['oh you', 'me']]
I need to split the text into words, keeping every word combination that appears in phrases as a single string:
result = [['hello', 'this is', 'me'], ['oh you', 'know', 'u']]
I tried using zip(), but it walks through the lists in parallel, while I need to check each text against every phrase. I also tried the find() method, but in this example it would also find every letter 'u' and split it off as its own string (the word 'you' becomes 'yo', 'u'). I wish replace() could replace a string with a list, because then I could do something like:
for line in text:
    line = line.replace('this is', ['this is'])
But after trying everything, I still haven't found anything that works in this situation. Can you help me with that?
Clarified with original poster:
Given the text
pack my box with five dozen liquor jugs
and the phrase five dozen
the result should be:
['pack', 'my', 'box', 'with', 'five dozen', 'liquor', 'jugs']
not:
['pack my box with', 'five dozen', 'liquor jugs']
Each text and phrase is converted to a Python list of words, such as ['this', 'is', 'an', 'example'], which prevents 'u' from being matched inside a word.
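To illustrate why comparing word lists avoids the partial-match problem from the question, here is a minimal sketch. The variable names are only illustrative and are not part of the script further down.

```python
text = 'oh you know u'

# Substring search finds the 'u' inside the word 'you' (character index 5).
print(text.find('u'))      # 5

# Splitting into words first means only whole words can match.
words = text.split()
print(words)               # ['oh', 'you', 'know', 'u']
print(words.index('u'))    # 3, the standalone word rather than the 'u' in 'you'
```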
All possible subphrases of the text are generated by compile_subphrases(). Longer phrases (more words) are generated first so they are matched before shorter ones: 'five dozen jugs' would always be matched in preference to 'five dozen' or 'five'.
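The longest-first ordering can be seen in a small standalone sketch. The helper name subphrases_longest_first is hypothetical; it is a simplified stand-in for the idea, not the compile_subphrases() function used in the script.

```python
def subphrases_longest_first(text):
    """Yield every run of consecutive words in text, longest runs first."""
    words = text.split()
    for length in range(len(words), 0, -1):
        for start in range(len(words) - length + 1):
            yield ' '.join(words[start:start + length])

print(list(subphrases_longest_first('five dozen jugs')))
# ['five dozen jugs', 'five dozen', 'dozen jugs', 'five', 'dozen', 'jugs']
```

Because 'five dozen' comes out before 'five', a matcher that tries candidates in this order picks the longer phrase whenever both are possible.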
Phrase and subphrase are compared using list slices, roughly like this:
text = ['five', 'dozen', 'liquor', 'jugs']
phrase = ['liquor', 'jugs']
# The slice end is exclusive, so [2:4] covers the last two words
if text[2:4] == phrase:
    print('matched')
Using this method for comparing phrases, the script walks through the original text, rewriting it with the phrases picked out.
texts = [['hello this is me'], ['oh you know u']]
phrases_to_match = [['this is', 'u'], ['oh you', 'me']]
from itertools import chain
def flatten(list_of_lists):
    # Turn [['a', 'b'], ['c']] into ['a', 'b', 'c']
    return list(chain.from_iterable(list_of_lists))
def compile_subphrases(text, minwords=1, include_self=True):
    words = text.split()
    text_length = len(words)
    max_phrase_length = text_length if include_self else text_length - 1
    # NOTE: longest phrases first
    # Start at max_phrase_length itself; starting one higher ignores include_self=False
    for phrase_length in range(max_phrase_length, minwords - 1, -1):
        n_length_phrases = (' '.join(words[r:r + phrase_length])
                            for r in range(text_length - phrase_length + 1))
        yield from n_length_phrases
def match_sublist(mainlist, sublist, i):
    if i + len(sublist) > len(mainlist):
        return False
    return sublist == mainlist[i:i + len(sublist)]
phrases_to_match = list(flatten(phrases_to_match))
texts = list(flatten(texts))
results = []
for raw_text in texts:
    print(f"Raw text: '{raw_text}'")
    # Subphrases of this text that are also phrases to match, longest first
    matched_phrases = [
        subphrase.split()
        for subphrase
        in compile_subphrases(raw_text)
        if subphrase in phrases_to_match
    ]
    phrasal_text = []
    index = 0
    text_words = raw_text.split()
    while index < len(text_words):
        for matched_phrase in matched_phrases:
            if match_sublist(text_words, matched_phrase, index):
                phrasal_text.append(' '.join(matched_phrase))
                index += len(matched_phrase)
                break
        else:
            # for/else: no phrase starts at this index, so keep the single word
            phrasal_text.append(text_words[index])
            index += 1
    results.append(phrasal_text)
print(f'Phrases to match: {phrases_to_match}')
print(f"Results: {results}")
Results:
$ python3 main.py
Raw text: 'hello this is me'
Raw text: 'oh you know u'
Phrases to match: ['this is', 'u', 'oh you', 'me']
Results: [['hello', 'this is', 'me'], ['oh you', 'know', 'u']]
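For reuse, the same walk can be wrapped in a function. The sketch below is a variation and not the script itself: instead of generating subphrases of each text, it sorts the phrases by word count so longer ones are tried first. The function name split_with_phrases is hypothetical.

```python
def split_with_phrases(text, phrases):
    """Split text into words, keeping any known phrase as one item."""
    words = text.split()
    # Word-list form of each non-empty phrase, longest first so longer phrases win.
    phrase_words = sorted(
        (p.split() for p in phrases if p.strip()),
        key=len,
        reverse=True,
    )
    result = []
    index = 0
    while index < len(words):
        for phrase in phrase_words:
            # Slice comparison: does this phrase start at the current word?
            if words[index:index + len(phrase)] == phrase:
                result.append(' '.join(phrase))
                index += len(phrase)
                break
        else:
            # No phrase starts at this position, so keep the single word.
            result.append(words[index])
            index += 1
    return result


phrases = ['this is', 'u', 'oh you', 'me']
print(split_with_phrases('hello this is me', phrases))
# ['hello', 'this is', 'me']
print(split_with_phrases('oh you know u', phrases))
# ['oh you', 'know', 'u']
```

Sorting the phrases once avoids regenerating every subphrase for each text, at the cost of trying phrases that may not occur in a given text at all.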
To test this and other answers with larger datasets, put this at the start of the code in place of the sample texts and phrases. It builds every five-word combination from a single long sentence and randomly picks 500 of them to simulate hundreds of texts.
from itertools import chain, combinations
import random
#texts = [['hello this is me'], ['oh you know u']]
theme = ' '.join([
    'pack my box with five dozen liquor jugs said',
    'the quick brown fox as he jumped over the lazy dog'
])
variations = [
    ' '.join(combination)
    for combination
    in combinations(theme.split(), 5)
]
# Wrap each text in its own list so that flatten() in the script yields whole strings
texts = [[text] for text in random.choices(variations, k=500)]
#phrases_to_match = [['this is', 'u'], ['oh you', 'me']]
phrases_to_match = [
    ['pack my box', 'quick brown', 'the quick', 'brown fox'],
    ['jumped over', 'lazy dog'],
    ['five dozen', 'liquor', 'jugs']
]
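Because random.choices() draws a different sample on every run, results from different answers are hard to compare directly. A minimal sketch, assuming you want repeatable test data: seed the generator before sampling. The seed value 42 and the shortened theme are arbitrary choices for illustration.

```python
import random
from itertools import combinations

theme = 'pack my box with five dozen liquor jugs'
variations = [' '.join(c) for c in combinations(theme.split(), 5)]

random.seed(42)  # arbitrary fixed seed; any integer works
first = random.choices(variations, k=3)

random.seed(42)  # re-seeding with the same value replays the same draws
second = random.choices(variations, k=3)

print(first == second)  # True
```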