
Finding neighbors in string list

I have a list and a set. I would like to find all occurrences of each element of the set within the list, and get the strings to the left and right of that word.

I have this written right now:

unique_set = set(text)
context_key = {}
bad_counter = 0

for i,j in enumerate(unique_set):
    context_list = []
    if j in text:
        context = []
        context.append(text[i-1])
        context.append(text[i])
        context.append(text[i+1])
        if j in context:
            context_list.append(context)
            context_key[j] = context_list
        else:
            bad_counter += 1

print(bad_counter)
print(context_key)

The loop does seem to iterate over both the set and the list, but the counter I added to track misses doesn't add up: the full text list has about 130k entries, so roughly 15k misses already looks bad, and getting ONLY 3 key/value pairs is what is really throwing me off. This is the output:

15928
{'compost': [['gardens', 'compost', 'heaps']], 'extra': [['color', 'hair', 'extra']], 'commercial': [['commercial', 'first', 'came']]} 

The end goal is to add each unique value in the set as a key in context_key, with every list that contains that value as the corresponding dict value.

Sebastian Goslin asked Mar 20 '26 02:03



1 Answer

If all you want to do is make a list of the word itself, the word before, and the word after then this should do the trick:

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.".split(" ")

unique_set = set(text)
context_key = {}

for word in unique_set:
    # Every position at which this word occurs in the original list.
    indices = [i for i, x in enumerate(text) if x == word]

    contexts = []

    for index in indices:
        # The first and last words have no neighbour on one side, so skip them.
        if index in (0, len(text) - 1):
            continue

        # Use the word's position in text here, not a counter over the set.
        word_before = text[index - 1]
        word_after = text[index + 1]

        contexts.append([word_before, word, word_after])

    context_key[word] = contexts

print(context_key)

Output (abridged to a few keys; the key order varies between runs because sets are unordered):

{'dolor': [['ipsum', 'dolor', 'sit'], ['irure', 'dolor', 'in']], 'in': [['dolor', 'in', 'reprehenderit'], ['reprehenderit', 'in', 'voluptate'], ['sunt', 'in', 'culpa']], 'ut': [['incididunt', 'ut', 'labore'], ['nisi', 'ut', 'aliquip']], 'Lorem': [], ...}
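With a list of about 130k words, rescanning the whole list once per unique word gets slow. A single pass over the list that groups the contexts by word as it goes should give the same kind of dictionary. This is only a sketch: the function name build_contexts and the sample sentence are made up for illustration, and it assumes you are happy to skip the first and last word, since they are missing a neighbour.

```python
from collections import defaultdict


def build_contexts(words):
    """Map each word to a list of [before, word, after] triples."""
    contexts = defaultdict(list)
    # Start at 1 and stop before the last index so both neighbours exist.
    for index in range(1, len(words) - 1):
        word = words[index]
        contexts[word].append([words[index - 1], word, words[index + 1]])
    return dict(contexts)


# Hypothetical sample data, not the original text.
sample = "the cat saw the dog and the bird".split(" ")
context_key = build_contexts(sample)
print(context_key["the"])
# [['saw', 'the', 'dog'], ['and', 'the', 'bird']]
```

One difference to be aware of: a word that occurs only at the very start or end of the list (like 'bird' here) gets no key at all in this version.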

EDIT:

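The problem with the code you provided is that the index from enumerate is the element's position in the set's iteration order, which is arbitrary and has nothing to do with where the word sits in text. So text[i-1], text[i] and text[i+1] are just the words around position i of the list, and they only contain j by coincidence. That is why the if j in context check almost always fails: the loop runs once per unique word, so your 15928 misses plus 3 hits simply add up to the number of unique words in the set (15931).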
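A tiny example makes the mismatch easier to see. The words below are made up; the snippet just prints, for each word, the counter that enumerate hands out over the set next to the positions where that word really sits in the list.

```python
words = "a b c a d".split(" ")
unique_words = set(words)

# Real positions of each word in the list.
positions = {w: [i for i, x in enumerate(words) if x == w] for w in unique_words}

for set_index, word in enumerate(unique_words):
    # set_index only counts how many set elements came before this one;
    # the set's iteration order is arbitrary and can change between runs.
    print(set_index, word, positions[word])

print(positions["a"])  # [0, 3]
```

The counter never goes past 3 here because the set has four elements, while the list has positions 0 to 4, and 'a' lives at two of them.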

Hope that makes some small amount of sense :)

Ed Ward answered Mar 22 '26 17:03



