
Matching incomplete string in Python dictionary

I have a dictionary whose keys are strings and whose values are integers, like:

{
...
'X ontology entity': 0, 
'X entity': 1, 
'image quality': 10, 
'right lower kidney': 10, 
'magnetic resonance imaging': 10312, 
'MR imaging': 10312, 
 ...
}

I'm iterating over the keys of this dictionary, trying to match a series of tokens with these keys. Suppose I have the following series of tokens:

MR imaging shows that the patient suffers from infection in right lower kidney.

I split the above text on whitespace.

I want to match both MR imaging and right lower kidney, since both are among the keys in the dictionary. I have written the following code, which matches "MR imaging" but not "right lower kidney". (Note that right lower is not present in the key set.)

found = []
for i, t in enumerate(tokens):
    term = [tokens[i]]
    j = deepcopy(i)
    while (' '.join(term) in self.db_terms):
        if j < len(tokens):
            j += 1
            term.append(tokens[j])
    found.append(' '.join(term[:-1]))
return set(found)

I don't know how to look up "right lower" among the keys, see that it is the start of "right lower kidney", and then go on to check the third token.

Any help would be appreciated! Thanks!

inverted_index asked Jan 01 '23 13:01


2 Answers

It seems like you are dealing with n-grams. Note that this answer assumes your dictionary has far more keys than the text has possible n-grams. In that case, it is more efficient to generate n-grams from the text than to iterate over the dictionary keys (as the other answer does).

Start by defining the keys dictionary.

keys = {
'X ontology entity': 0, 
'X entity': 1, 
'image quality': 10, 
'right lower kidney': 10, 
'magnetic resonance imaging': 10312, 
'MR imaging': 10312, 
}

You will need to generate all n-grams within a range that you choose and, for each n-gram, check whether it exists as a key in your dictionary.

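import re

def get_ngrams(tokens, ngram_range):
    # Join every run of r consecutive tokens, for each r in ngram_range.
    return {' '.join(tokens[i:i+r])
        for i in range(len(tokens)) for r in range(*ngram_range)}

text = "MR imaging shows that the patient suffers from infection in right lower kidney."
ngram_range = (1, 4) # Right exclusive, so n-grams of 1 to 3 tokens.
tokens = re.sub(r'[^a-zA-Z]', ' ', text).split() # Replace non-letters with spaces, then split.
found_tokens = set(filter(keys.__contains__, get_ngrams(tokens, ngram_range)))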

print(found_tokens)
# {'MR imaging', 'right lower kidney'}

Keep in mind that this becomes expensive for wide ranges and long texts: the number of n-grams generated is roughly the number of tokens times the number of sizes in the range.
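
One way to keep the range no wider than necessary is to derive it from the dictionary itself, since no n-gram longer than the longest key can ever match. The following is a minimal sketch under the assumption that keys are whitespace-separated in the same way as the tokens; `max_key_len` is a name introduced here for illustration only.

```python
import re

keys = {
    'X ontology entity': 0,
    'X entity': 1,
    'image quality': 10,
    'right lower kidney': 10,
    'magnetic resonance imaging': 10312,
    'MR imaging': 10312,
}
text = "MR imaging shows that the patient suffers from infection in right lower kidney."
tokens = re.sub(r'[^a-zA-Z]', ' ', text).split()

# Length of the longest key, measured in tokens (illustrative name).
max_key_len = max(len(key.split()) for key in keys)

# range() is right exclusive, so add 1 to include the longest key.
ngram_range = (1, max_key_len + 1)

found = set()
for i in range(len(tokens)):
    for r in range(*ngram_range):
        if i + r > len(tokens):
            break  # the slice would run past the end of the text
        ngram = ' '.join(tokens[i:i + r])
        if ngram in keys:
            found.add(ngram)

print(sorted(found))
# ['MR imaging', 'right lower kidney']
```

With the sample dictionary the longest key has three tokens, so this reproduces the hand-picked range (1, 4).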


You can optimise a bit by recognising that not all n-grams need to be stored in memory before filtering. A generator and a loop avoid building the full set up front, which saves memory:

def ngrams_generator(tokens, ngram_range):
    yield from (' '.join(tokens[i:i+r]) 
        for i in range(len(tokens)) for r in range(*ngram_range))

found_ngrams = set()
for ngram in ngrams_generator(tokens, ngram_range):
    if ngram in keys:
        found_ngrams.add(ngram)

print(found_ngrams)
# {'MR imaging', 'right lower kidney'}
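
If you would rather keep the shape of the loop in the question (extend a term one token at a time), a different technique from the n-gram enumeration above is to precompute every leading token sequence of every key, so that "right" and "right lower" count as valid intermediate steps even though they are not keys themselves. This is only a sketch; `build_prefixes` and `find_terms` are hypothetical names, and it assumes keys and text are tokenised the same way.

```python
import re

def build_prefixes(keys):
    """Collect every leading token sequence of every key,
    e.g. 'right', 'right lower', 'right lower kidney'."""
    prefixes = set()
    for key in keys:
        parts = key.split()
        for n in range(1, len(parts) + 1):
            prefixes.add(' '.join(parts[:n]))
    return prefixes

def find_terms(tokens, keys):
    prefixes = build_prefixes(keys)
    found = set()
    for i in range(len(tokens)):
        j = i + 1
        while j <= len(tokens):
            candidate = ' '.join(tokens[i:j])
            if candidate not in prefixes:
                break  # this span cannot grow into any key
            if candidate in keys:
                found.add(candidate)
            j += 1
    return found

keys = {
    'X ontology entity': 0,
    'X entity': 1,
    'image quality': 10,
    'right lower kidney': 10,
    'magnetic resonance imaging': 10312,
    'MR imaging': 10312,
}
text = "MR imaging shows that the patient suffers from infection in right lower kidney."
tokens = re.sub(r'[^a-zA-Z]', ' ', text).split()

print(sorted(find_terms(tokens, keys)))
# ['MR imaging', 'right lower kidney']
```

The scan stops extending a span as soon as it is no longer the start of any key, so no n-gram range has to be chosen. The cost is the extra memory for the prefix set.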

cs95 answered Jan 04 '23 02:01


You could do it the other way: start with the keys and check whether each key is in the sentence. It's certainly simpler. Whether it is efficient (or efficient enough) depends on how large your inputs are.

d = {
    'X ontology entity': 0, 
    'X entity': 1, 
    'image quality': 10, 
    'right lower kidney': 10, 
    'magnetic resonance imaging': 10312, 
    'MR imaging': 10312, 
}

sentence = "MR imaging shows that the patient suffers from infection in right lower kidney."

[key for key in d.keys() if key in sentence]
# ['right lower kidney', 'MR imaging']
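
One caveat: `key in sentence` is a plain substring test, so a key can also match inside a longer word (for example, 'MR imaging' is a substring of 'fMR imaging'). If that matters for your data, a possible variation is to anchor each key at word boundaries with a regular expression. This is a sketch, not a drop-in replacement; `keys_in_sentence` is a hypothetical helper name, and `\b` only behaves as intended for keys that start and end with a letter, digit or underscore.

```python
import re

d = {
    'X ontology entity': 0,
    'X entity': 1,
    'image quality': 10,
    'right lower kidney': 10,
    'magnetic resonance imaging': 10312,
    'MR imaging': 10312,
}

sentence = "MR imaging shows that the patient suffers from infection in right lower kidney."

def keys_in_sentence(d, sentence):
    matches = []
    for key in d:
        # \b anchors the key at word boundaries; re.escape guards
        # against keys that contain regex metacharacters.
        if re.search(r'\b' + re.escape(key) + r'\b', sentence):
            matches.append(key)
    return matches

print(keys_in_sentence(d, sentence))
# ['right lower kidney', 'MR imaging']

print(keys_in_sentence(d, "fMR imaging was not used."))
# []
```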

Mark answered Jan 04 '23 04:01