I have a dictionary whose keys are strings and whose values are integers, like:
{
...
'X ontology entity': 0,
'X entity': 1,
'image quality': 10,
'right lower kidney': 10,
'magnetic resonance imaging': 10312,
'MR imaging': 10312,
...
}
I'm iterating over the keys of this dictionary, trying to match a series of tokens against these keys. Suppose I have the following text:
MR imaging shows that the patient suffers from infection in right lower kidney.
I just split the above text on whitespace.
I want to match "MR imaging" as well as "right lower kidney", since both are keys in the dictionary. I have written the following code, but it only matches "MR imaging", not "right lower kidney". (Note that "right lower" is not present in the key set.)
found = []
for i, t in enumerate(tokens):
    term = [t]
    j = i  # ints are immutable, so deepcopy(i) was unnecessary
    # Extend the phrase while the current prefix is itself a key.
    while ' '.join(term) in self.db_terms:
        j += 1
        if j >= len(tokens):
            break  # without this, the loop never terminates at the end of the text
        term.append(tokens[j])
    if len(term) > 1:
        found.append(' '.join(term[:-1]))
return set(found)
What I can't figure out is how to keep matching "right lower" against the keys, match "right lower kidney", and then move on to check the next index.
Any help would be appreciated! Thanks!
It seems like you are dealing with n-grams. Note: this answer assumes your dictionary has many keys relative to the number of possible n-grams in the text. In that case, it is more efficient to generate n-grams from the text than to iterate over the dictionary keys (as the other answer does).
Start by defining the keys dictionary:
keys = {
'X ontology entity': 0,
'X entity': 1,
'image quality': 10,
'right lower kidney': 10,
'magnetic resonance imaging': 10312,
'MR imaging': 10312,
}
You will need to generate all n-grams within a range (that you decide), and for each n-gram, check whether it exists as a key in your dictionary.
import re

text = "MR imaging shows that the patient suffers from infection in right lower kidney."

def get_ngrams(tokens, ngram_range):
    return {' '.join(tokens[i:i+r])
            for i in range(len(tokens)) for r in range(*ngram_range)}

ngram_range = (1, 4)  # Right exclusive.
tokens = re.sub(r'[^a-zA-Z]', ' ', text).split()
found_tokens = set(filter(keys.__contains__, get_ngrams(tokens, ngram_range)))
print(found_tokens)
# {'MR imaging', 'right lower kidney'}
Keep in mind, for larger ranges and strings, this becomes an expensive operation.
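To see why: for a token list of length n and an n-gram range (lo, hi), the comprehension performs roughly n * (hi - lo) joins before the set dedupes them, so the work grows with both text length and range width. A quick sanity check, reusing the example sentence from above:

```python
import re

text = "MR imaging shows that the patient suffers from infection in right lower kidney."
tokens = re.sub(r'[^a-zA-Z]', ' ', text).split()

ngram_range = (1, 4)  # Right exclusive: sizes 1, 2 and 3.
# One join per (start index, size) pair, before deduplication.
n_joins = len(tokens) * (ngram_range[1] - ngram_range[0])
print(len(tokens), n_joins)  # 13 39
```

For a 13-token sentence this is trivial, but for a long document with a wide range the count adds up quickly.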
You can optimise a bit by recognising that not all n-grams need to be stored in memory before filtering; a generator lets you test each one as it is produced:
def ngrams_generator(tokens, ngram_range):
    yield from (' '.join(tokens[i:i+r])
                for i in range(len(tokens)) for r in range(*ngram_range))

found_ngrams = set()
for ngram in ngrams_generator(tokens, ngram_range):
    if ngram in keys:
        found_ngrams.add(ngram)
print(found_ngrams)
# {'MR imaging', 'right lower kidney'}
You could do it the other way around: start from the keys and check whether each key appears in the sentence. It's certainly simpler. Whether it is efficient (or efficient enough) depends on how large your inputs are.
d = {
'X ontology entity': 0,
'X entity': 1,
'image quality': 10,
'right lower kidney': 10,
'magnetic resonance imaging': 10312,
'MR imaging': 10312,
}
sentence = "MR imaging shows that the patient suffers from infection in right lower kidney."
[key for key in d.keys() if key in sentence]
# ['right lower kidney', 'MR imaging']
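One caveat with plain substring matching: a key can match inside a longer word or phrase (e.g. 'X entity' would match inside "complex entity"). If that matters for your data, one option (a sketch, not part of the original answer) is to anchor each key at word boundaries with a regex:

```python
import re

d = {
    'X ontology entity': 0,
    'X entity': 1,
    'image quality': 10,
    'right lower kidney': 10,
    'magnetic resonance imaging': 10312,
    'MR imaging': 10312,
}
sentence = "MR imaging shows that the patient suffers from infection in right lower kidney."

# \b anchors the key at word boundaries; re.escape guards against
# keys containing regex metacharacters.
matches = [key for key in d
           if re.search(r'\b' + re.escape(key) + r'\b', sentence)]
print(matches)  # ['right lower kidney', 'MR imaging']
```

This keeps the simplicity of the key-first approach while avoiding mid-word false positives.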