I want to extract Cardinal(CD) values associated with Units of Measurement and store it in a dictionary. For example if the text contains tokens like "20 kgs", it should extract it and keep it in a dictionary.
Example:
for input text, “10-inch fry pan offers superb heat conductivity and distribution”, the output dictionary should look like, {"dimension":"10-inch"}
for input text, "This bucket holds 5 litres of water.", the output should look like, {"volume": "5 litres"}
line = 'This bucket holds 5 litres of water.'
tokenized = nltk.word_tokenize(line)
tagged = nltk.pos_tag(tokenized)
The above line would give the output:
[('This', 'DT'), ('bucket', 'NN'), ('holds', 'VBZ'), ('5', 'CD'), ('litres', 'NNS'), ('of', 'IN'), ('water', 'NN'), ('.', '.')]
Is there a way to extract the CD and UOM values from the text?
Not sure how flexible you need the process to be. You can play around with nltk.RegexParser and come up with some good patters:
import nltk
sentence = 'This bucket holds 5 litres of water.'
parser = nltk.RegexpParser(
"""
INDICATOR: {<CD><NNS>}
""")
print parser.parse(nltk.pos_tag(nltk.word_tokenize(sentence)))
Output:
(S
This/DT
bucket/NN
holds/VBZ
(INDICATOR 5/CD litres/NNS)
of/IN
water/NN
./.)
You can also create a corpus and train a chunker.
Hm, not sure if it helps - but I wrote it in Javascript. Here: http://github.com/redaktor/nlp_compromise
It might be a bit undocumented yet but the guys are porting it to a 2.0 branch now.
It should be easy to port to python considering What's different between Python and Javascript regular expressions?
And : Did you check pythons NLTK? : http://www.nltk.org
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With