
How can Stanford CoreNLP Named Entity Recognition capture measurements like 5 inches, 5", 5 in., 5 in

I'm looking to capture measurements using Stanford CoreNLP. (If you can suggest a different extractor, that is fine too.)

For example, I want to find 15kg, 15 kg, 15.0 kg, 15 kilogram, 15 lbs, 15 pounds, etc. But among CoreNLP's extraction rules, I don't see one for measurements.

Of course, I can do this with pure regexes, but toolkits can run more quickly, and they offer the opportunity to chunk at a higher level. For example, even without full syntactic parsing, a toolkit can treat gb and gigabytes as one unit, and RAM and memory as one concept, then use those as building blocks for bigger units like 128 gb RAM and 8 gigabytes memory.
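For comparison, the pure-regex baseline with unit normalization might look like the sketch below. It is only an illustration in plain Python, not CoreNLP: the `UNIT_ALIASES` table and the `extract_measurements` function are hypothetical names, and the unit list is deliberately tiny.

```python
import re

# Hypothetical alias table: every surface form maps to one canonical unit,
# so "gb" and "gigabytes" end up as the same concept.
UNIT_ALIASES = {
    "kg": "kilogram", "kgs": "kilogram", "kilogram": "kilogram", "kilograms": "kilogram",
    "lb": "pound", "lbs": "pound", "pound": "pound", "pounds": "pound",
    "gb": "gigabyte", "gigabyte": "gigabyte", "gigabytes": "gigabyte",
    "in": "inch", "in.": "inch", "inch": "inch", "inches": "inch", '"': "inch",
}

# Put longer aliases first so that "in." is tried before "in", "kgs" before "kg".
_units = sorted(UNIT_ALIASES, key=len, reverse=True)
MEASUREMENT = re.compile(
    r"(?<![\w.])(\d+(?:\.\d+)?)\s*("
    + "|".join(re.escape(u) for u in _units)
    + r")(?![A-Za-z])",
    re.IGNORECASE,
)

def extract_measurements(text):
    """Return (value, canonical_unit) pairs found in text."""
    return [
        (float(m.group(1)), UNIT_ALIASES[m.group(2).lower()])
        for m in MEASUREMENT.finditer(text)
    ]

print(extract_measurements("128 gb RAM and 8 gigabytes memory"))
```

This handles the spacing and spelling variants, but it works on raw characters, so it cannot build on token-level annotations such as an existing NUMBER tag.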

I want an extractor for this that is rule-based (not machine-learning-based), but I don't see one as part of RegexNER or elsewhere. How do I go about this?

IBM Named Entity Extraction can do this. It runs the regexes efficiently rather than passing the text through each one in turn, and it bundles them to express meaningful entities, for example one that unites all the measurement units into a single concept.

Joshua Fox asked Dec 13 '15 14:12


2 Answers

I don't think a rule-based system exists for this particular task. However, it shouldn't be hard to build one with TokensRegexNER. For example, a mapping like:

[{ner:NUMBER}]+ /(k|m|g|t)b/ memory?   MEMORY
[{ner:NUMBER}]+ /"|''|in(ches)?/       LENGTH
...

You could try using vanilla TokensRegex as well, and then just extract out the relevant value with a capture group:

(?$group_name [{ner:NUMBER}]+) /(k|m|g|t)b/ memory?
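To make the token-level idea concrete, here is a rough plain-Python approximation of what the two mapping rules above express: one or more number tokens, then a unit token, then an optional trailing word. It is not TokensRegex itself. The `RULES` table and `tag_measurements` function are hypothetical, and a simple digit regex stands in for the `{ner:NUMBER}` annotation.

```python
import re

# Stand-in for the {ner:NUMBER} check: a token that looks like a number.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

# (label, unit-token regex, optional trailing-token regex or None)
RULES = [
    ("MEMORY", re.compile(r"(k|m|g|t)b", re.I), re.compile(r"memory", re.I)),
    ("LENGTH", re.compile(r"\"|''|in(ches)?", re.I), None),
]

def tag_measurements(tokens):
    """Return (label, number_tokens, full_span_tokens) for each match."""
    spans = []
    i = 0
    while i < len(tokens):
        # Consume a run of number tokens starting at i.
        j = i
        while j < len(tokens) and NUMBER.fullmatch(tokens[j]):
            j += 1
        matched = False
        if i < j < len(tokens):
            for label, unit, trailer in RULES:
                if unit.fullmatch(tokens[j]):
                    end = j + 1
                    # The trailing word (e.g. "memory") is optional.
                    if trailer and end < len(tokens) and trailer.fullmatch(tokens[end]):
                        end += 1
                    # tokens[i:j] plays the role of the capture group.
                    spans.append((label, tokens[i:j], tokens[i:end]))
                    i = end
                    matched = True
                    break
        if not matched:
            i += 1
    return spans

print(tag_measurements("It ships with 128 gb memory and a 5 in screen".split()))
```

As with the mapping itself, a bare `in` after a number is ambiguous ("5 in the box"), so a real rule set would probably need extra context checks.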
Gabor Angeli answered Oct 12 '22 14:10


You can build your own training data and label the required measurements accordingly.

For example, suppose you have a sentence like: Jack weighs about 50 kgs

The trained model would then classify your input as:

Jack, PERSON
weighs, O
about, O
50, MES
kgs, MES

Where MES stands for measurements.

I have recently made training data for the Stanford NER tagger for my customized problem and have built a model for it.
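As a rough sketch of how such training data could be prepared: to my knowledge, the Stanford NER trainer reads one token per line with its label in a second tab-separated column, and a blank line between sentences. The helper below writes that layout. The function name `to_ner_tsv` is my own, and the MES label is just the example tag used above.

```python
def to_ner_tsv(sentences):
    """Serialize labeled sentences as token<TAB>label lines.

    `sentences` is a list of sentences, each a list of (token, label) pairs.
    A blank line is emitted after every sentence.
    """
    lines = []
    for sentence in sentences:
        for token, label in sentence:
            lines.append(f"{token}\t{label}")
        lines.append("")  # sentence separator
    return "\n".join(lines)

example = [[
    ("Jack", "PERSON"), ("weighs", "O"), ("about", "O"),
    ("50", "MES"), ("kgs", "MES"),
]]
print(to_ner_tsv(example))
```

The resulting file would then be referenced from the trainer's properties file. Check the Stanford NER documentation for the exact column mapping and feature settings before relying on this.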

I think you can do the same thing for Stanford CoreNLP NER.

Note that this is a machine-learning-based approach rather than a rule-based one.

Rohan Amrute answered Oct 12 '22 14:10