How to extract numbers (along with comparison adjectives or ranges)

I am working on two NLP projects in Python, and both involve a similar task: extracting numerical values and comparison operators from sentences like the following:

"... greater than $10 ... ", "... weight not more than 200lbs ...", "... height in 5-7 feets ...", "... faster than 30 seconds ... " 

I found two different approaches to solve this problem:

  • using very complex regular expressions.
  • using Named Entity Recognition (and some regexes, too).

How can I parse numerical values out of such sentences? I assume this is a common task in NLP.


The desired output would be something like:

Input:

"greater than $10"

Output:

{'value': 10, 'unit': 'dollar', 'relation': 'gt', 'position': 3} 
asked Jul 16 '17 at 07:07 by svfat



1 Answer

I would probably approach this as a chunking task and use nltk's part-of-speech tagger combined with its regular expression chunker. This lets you define a regular expression over the parts of speech of the words in your sentences instead of over the words themselves. For a given sentence, you can do the following:

    import nltk

    # example sentence
    sent = 'send me a table with a price greater than $100'

The first thing I would do is modify your sentences slightly so that you don't confuse the part-of-speech tagger too much. Here are some examples of changes that you can make (with very simple regular expressions), but you can experiment and see if there are others:

    $10    -> 10 dollars
    200lbs -> 200 lbs
    5-7    -> 5 - 7 OR 5 to 7

So we get:

    sent = 'send me a table with a price greater than 100 dollars'
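The rewrites above could be sketched with a few ordered substitutions. This is only a rough illustration: the `_RULES` table and the `normalize` helper are hypothetical names, not part of nltk, and the patterns assume plain digit numbers (the last rule would also split tokens such as "3rd").

```python
import re

# Hypothetical rewrite rules, applied in order. Assumes numbers are written
# with digits; extend or tighten the patterns for your own data.
_RULES = [
    # "$10" -> "10 dollars"
    (re.compile(r'\$(\d+(?:\.\d+)?)'), r'\1 dollars'),
    # "5-7" -> "5 to 7"
    (re.compile(r'(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)'), r'\1 to \2'),
    # "200lbs" -> "200 lbs" (separate a digit from the letters glued to it)
    (re.compile(r'(\d)([A-Za-z]+)'), r'\1 \2'),
]


def normalize(sentence):
    """Apply each rewrite rule in turn and return the cleaned sentence."""
    for pattern, replacement in _RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence


sent = normalize('send me a table with a price greater than $100')
```

The rules run in a fixed order so that the currency rule sees the "$" before anything else touches the number.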

Now you can get the parts of speech for your sentence:

    sent_pos = nltk.pos_tag(sent.split())
    print(sent_pos)

    [('send', 'VB'), ('me', 'PRP'), ('a', 'DT'), ('table', 'NN'), ('with', 'IN'), ('a', 'DT'), ('price', 'NN'), ('greater', 'JJR'), ('than', 'IN'), ('100', 'CD'), ('dollars', 'NNS')]

We can now create a chunker which will chunk your POS tagged text according to a (relatively) simple regular expression:

    grammar = 'NumericalPhrase: {<NN|NNS>?<RB>?<JJR><IN><CD><NN|NNS>?}'
    parser = nltk.RegexpParser(grammar)

This defines a parser with a grammar that chunks numerical phrases (what we'll call your phrase type). It defines your numerical phrase as: an optional noun, followed by an optional adverb, followed by a comparative adjective, a preposition, a number, and an optional noun. This is just a suggestion for how you may want to define your phrases, but I think that this will be much simpler than using a regular expression on the words themselves.

To get your phrases you can do:

    print(parser.parse(sent_pos))

    (S
      send/VB
      me/PRP
      a/DT
      table/NN
      with/IN
      a/DT
      (NumericalPhrase price/NN greater/JJR than/IN 100/CD dollars/NNS))

Or to get only your phrases you can do:

    print([tree.leaves() for tree in parser.parse(sent_pos).subtrees() if tree.label() == 'NumericalPhrase'])

    [[('price', 'NN'),
      ('greater', 'JJR'),
      ('than', 'IN'),
      ('100', 'CD'),
      ('dollars', 'NNS')]]
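From the leaves of a chunk you could then build a dictionary close to the one asked for in the question. The sketch below is one possible way to do it, not part of nltk: `RELATIONS`, `NEGATED`, `UNITS` and `chunk_to_record` are hypothetical names, the lookup tables are deliberately tiny, and it assumes the number is written with digits. It leaves out the `position` field, which would need the token offsets from the full sentence.

```python
# Hypothetical lookup tables; the words and relation codes are illustrative.
RELATIONS = {'greater': 'gt', 'more': 'gt', 'less': 'lt', 'fewer': 'lt', 'smaller': 'lt'}
NEGATED = {'gt': 'le', 'lt': 'ge'}  # "not more than" -> less than or equal
UNITS = {'dollars': 'dollar', 'dollar': 'dollar', 'lbs': 'pound', 'lb': 'pound'}


def chunk_to_record(leaves):
    """Turn the (word, tag) leaves of one NumericalPhrase chunk into a dict."""
    record = {'value': None, 'unit': None, 'relation': None}
    negated = False
    seen_number = False
    for word, tag in leaves:
        word = word.lower()
        if tag == 'RB' and word in ('not', "n't"):
            negated = True
        elif tag == 'JJR':
            # Unknown comparatives are left as None rather than guessed.
            record['relation'] = RELATIONS.get(word)
        elif tag == 'CD':
            # Assumes digits such as "100" or "1,000.5", not words like "ten".
            number = word.replace(',', '')
            record['value'] = float(number) if '.' in number else int(number)
            seen_number = True
        elif tag in ('NN', 'NNS') and seen_number:
            # Only a noun after the number is treated as the unit.
            record['unit'] = UNITS.get(word, word)
    if negated and record['relation'] in NEGATED:
        record['relation'] = NEGATED[record['relation']]
    return record


leaves = [('price', 'NN'), ('greater', 'JJR'), ('than', 'IN'),
          ('100', 'CD'), ('dollars', 'NNS')]
record = chunk_to_record(leaves)
```

Keeping the word-to-relation mapping in a plain table means you can add comparatives for your own domain without touching the chunk grammar.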
answered Sep 22 '22 at 05:09 by bunji