 

Extracting age-related info from text

Tags: python, regex, nlp

I am trying to find mentions of age in a large dataset of messages posted by users on the internet (stored in a .csv file).

I am currently using regular expressions in Python to extract each age and save it in a list.

For example:

"I am 20 years old" would add 20 to the list
"He is 30 now" would add 30
"She is in her fifties" would add 50

But the problem is that using regular expressions is very slow on a huge dataset, and if the text follows a pattern my expressions do not cover, I cannot get the age. So my question is: is there a better way of doing this? Perhaps some NLP packages or tools in Python? I looked into whether NLTK has something for this, but it doesn't.

PS: Sorry if the question is unclear; English is not my first language. I have included some of the regular expressions I used below.

import re

# s is the message text being searched
m = re.search(r'.*(I|He|She) (is|am) ([0-9]{2}).*',s,re.IGNORECASE)
n = re.search(r'.*(I|He|She) (is|am) in (my|his|her) (late|mid|early)? ?(tens|twenties|thirties|forties|fifties|sixties|seventies|eighties|nineties|hundreds).*',s,re.IGNORECASE)
o = re.search(r'.*(I|He|She) (is|am) (twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen) ?(one|two|three|four|five|six|seven|eight|nine)?.*',s,re.IGNORECASE)
p = re.search(r'.*(age|is|@|was) ([0-9]{2}).*',s,re.IGNORECASE)
q = re.search(r'.*(age|is|@|was) (twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen) ?(one|two|three|four|five|six|seven|eight|nine)?.*',s,re.IGNORECASE)
r = re.search(r'.*([0-9]{2}) (yrs|years).*',s,re.IGNORECASE)
# named t rather than s so that the input string s is not overwritten
t = re.search(r'.*(twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen) ?(one|two|three|four|five|six|seven|eight|nine)? (yrs|years).*',s,re.IGNORECASE)
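
For comparison, here is a minimal sketch of how several of the patterns above might be folded into one precompiled expression. It drops the leading and trailing `.*` (unnecessary, since `re.search` already scans the whole string) and converts the match to an integer. The lookup tables, the name `extract_age`, and the exact set of trigger words are assumptions made for illustration. This only targets the speed concern; it will still miss phrasings the pattern does not cover and will still produce false positives.

```python
import re

# Hypothetical lookup tables for spelled-out numbers (not from the original post).
UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
# Teens are listed before the units so longer words are tried first.
SMALL = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
         "nineteen": 19, **UNITS}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
DECADES = {"tens": 10, "twenties": 20, "thirties": 30, "forties": 40,
           "fifties": 50, "sixties": 60, "seventies": 70, "eighties": 80,
           "nineties": 90, "hundreds": 100}


def _alt(words):
    """Join dictionary keys into a regex alternation."""
    return "|".join(words)


# Compiled once at import time and reused for every message.
AGE_RE = re.compile(
    r"\b(?:i am|i'm|he is|she is|he's|she's|aged?|was|is)\s+"
    r"(?:(?P<digits>\d{1,3})"
    r"|in (?:my|his|her) (?:(?:early|mid|late) )?(?P<decade>" + _alt(DECADES) + r")"
    r"|(?P<tens>" + _alt(TENS) + r")(?:[ -](?P<unit>" + _alt(UNITS) + r"))?"
    r"|(?P<small>" + _alt(SMALL) + r"))\b",
    re.IGNORECASE,
)


def extract_age(text):
    """Return the first age-like number found in text, or None."""
    match = AGE_RE.search(text)
    if match is None:
        return None
    if match.group("digits"):
        return int(match.group("digits"))
    if match.group("decade"):
        return DECADES[match.group("decade").lower()]
    if match.group("tens"):
        unit = match.group("unit")
        return TENS[match.group("tens").lower()] + (UNITS[unit.lower()] if unit else 0)
    return SMALL[match.group("small").lower()]
```

Calling `extract_age(message)` for each row of the .csv would then replace the seven separate `re.search` calls with a single scan per message.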
krzna, asked Apr 10 '15 at 23:04



2 Answers

See "Extracting a person's age from unstructured text in Python", particularly the answer that uses AllenNLP, which appears to be just what you're asking for.

chaos, answered Oct 06 '22 at 04:10


I'd recommend training a neural network with three multiclass classification heads, each predicting one digit of the age: the ones, the tens, and the hundreds.
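
The answer does not describe the network itself, so the following is only a sketch of the label encoding such a setup would imply: an age is split into three digit classes for training, and the three heads' predictions are recombined at inference time. The function names are hypothetical, and treating each head as a 10-way classifier over the digits 0 to 9 is an assumption.

```python
def age_to_digit_labels(age):
    """Split an age into (ones, tens, hundreds) class labels for the three heads."""
    if not 0 <= age <= 999:
        raise ValueError("age must fit in three digits")
    return age % 10, (age // 10) % 10, age // 100


def digit_labels_to_age(ones, tens, hundreds):
    """Recombine three predicted digits into a single age."""
    return hundreds * 100 + tens * 10 + ones


def decode_heads(ones_scores, tens_scores, hundreds_scores):
    """Pick the highest-scoring digit from each head and rebuild the age.

    Each argument is a sequence of 10 scores (assumed: one per digit 0-9),
    as a classifier head might output.
    """
    def argmax(scores):
        return max(range(len(scores)), key=scores.__getitem__)

    return digit_labels_to_age(
        argmax(ones_scores), argmax(tens_scores), argmax(hundreds_scores)
    )
```

A message with no age mention would need its own handling (for example an extra "no age" class), which this sketch leaves out.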

Lerner Zhang, answered Oct 06 '22 at 03:10