I am trying to find mention of age in a large dataset of messages posted by users on the internet (stored in a .csv)
I am currently using regular expressions in python to extract age and save it in a list
For example, "I am 20 years old" would return 20 to the list "He is 30 now" would return 30 "She is in her fifties" would return 50
But the problem is, using RE is very slow for a huge dataset and if text is in a pattern not satisfied by my RE, then I cannot get the age... So, my question is: Is there a better way of doing this? Perhaps some NLP packages/tools in python? I tried researching if nltk has something for this, but it doesnt.
ps:Sorry if the question is unclear, english is not my first language.. I have included some of the RE i used below..
m = re.search(r'.*(I|He|She) (is|am) ([0-9]{2}).*',s,re.IGNORECASE)
n = re.search(r'.*(I|He|She) (is|am) in (my|his|her) (late|mid|early)? ?(tens|twenties|thirties|forties|fifties|sixties|seventies|eighties|nineties|hundreds).*',s,re.IGNORECASE)
o = re.search(r'.*(I|He|She) (is|am) (twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen) ?(one|two|three|four|five|six|seven|eight|nine)?.*',s,re.IGNORECASE)
p = re.search(r'.*(age|is|@|was) ([0-9]{2}).*',s,re.IGNORECASE)
q = re.search(r'.*(age|is|@|was) (twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen) ?(one|two|three|four|five|six|seven|eight|nine)?.*',s,re.IGNORECASE)
r = re.search(r'.*([0-9]{2}) (yrs|years).*',s,re.IGNORECASE)
s = re.search(r'.*(twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen) ?(one|two|three|four|five|six|seven|eight|nine)? (yrs|years).*',s,re.IGNORECASE)
The task of Information Extraction (IE) involves extracting meaningful information from unstructured text data and presenting it in a structured format.
Information extraction is the process of extracting information from unstructured textual sources to enable finding entities as well as classifying and storing them in a database.
See Extracting a person's age from unstructured text in Python, particularly the answer to do with using Allen NLP, which appears to be just what you're asking for.
I'd like to recommend you train a neural network with three multiclass classifiers/heads to predict three digits corresponding to the ones, tens and hundreds.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With