Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I best determine the correct capitalization for a word?

Tags:

nlp

I have a database containing sentences which only contain capitalized letters. The database is technical, containing medical terms, and I want to normalize it so that the capitalization is (close to) what the user expects. What is the best way to achieve this? Is there a freely available dataset I can use to help with the process?

like image 321
Mike Avatar asked Oct 09 '11 21:10

Mike


1 Answers

One way could be to infer capitalization from POS-tagging, for example using the Python Natural Language Toolkit (NLTK):

import nltk, re

def truecase(text):
    truecased_sents = [] # list of truecased sentences
    # apply POS-tagging
    tagged_sent = nltk.pos_tag([word.lower() for word in nltk.word_tokenize(text)])
    # infer capitalization from POS-tags
    normalized_sent = [w.capitalize() if t in ["NN","NNS"] else w for (w,t) in tagged_sent]
    # capitalize first word in sentence
    normalized_sent[0] = normalized_sent[0].capitalize()
    # use regular expression to get punctuation right
    pretty_string = re.sub(" (?=[\.,'!?:;])", "", ' '.join(normalized_sent))
    return pretty_string

This will not be perfect, especially because I don't know what your data exactely looks like, but maybe you can get the idea:

>>> text = "Clonazepam Has Been Approved As An Anticonvulsant To Be Manufactured In 0.5mg, 1mg And 2mg Tablets. It Is The Generic Equivalent Of Roche Laboratories' Klonopin."
>>> truecase(text)
"Clonazepam has been approved as an anticonvulsant to be manufactured in 0.5mg, 1mg and 2mg Tablets. It is the generic Equivalent of Roche Laboratories' Klonopin."
like image 51
tobigue Avatar answered Oct 21 '22 21:10

tobigue