How can I best determine the correct capitalization for a word?

Question

I have a database containing sentences which only contain capitalized letters. The database is technical, containing medical terms, and I want to normalize it so that the capitalization is (close to) what the user expects. What is the best way to achieve this? Is there a freely available dataset I can use to help with the process?

tobigue · Accepted Answer

One approach is to infer capitalization from part-of-speech (POS) tags, for example using the Python Natural Language Toolkit (NLTK):

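import re
import nltk  # the tokenizer and POS-tagger models must be installed via nltk.download()

def truecase(text):
    truecased_sents = []  # list of truecased sentences
    # split into sentences so that each one gets a capitalized first word
    for sent in nltk.sent_tokenize(text):
        # apply POS-tagging to the lowercased tokens
        tagged_sent = nltk.pos_tag([word.lower() for word in nltk.word_tokenize(sent)])
        # infer capitalization from POS-tags: capitalize singular and plural nouns
        normalized_sent = [w.capitalize() if t in ("NN", "NNS") else w for (w, t) in tagged_sent]
        # capitalize first word in sentence
        normalized_sent[0] = normalized_sent[0].capitalize()
        # use a regular expression to remove the space in front of punctuation
        truecased_sents.append(re.sub(r" (?=[.,'!?:;])", "", " ".join(normalized_sent)))
    return " ".join(truecased_sents)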

This will not be perfect, especially because I don't know exactly what your data looks like, but it should give you the idea:

>>> text = "Clonazepam Has Been Approved As An Anticonvulsant To Be Manufactured In 0.5mg, 1mg And 2mg Tablets. It Is The Generic Equivalent Of Roche Laboratories' Klonopin."
>>> truecase(text)
"Clonazepam has been approved as an anticonvulsant to be manufactured in 0.5mg, 1mg and 2mg Tablets. It is the generic Equivalent of Roche Laboratories' Klonopin."
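
The capitalization rule and the punctuation clean-up can also be tried in isolation, without the NLTK models installed. The following is only a minimal sketch: `capitalize_tagged` is a hypothetical helper name, and the tags in the example are written by hand for illustration, so a real tagger may label the same words differently.

```python
import re

def capitalize_tagged(tagged_tokens, noun_tags=("NN", "NNS")):
    """Apply the noun-capitalization rule to (word, tag) pairs and rejoin them."""
    # capitalize every token whose tag marks it as a singular or plural noun
    words = [w.capitalize() if t in noun_tags else w for (w, t) in tagged_tokens]
    if words:
        # the first word of the sentence is always capitalized
        words[0] = words[0].capitalize()
    # remove the space that " ".join() leaves in front of punctuation
    return re.sub(r" (?=[.,'!?:;])", "", " ".join(words))

# Hand-written tags (an assumption for this example, not tagger output).
tagged = [("it", "PRP"), ("is", "VBZ"), ("the", "DT"), ("generic", "JJ"),
          ("equivalent", "NN"), ("of", "IN"), ("klonopin", "NN"), (".", ".")]

print(capitalize_tagged(tagged))  # It is the generic Equivalent of Klonopin.
```

Because the rule capitalizes every token tagged as a noun, common nouns such as "Equivalent" come out capitalized as well, which is the same effect visible in the output above. Narrowing or extending `noun_tags` is one way to tune this for your data.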

Tags:

nlp

Mike

1 Answers

tobigue
