
NLTK - when to normalize the text?

Tags: python, nlp, nltk

I've finished gathering the data I plan to use for my corpus, but I'm a bit confused about whether I should normalize the text. I plan to tag & chunk the corpus in the future. Some of NLTK's corpora are all lowercase and others aren't.

Can anyone shed some light on this subject, please?

greg34 asked Jul 20 '11 20:07


1 Answer

By "normalize" do you just mean making everything lowercase?

Whether to lowercase everything really depends on what you plan to do. For some purposes, lowercasing everything is better because it reduces the sparsity of the data (capitalized forms of words are rarer and might confuse the system unless your corpus is massive enough that the statistics on capitalized words are decent). In other tasks, case information might be valuable (capitalization is a common cue for proper nouns, for instance).
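To make the sparsity point concrete, here is a minimal sketch using only the standard library. The token list is a made-up toy example, not taken from any real corpus; it just shows how case variants split the counts for what is arguably the same word.

```python
from collections import Counter

# Toy token list (made up for illustration, not from any real corpus).
tokens = "The cat saw the dog . The dog saw The Cat .".split()

# With case preserved, "The"/"the" and "Cat"/"cat" are separate types.
cased = Counter(tokens)

# After lowercasing, the case variants are merged into one type each.
lowered = Counter(t.lower() for t in tokens)

print(len(cased), "distinct types with case preserved")
print(len(lowered), "distinct types after lowercasing")
print(cased["The"], cased["the"], lowered["the"])
```

Fewer distinct types means more observations per type, which is what helps on a small corpus. The price is that "Cat" as a name and "cat" as an animal are no longer distinguishable.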

There are other, similar decisions you'll have to make. For example, should "can't" be treated as ["can't"], ["can", "'t"], or ["ca", "n't"] (I've seen all three in different corpora)? What about "7-year-old"? Is it one long word, or three words that should be separated?
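The three conventions for "can't" and the two for "7-year-old" can be sketched with plain regular expressions. The function names below are hypothetical and the patterns are deliberately simplistic; this is not how any particular corpus or NLTK tokenizer is implemented, only an illustration of the choices.

```python
import re

def keep_whole(text):
    # Split on whitespace only: "can't" -> ["can't"]
    return text.split()

def split_before_apostrophe(text):
    # Break right before the apostrophe: "can't" -> ["can", "'t"]
    return re.sub(r"(\w)('\w+)", r"\1 \2", text).split()

def split_nt_clitic(text):
    # Split off "n't" as its own token: "can't" -> ["ca", "n't"]
    return re.sub(r"(\w)(n't)\b", r"\1 \2", text).split()

def split_hyphens(text):
    # Treat hyphens as separators: "7-year-old" -> ["7", "year", "old"]
    return text.replace("-", " ").split()

for tokenize in (keep_whole, split_before_apostrophe, split_nt_clitic):
    print(tokenize.__name__, tokenize("can't"))

print(keep_whole("7-year-old"), split_hyphens("7-year-old"))
```

Whichever convention you pick, it should match the one used by whatever tagger or training data you rely on later, since a tagger trained on ["ca", "n't"] has never seen "can't" as a single token.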

That said, there's no reason to reformat the corpus. You can just have your code make these changes on the fly. That way the original information is still around later if you ever need it.
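One way to do this on the fly is a small generator that wraps whatever yields your tokens. The `normalized` helper and the sample tokens below are hypothetical; with NLTK you would presumably pass in the token sequence from your corpus reader instead of a plain list.

```python
def normalized(tokens, lowercase=True):
    """Yield tokens with normalization applied at read time."""
    for tok in tokens:
        yield tok.lower() if lowercase else tok

# Stand-in for tokens read from the stored, unmodified corpus.
raw_tokens = ["The", "Cat", "sat", "on", "Monday", "."]

# Lowercased view, e.g. for frequency counts.
for_counts = list(normalized(raw_tokens))

# Case-preserving view, e.g. for tagging and chunking.
for_tagging = list(normalized(raw_tokens, lowercase=False))

print(for_counts)
print(for_tagging)
```

The stored data is never modified, so each task can choose its own normalization.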

dhg answered Sep 28 '22 07:09