I've finished gathering the data I plan to use for my corpus, but I'm unsure whether I should normalize the text. I plan to tag and chunk the corpus later. Some of NLTK's corpora are all lowercase and others aren't.
Can anyone shed some light on this subject, please?
By "normalize" do you just mean making everything lowercase?
Whether to lowercase everything really depends on what you plan to do. For some purposes, lowercasing is better because it reduces the sparsity of the data: capitalized forms of a word are rarer than the lowercase form and can confuse the system unless your corpus is large enough to give decent statistics on them. In other tasks, case information might be valuable.
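To make the sparsity point concrete, here is a minimal sketch that counts word types with and without lowercasing. The token list is made up for illustration; it is not from any real corpus.

```python
from collections import Counter

# Toy token list, invented for this example.
tokens = ["The", "cat", "saw", "the", "dog", ".", "The", "dog", "ran", "."]

# Count types as they appear, then again after lowercasing.
cased = Counter(tokens)
lowered = Counter(t.lower() for t in tokens)

# Lowercasing merges "The" and "the" into one type with a higher count.
print(len(cased), len(lowered))
print(cased["the"], lowered["the"])
```

With case kept, "The" and "the" are counted as separate types, each with fewer observations. After lowercasing they share one count.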
Additionally, there are similar decisions you'll have to make about tokenization. For example, should "can't" be treated as ["can't"], ["can", "'t"], or ["ca", "n't"]? (I've seen all three in different corpora.) What about "7-year-old"? Is it one long word, or three words that should be separated?
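The choices above can be sketched as small helper functions. The names `split_contraction` and `split_hyphens` and the style labels are hypothetical, made up for this example; this is not how any particular tokenizer is implemented. As far as I know, the ["ca", "n't"] split is the one used by the Penn Treebank convention.

```python
def split_contraction(word, style):
    """Split an English "n't" contraction using one of three conventions.

    The style names are hypothetical labels for this sketch.
    """
    if style not in ("whole", "apostrophe", "treebank"):
        raise ValueError(f"unknown style: {style!r}")
    if style == "whole" or not word.lower().endswith("n't"):
        return [word]                      # "can't" -> ["can't"]
    if style == "apostrophe":
        return [word[:-2], word[-2:]]      # "can't" -> ["can", "'t"]
    return [word[:-3], word[-3:]]          # "can't" -> ["ca", "n't"]


def split_hyphens(word, keep_whole=True):
    """Keep a hyphenated compound as one token, or split it on hyphens."""
    return [word] if keep_whole else word.split("-")


for style in ("whole", "apostrophe", "treebank"):
    print(style, split_contraction("can't", style))

print(split_hyphens("7-year-old"))
print(split_hyphens("7-year-old", keep_whole=False))
```

Whichever convention you pick, the main thing is to use the same one everywhere, and to match the convention of any tagger or chunker you train or reuse.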
That said, there's no reason to reformat the corpus. You can just have your code make these changes on the fly. That way the original information is still around later if you ever need it.
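Here is a minimal sketch of what "on the fly" could look like, assuming the corpus is available as lists of tokens per sentence. The function name `normalized` and the sample sentences are made up for illustration.

```python
def normalized(sentences, lowercase=True):
    """Yield normalized copies of each sentence; the input is not modified."""
    for sentence in sentences:
        yield [tok.lower() if lowercase else tok for tok in sentence]


# Toy corpus, invented for this example.
corpus = [["The", "Cat", "sat", "."], ["NLTK", "is", "useful", "."]]

lowered = list(normalized(corpus))
print(lowered[0])   # normalized view
print(corpus[0])    # original tokens are untouched
```

Because the normalization lives in code rather than in the stored files, you can turn it off (`lowercase=False`) or change it later without regathering the data.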