nltk tokenization and contractions

Tags: python, nlp, nltk

I'm tokenizing text with NLTK, just sentences fed to wordpunct_tokenize. This splits contractions (e.g. "don't" becomes 'don', "'", 't'), but I want to keep them as one word. I'm refining my methods for more precise tokenization of text, so I need to dig deeper into the NLTK tokenize module, beyond simple tokenization.
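For reference, a minimal sketch that reproduces the split, assuming NLTK is installed (the sample sentence is made up for illustration):

```python
from nltk.tokenize import wordpunct_tokenize

# Made-up sample sentence, used only for illustration.
sentence = "I don't know."

# wordpunct_tokenize separates runs of word characters from runs of
# punctuation, so the apostrophe ends up as its own token.
tokens = wordpunct_tokenize(sentence)
print(tokens)  # ['I', 'don', "'", 't', 'know', '.']
```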

I'm guessing this is common, and I'd like feedback from others who may have dealt with this particular issue before.

EDIT:

Yeah, this is a general, scattershot question, I know.

Also, as a novice to NLP, do I need to worry about contractions at all?

EDIT:

The SExprTokenizer or TreebankWordTokenizer seems to do what I'm looking for, at least for now.


asked Jul 05 '12 19:07

blueblank

2 Answers

Which tokenizer you use really depends on what you want to do next. As inspectorG4dget said, some part-of-speech taggers handle split contractions, and in that case the splitting is a good thing. But maybe that's not what you want. To decide which tokenizer is best, consider what you need for the next step, and then submit your text to http://text-processing.com/demo/tokenize/ to see how each NLTK tokenizer behaves.
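If you would rather compare locally than through the demo page, a rough sketch along these lines may help. It assumes NLTK is installed; the sample sentence and the regular expression passed to RegexpTokenizer are illustrative assumptions, not a recommended pattern.

```python
from nltk.tokenize import (
    RegexpTokenizer,
    TreebankWordTokenizer,
    WhitespaceTokenizer,
    WordPunctTokenizer,
)

# Made-up sample sentence, used only for illustration.
sentence = "I don't think they'll mind."

# The regexp is an assumption for this sketch: a word, optionally followed
# by an apostrophe plus more word characters, or else a run of punctuation.
tokenizers = {
    "wordpunct": WordPunctTokenizer(),
    "treebank": TreebankWordTokenizer(),
    "whitespace": WhitespaceTokenizer(),
    "regexp": RegexpTokenizer(r"\w+(?:'\w+)?|[^\w\s]+"),
}

# Run every tokenizer on the same sentence and print the results side by side.
results = {name: tok.tokenize(sentence) for name, tok in tokenizers.items()}
for name, tokens in results.items():
    print(f"{name:>10}: {tokens}")
```

On this sentence, WordPunctTokenizer splits at the apostrophe, TreebankWordTokenizer produces pieces such as "do" and "n't" (a form some taggers expect), and the whitespace and regexp variants keep "don't" whole. The whitespace variant also leaves the final period attached to "mind.", which may or may not suit your next step.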

answered Oct 08 '22 15:10

Jacob

I've worked with NLTK before on this project. When I did, I found that contractions were useful to consider.

However, I did not write a custom tokenizer; I simply handled it after POS tagging.
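As a sketch of what handling it after tagging could look like (not necessarily what that project did): the helper below rejoins Treebank-style clitics with the preceding token and keeps both tags. The function name merge_contractions, the CLITICS set, and the hand-written tags are all hypothetical and chosen for illustration.

```python
# Hypothetical set of clitic tokens, in the form a Treebank-style tokenizer
# would emit them; extend as needed for your data.
CLITICS = {"n't", "'s", "'re", "'ve", "'ll", "'d", "'m"}


def merge_contractions(tagged):
    """Rejoin clitics with the preceding token, collecting the tags of each part."""
    merged = []
    for token, tag in tagged:
        if token.lower() in CLITICS and merged:
            prev_token, prev_tags = merged[-1]
            merged[-1] = (prev_token + token, prev_tags + [tag])
        else:
            merged.append((token, [tag]))
    return merged


# Hand-written (token, tag) pairs standing in for real tagger output.
tagged = [("I", "PRP"), ("do", "VBP"), ("n't", "RB"), ("know", "VB"), (".", ".")]
merged = merge_contractions(tagged)
print(merged)
# [('I', ['PRP']), ("don't", ['VBP', 'RB']), ('know', ['VB']), ('.', ['.'])]
```

Keeping the list of tags means the contraction reads as one word again without throwing away what the tagger said about each part.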

I suspect this is not the answer that you are looking for, but I hope it helps somewhat.

answered Oct 08 '22 14:10

inspectorG4dget
