There are many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.
import nltk
words = nltk.word_tokenize("I've found a medicine for my disease.")

The result I get is: ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
Is there any function that reverts the tokenized sentence to its original state? The function tokenize.untokenize() for some reason doesn't work.
Edit:
I know that I can do something like the following, and this probably solves the problem, but I am curious whether there is an integrated function for this:

result = ' '.join(sentence).replace(' , ',',').replace(' .','.').replace(' !','!')
result = result.replace(' ?','?').replace(' : ',': ').replace(' \'', '\'')
The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This implementation is a port of the tokenizer sed script written by Robert McIntyre, available at http://www.cis.upenn.edu/~treebank/tokenizer.sed (class nltk.tokenize.treebank.TreebankWordTokenizer).
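For context, here is a minimal sketch of using the Treebank tokenizer directly, assuming a standard NLTK installation (nltk.word_tokenize is based on these Treebank rules after sentence splitting):

from nltk.tokenize.treebank import TreebankWordTokenizer

# Treebank rules split contractions such as "I've" into "I" and "'ve"
# and separate the final period into its own token.
tokens = TreebankWordTokenizer().tokenize("I've found a medicine for my disease.")
print(tokens)
# Expected: ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']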
Project description: this package provides wrappers for some of the pre-processing Perl scripts from the Moses toolkit, such as normalize-punctuation.
You can use the "Treebank detokenizer", TreebankWordDetokenizer:

from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'the quick brown'
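Applied to the tokens from the question, the detokenizer should reattach the contraction and the final period, rather than just joining tokens with spaces (a quick sketch, assuming NLTK is installed):

from nltk.tokenize.treebank import TreebankWordDetokenizer

tokens = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
# Clitics like "'ve" and trailing punctuation are attached to the
# preceding word during detokenization.
print(TreebankWordDetokenizer().detokenize(tokens))
# Expected: "I've found a medicine for my disease."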
There is also MosesDetokenizer, which used to be in nltk but was removed because of licensing issues; it is available in the Sacremoses standalone package.
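A rough usage sketch, assuming the package is installed with pip install sacremoses (the exact output may differ slightly from the Treebank detokenizer):

from sacremoses import MosesDetokenizer

detok = MosesDetokenizer(lang='en')
# Moses detokenization rules also attach punctuation and English
# contractions to the neighbouring word.
print(detok.detokenize(['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']))
# Expected: something close to "I've found a medicine for my disease."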