There are many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.
import nltk
words = nltk.word_tokenize("I've found a medicine for my disease.")

The result I get is: ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
Is there any function that reverts the tokenized sentence to its original state? The function tokenize.untokenize() for some reason doesn't work.
Edit:
I know that I can do something like the following, and this probably solves the problem, but I am curious whether there is an integrated function for this:

result = ' '.join(sentence).replace(' , ',',').replace(' .','.').replace(' !','!')
result = result.replace(' ?','?').replace(' : ',': ').replace(' \'', '\'')
The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This implementation is a port of the tokenizer sed script written by Robert McIntyre, available at http://www.cis.upenn.edu/~treebank/tokenizer.sed (class nltk.tokenize.treebank.TreebankWordTokenizer).
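For context, here is a minimal sketch of using the Treebank tokenizer directly, assuming a standard NLTK installation (nltk.word_tokenize is based on these Treebank rules after sentence splitting):

from nltk.tokenize.treebank import TreebankWordTokenizer

# Treebank rules split contractions such as "I've" into "I" and "'ve"
# and separate the final period into its own token.
tokens = TreebankWordTokenizer().tokenize("I've found a medicine for my disease.")
print(tokens)
# Expected: ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']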
Project description: this package provides wrappers for some of the pre-processing Perl scripts from the Moses toolkit, such as normalize-punctuation.
You can use the "Treebank detokenizer", TreebankWordDetokenizer:

from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'the quick brown'
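Applied to the tokens from the question, the detokenizer should reattach the contraction and the final period, rather than just joining tokens with spaces (a quick sketch, assuming NLTK is installed):

from nltk.tokenize.treebank import TreebankWordDetokenizer

tokens = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
# Clitics like "'ve" and trailing punctuation are attached to the
# preceding word during detokenization.
print(TreebankWordDetokenizer().detokenize(tokens))
# Expected: "I've found a medicine for my disease."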
There is also MosesDetokenizer, which used to be in nltk but was removed because of licensing issues; it is available in the Sacremoses standalone package.
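A rough usage sketch, assuming the package is installed with pip install sacremoses (the exact output may differ slightly from the Treebank detokenizer):

from sacremoses import MosesDetokenizer

detok = MosesDetokenizer(lang='en')
# Moses detokenization rules also attach punctuation and English
# contractions to the neighbouring word.
print(detok.detokenize(['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']))
# Expected: something close to "I've found a medicine for my disease."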