
Python Untokenize a sentence

There are so many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.

    import nltk
    words = nltk.word_tokenize("I've found a medicine for my disease.")

The result I get is:

    ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']

Is there any function that reverts the tokenized sentence to its original state? The function tokenize.untokenize() doesn't work for some reason.

Edit:

I know that I can do something like the following, which probably solves the problem, but I am curious whether there is a built-in function for this:

    result = ' '.join(words).replace(' , ', ',').replace(' .', '.').replace(' !', '!')
    result = result.replace(' ?', '?').replace(' : ', ': ').replace(' \'', '\'')
Brana asked Feb 22 '14 00:02




1 Answer

You can use NLTK's Treebank detokenizer, TreebankWordDetokenizer:

    from nltk.tokenize.treebank import TreebankWordDetokenizer
    TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
    # 'the quick brown'
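Applied to the tokens from the question, this could look like the sketch below. It assumes NLTK is installed; the variable names `words` and `sentence` are only illustrative, and the token list is copied from the question rather than produced by calling `word_tokenize` again.

```python
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Tokens copied from the word_tokenize() output shown in the question.
words = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']

# Join the tokens and undo the Treebank-style splitting of
# contractions and punctuation.
sentence = TreebankWordDetokenizer().detokenize(words)
print(sentence)  # I've found a medicine for my disease.
```

Detokenization is heuristic: information such as the original spacing is lost during tokenization, so the result is not guaranteed to match the original string for every input.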

There is also MosesDetokenizer, which used to be part of NLTK but was removed because of licensing issues; it is now available in the standalone sacremoses package.
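As for tokenize.untokenize() from the question: that function belongs to Python's standard-library `tokenize` module, which works on Python source code tokens rather than natural-language words, so it is not meant for a list of words. A minimal sketch of what it is designed for (the sample `source` string is just an illustration):

```python
import io
import tokenize

source = "x = 1 + 2\n"

# Tokenize Python source code (not natural language).
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# untokenize() expects these token tuples and rebuilds source code from them.
restored = tokenize.untokenize(tokens)

# Re-tokenize the rebuilt source and compare token types and strings.
retokenized = list(tokenize.generate_tokens(io.StringIO(restored).readline))
same_tokens = (
    [(t.type, t.string) for t in tokens]
    == [(t.type, t.string) for t in retokenized]
)
print(same_tokens)
```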

alecxe answered Sep 29 '22 04:09