NLP reverse tokenizing (going from tokens to nicely formatted sentence)

Question

Python's spaCy package has a statistical tokenizer that intelligently splits a sentence into tokens. Is there a package that lets me go backwards, i.e. from a list of tokens to a nicely formatted sentence? Essentially, I want a function that lets me do the following:

>>> toks = ['hello', ',', 'i', 'ca', "n't", 'feel', 'my', 'feet', '!']
>>> some_function(toks)
"Hello, I can't feel my feet!"

It probably needs some sort of statistical or rule-based procedure to know how spacing, capitalization, and contractions should work in a proper sentence.

syllogism_ · Accepted Answer

Within spaCy you can always reconstruct the original string using ''.join(token.text_with_ws for token in doc), because each token keeps its own text plus any whitespace that followed it in the input. If all you have is a list of strings, that whitespace and the original casing are gone, so there's not really a good deterministic solution. You could train a reverse model or use some approximate rules. I don't know of a good general-purpose implementation of this detokenize() function.
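To make the "approximate rules" option concrete, here is a minimal sketch in plain Python. It is not part of spaCy and not a general-purpose solution: the function name detokenize, the punctuation sets, and the list of contraction pieces are all assumptions chosen for illustration, and they only cover simple English input like the example in the question.

```python
# Hypothetical rule-based detokenizer (illustrative only, not a spaCy API).

# Tokens that attach to the previous token, with no space before them.
NO_SPACE_BEFORE = {",", ".", "!", "?", ";", ":", ")", "]", "}", "%"}
# Tokens that attach to the following token, with no space after them.
NO_SPACE_AFTER = {"(", "[", "{", "$"}
# Tokens after which the next word should be capitalized.
SENTENCE_END = {".", "!", "?"}
# Contraction pieces produced by English tokenizers; they join the previous token.
CLITICS = {"n't", "'s", "'re", "'ve", "'ll", "'d", "'m"}


def detokenize(tokens):
    """Join tokens into a sentence using a few spacing and casing rules."""
    pieces = []
    capitalize_next = True  # the first word of the text starts a sentence
    prev = None
    for tok in tokens:
        text = tok
        if text == "i":
            # The English pronoun "I" is always capitalized.
            text = "I"
        elif capitalize_next and text[:1].isalpha():
            text = text[0].upper() + text[1:]

        # Decide whether this token is glued to the previous one.
        glue = (
            prev is None
            or tok in NO_SPACE_BEFORE
            or tok.lower() in CLITICS
            or prev in NO_SPACE_AFTER
        )
        pieces.append(text if glue else " " + text)

        # Track sentence boundaries for capitalization.
        if tok in SENTENCE_END:
            capitalize_next = True
        elif tok[:1].isalnum():
            capitalize_next = False
        prev = tok
    return "".join(pieces)


toks = ['hello', ',', 'i', 'ca', "n't", 'feel', 'my', 'feet', '!']
print(detokenize(toks))  # Hello, I can't feel my feet!
```

Rules like these break quickly on quotes, abbreviations, proper nouns, and hyphenated words, which is why a trained model may be needed for anything beyond simple cases.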

Tags: python, nlp, spacy

Nigel Ng
