
NLP reverse tokenizing (going from tokens to nicely formatted sentence)

Tags: python, nlp, spacy

Python's spaCy package has a statistical tokenizer that intelligently splits a sentence into tokens. Is there a package that lets me go backwards, i.e. from a list of tokens to a nicely formatted sentence? Essentially, I want a function that lets me do the following:

>>> toks = ['hello', ',', 'i', 'ca', "n't", 'feel', 'my', 'feet', '!']
>>> some_function(toks)
"Hello, I can't feel my feet!"

It probably needs some sort of statistical or rule-based procedure to know how spacing, capitalization, and contractions should work in a proper sentence.

Nigel Ng asked May 24 '17 12:05



1 Answer

Within spaCy you can always reconstruct the original string using ''.join(token.text_with_ws for token in doc), because each token records the whitespace that followed it. If all you have is a list of strings, that whitespace information is gone and there's not really a good deterministic solution. You could train a reverse model or use some approximate rules. I don't know of a good general-purpose implementation of this detokenize() function.
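To make the "approximate rules" option concrete, here is a minimal sketch using only the standard library. It is not a spaCy API and not a general solution: the name detokenize, the punctuation sets, and the clitic pattern are all assumptions chosen to cover the example in the question, and they will mishandle quotes, abbreviations, proper nouns, and most non-English text.

```python
import re

# Hypothetical rule sets; extend or replace them for your own data.
NO_SPACE_BEFORE = {",", ".", "!", "?", ";", ":", ")", "]", "}", "%"}
NO_SPACE_AFTER = {"(", "[", "{", "$"}
SENTENCE_END = {".", "!", "?"}
# English clitics that tokenizers commonly split off ("ca" + "n't", "it" + "'s").
CLITIC = re.compile(r"^(n't|'s|'m|'re|'ve|'ll|'d)$", re.IGNORECASE)


def detokenize(tokens):
    """Join tokens into a sentence using approximate spacing and casing rules."""
    pieces = []
    capitalize_next = True  # the first word of the text starts a sentence
    prev = None
    for tok in tokens:
        if not tok:
            continue  # ignore empty strings
        text = tok
        if tok == "i":
            text = "I"  # the English pronoun is always capitalized
        elif capitalize_next and tok[0].isalpha():
            text = tok[0].upper() + tok[1:]
        # Glue the token to the previous one when no space belongs between them.
        attach = (
            prev is None
            or tok in NO_SPACE_BEFORE
            or bool(CLITIC.match(tok))
            or prev in NO_SPACE_AFTER
        )
        pieces.append(text if attach else " " + text)
        if tok in SENTENCE_END:
            capitalize_next = True
        elif tok[0].isalnum():
            capitalize_next = False
        prev = tok
    return "".join(pieces)


toks = ['hello', ',', 'i', 'ca', "n't", 'feel', 'my', 'feet', '!']
print(detokenize(toks))  # Hello, I can't feel my feet!
```

If the tokens come from a spaCy Doc, prefer the text_with_ws join above, since it is exact; rules like these only guess at the whitespace and casing that the token list no longer carries.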

syllogism_ answered Sep 27 '22 23:09
