Python's spaCy package has a tokenizer that intelligently splits a sentence into tokens. My question is: is there a package that lets me go the other way, i.e. from a list of tokens back to a nicely formatted sentence? Essentially, I want a function that does the following:
>>> toks = ['hello', ',', 'i', 'ca', "n't", 'feel', 'my', 'feet', '!']
>>> some_function(toks)
"Hello, I can't feel my feet!"
It probably needs some sort of statistical or rule-based procedure to know how spacing, capitalization, and contractions should be handled in a proper sentence.
Within spaCy you can always reconstruct the original string with ''.join(token.text_with_ws for token in doc), because each token stores its trailing whitespace. If all you have is a plain list of strings, there is no reliable deterministic solution: you could train a reverse model or apply some approximate rules. I don't know of a good general-purpose implementation of such a detokenize() function.
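As a starting point, the "approximate rules" approach might look like the sketch below: join on spaces, then patch up punctuation, contractions, and capitalization with regular expressions. The function name detokenize and the specific rules are my own choices, not a standard API, and they will not cover every edge case (quotes, currency, hyphenation, etc.).

```python
import re

def detokenize(tokens):
    """Approximate rule-based detokenizer (a sketch, not a general solution)."""
    text = " ".join(tokens)
    # Remove the space before closing punctuation: "feet !" -> "feet!"
    text = re.sub(r"\s+([,.!?;:%)\]])", r"\1", text)
    # Remove the space after opening brackets: "( foo" -> "(foo"
    text = re.sub(r"([(\[])\s+", r"\1", text)
    # Reattach contraction fragments: "ca n't" -> "can't", "he 's" -> "he's"
    text = re.sub(r"\s+(n't|'s|'re|'ve|'ll|'d|'m)\b", r"\1", text)
    # Capitalize the standalone pronoun "i" and the first letter of the sentence.
    text = re.sub(r"\bi\b", "I", text)
    return text[:1].upper() + text[1:]

toks = ['hello', ',', 'i', 'ca', "n't", 'feel', 'my', 'feet', '!']
print(detokenize(toks))  # Hello, I can't feel my feet!
```

This handles the example above, but a rule list like this grows quickly once you feed it real text, which is why the answer suggests a learned reverse model for anything beyond simple cases.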