I am tokenizing a text using nltk.word_tokenize and I would also like to get the index of the first character of every token in the original raw text, i.e.
import nltk
x = 'hello world'
tokens = nltk.word_tokenize(x)
# tokens == ['hello', 'world']
How can I also get the array [0, 6]
corresponding to the raw start index of each token?
For background: NLTK's tokenize module provides word_tokenize(), which splits a string into word-level tokens, and sent_tokenize(), which splits a document or paragraph into sentences. Both return lists of plain strings, so neither tells you where a token starts in the original text, which is exactly what the question asks for.
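A minimal sketch of the two functions, assuming the punkt tokenizer data has been downloaded (nltk.download('punkt')):
import nltk
text = "Hello world. Good morning."
# Sentence-level vs. word-level tokenization; note that neither returns offsets.
print(nltk.sent_tokenize(text))  # ['Hello world.', 'Good morning.']
print(nltk.word_tokenize(text))  # ['Hello', 'world', '.', 'Good', 'morning', '.']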
I think what you are looking for is the span_tokenize() method.
Apparently this is not supported by the default tokenizer used by word_tokenize(), so here is a code example with another tokenizer.
from nltk.tokenize import WhitespaceTokenizer
s = "Good muffins cost $3.88\nin New York."
span_generator = WhitespaceTokenizer().span_tokenize(s)
spans = [span for span in span_generator]
print(spans)
Which gives:
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36)]
And to get just the start offsets:
offsets = [span[0] for span in spans]
[0, 5, 13, 18, 24, 27, 31]
For further information on the different tokenizers available, see the tokenize API docs.
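If you also want the token text alongside each span, you can slice the original string with the spans you already have; this small sketch reuses the s and spans variables from the example above:
# Pair each (start, end) span with the substring it covers in the raw text.
tokens_with_spans = [(s[start:end], (start, end)) for start, end in spans]
print(tokens_with_spans)
# [('Good', (0, 4)), ('muffins', (5, 12)), ('cost', (13, 17)), ('$3.88', (18, 23)),
#  ('in', (24, 26)), ('New', (27, 30)), ('York.', (31, 36))]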
You can also do this:
import nltk

def spans(txt):
    # Re-find each token in the raw text, resuming the search after the
    # previous match so repeated tokens get the right offsets.
    tokens = nltk.word_tokenize(txt)
    offset = 0
    for token in tokens:
        offset = txt.find(token, offset)
        yield token, offset, offset + len(token)
        offset += len(token)
s = "And now for something completely different and."
for token in spans(s):
print token
assert token[0]==s[token[1]:token[2]]
And get:
('And', 0, 3)
('now', 4, 7)
('for', 8, 11)
('something', 12, 21)
('completely', 22, 32)
('different', 33, 42)
('and', 43, 46)
('.', 46, 47)
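Applied to the example from the question, the start offsets fall out directly; this is just a usage sketch reusing the spans() generator defined above:
x = 'hello world'
print([start for _, start, _ in spans(x)])
# [0, 6]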