I am tokenizing a text using nltk.word_tokenize and I would also like to get the index of the first character of every token in the original raw text, i.e.
import nltk
x = 'hello world'
tokens = nltk.word_tokenize(x)
# tokens == ['hello', 'world']
How can I also get the array [0, 6]
corresponding to the raw start index of each token?
For background: NLTK's tokenize module provides word_tokenize(), which splits a string into word-level tokens, and sent_tokenize(), which splits a document or paragraph into sentences. Both return lists of plain strings, so neither tells you where a token starts in the original text, which is exactly what the question asks for.
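A minimal sketch of the two functions, assuming the punkt tokenizer data has been downloaded (nltk.download('punkt')):
import nltk
text = "Hello world. Good morning."
# Sentence-level vs. word-level tokenization; note that neither returns offsets.
print(nltk.sent_tokenize(text))  # ['Hello world.', 'Good morning.']
print(nltk.word_tokenize(text))  # ['Hello', 'world', '.', 'Good', 'morning', '.']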
I think what you are looking for is the span_tokenize() method.
Apparently this is not supported by the default tokenizer used by word_tokenize(), so here is a code example with another tokenizer.
from nltk.tokenize import WhitespaceTokenizer
s = "Good muffins cost $3.88\nin New York."
span_generator = WhitespaceTokenizer().span_tokenize(s)
spans = [span for span in span_generator]
print(spans)
Which gives:
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36)]
And to get just the start offsets:
offsets = [span[0] for span in spans]
[0, 5, 13, 18, 24, 27, 31]
For further information on the different tokenizers available, see the tokenize API docs.
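If you also want the token text alongside each span, you can slice the original string with the spans you already have; this small sketch reuses the s and spans variables from the example above:
# Pair each (start, end) span with the substring it covers in the raw text.
tokens_with_spans = [(s[start:end], (start, end)) for start, end in spans]
print(tokens_with_spans)
# [('Good', (0, 4)), ('muffins', (5, 12)), ('cost', (13, 17)), ('$3.88', (18, 23)),
#  ('in', (24, 26)), ('New', (27, 30)), ('York.', (31, 36))]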
You can also do this:
import nltk

def spans(txt):
    # Re-find each token in the raw text, resuming the search after the
    # previous match so repeated tokens get the right offsets.
    tokens = nltk.word_tokenize(txt)
    offset = 0
    for token in tokens:
        offset = txt.find(token, offset)
        yield token, offset, offset + len(token)
        offset += len(token)
s = "And now for something completely different and."
for token in spans(s):
print token
assert token[0]==s[token[1]:token[2]]
And get:
('And', 0, 3)
('now', 4, 7)
('for', 8, 11)
('something', 12, 21)
('completely', 22, 32)
('different', 33, 42)
('and', 43, 46)
('.', 46, 47)
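Applied to the example from the question, the start offsets fall out directly; this is just a usage sketch reusing the spans() generator defined above:
x = 'hello world'
print([start for _, start, _ in spans(x)])
# [0, 6]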