how to use word_tokenize in data frame

I have recently started using the nltk module for text analysis. I am stuck at a point. I want to use word_tokenize on a dataframe, so as to obtain all the words used in a particular row of the dataframe.

data example:
       text
1.   This is a very good site. I will recommend it to others.
2.   Can you please give me a call at 9983938428. have issues with the listings.
3.   good work! keep it up
4.   not a very helpful site in finding home decor. 

expected output:

1.   'This','is','a','very','good','site','.','I','will','recommend','it','to','others','.'
2.   'Can','you','please','give','me','a','call','at','9983938428','.','have','issues','with','the','listings'
3.   'good','work','!','keep','it','up'
4.   'not','a','very','helpful','site','in','finding','home','decor'

Basically, I want to separate all the words and find the length of each text in the dataframe.

I know word_tokenize can do this for a single string, but how do I apply it to the entire dataframe?

Please help!

Thanks in advance...

708

asked Oct 13 '15 08:10

eclairs

1 Answer

You can use the apply method of the DataFrame API:

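import pandas as pd
import nltk  # word_tokenize needs the Punkt data once: nltk.download('punkt') ('punkt_tab' on newer NLTK)

df = pd.DataFrame({'sentences': [
    'This is a very good site. I will recommend it to others.',
    'Can you please give me a call at 9983938428. have issues with the listings.',
    'good work! keep it up']})
# axis=1 passes one row at a time to the lambda, which tokenizes that row's text
df['tokenized_sents'] = df.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)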

Output:

>>> df
                                           sentences  \
0  This is a very good site. I will recommend it ...   
1  Can you please give me a call at 9983938428. h...   
2                              good work! keep it up   

                                     tokenized_sents  
0  [This, is, a, very, good, site, ., I, will, re...  
1  [Can, you, please, give, me, a, call, at, 9983...  
2                      [good, work, !, keep, it, up]

To find the length of each text, use apply with a lambda function again:

df['sents_length'] = df.apply(lambda row: len(row['tokenized_sents']), axis=1)

>>> df
                                           sentences  \
0  This is a very good site. I will recommend it ...   
1  Can you please give me a call at 9983938428. h...   
2                              good work! keep it up   

                                     tokenized_sents  sents_length  
0  [This, is, a, very, good, site, ., I, will, re...            14  
1  [Can, you, please, give, me, a, call, at, 9983...            15  
2                      [good, work, !, keep, it, up]             6
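Since only one column is involved, the same two steps can also be written with Series.apply instead of a row-wise DataFrame.apply. The following is a minimal sketch, not part of the original answer: the helper name add_token_columns and the stand-in simple_tokenize function are hypothetical, and the regex tokenizer is used only so the snippet runs without NLTK data. It does not produce the same tokens as nltk.word_tokenize in every case.

```python
import re

import pandas as pd


def simple_tokenize(text):
    # Hypothetical stand-in tokenizer for illustration only:
    # runs of word characters, or single punctuation marks.
    # It is NOT equivalent to nltk.word_tokenize.
    return re.findall(r"\w+|[^\w\s]", text)


def add_token_columns(df, column, tokenize=simple_tokenize):
    # Work on a copy so the caller's DataFrame is left unchanged.
    out = df.copy()
    # Series.apply calls the tokenizer once per cell of the chosen column.
    out['tokenized_sents'] = out[column].apply(tokenize)
    # The length of each token list is the word count for that row.
    out['sents_length'] = out['tokenized_sents'].apply(len)
    return out


df = pd.DataFrame({'sentences': [
    'good work! keep it up',
    'not a very helpful site in finding home decor.',
]})
result = add_token_columns(df, 'sentences')
print(result[['tokenized_sents', 'sents_length']])
```

To use NLTK here, pass it as the tokenizer: add_token_columns(df, 'sentences', tokenize=nltk.word_tokenize). Applying the function to the column directly avoids building a row object for every row, which tends to matter on larger frames.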

122

answered Sep 20 '22 16:09

ilyakhov

Tags: python, pandas, nltk
