Run nltk sent_tokenize through Pandas dataframe

I have a dataframe that consists of two columns: ID and TEXT. Sample data is below:

ID      TEXT
265     The farmer plants grain. The fisher catches tuna.
456     The sky is blue.
434     The sun is bright.
921     I own a phone. I own a book.

I know that nltk functions do not operate on dataframes directly. How could sent_tokenize be applied to the above dataframe?

When I try:

df.TEXT.apply(nltk.sent_tokenize)  

The output is unchanged from the original dataframe. My desired output is:

TEXT
The farmer plants grain.
The fisher catches tuna.
The sky is blue.
The sun is bright.
I own a phone.
I own a book.

In addition, I would like to tie this new (desired) dataframe back to the original ID numbers like this (after further text cleansing):

ID    TEXT
265     'farmer', 'plants', 'grain'
265     'fisher', 'catches', 'tuna'
456     'sky', 'blue'
434     'sun', 'bright'
921     'I', 'own', 'phone'
921     'I', 'own', 'book'

This question is related to another of my questions here. Please let me know if I can provide anything to help clarify my question!

Asked Mar 10 '23 by OverflowingTheGlass

1 Answer

Edit: as a result of warranted prodding by @alexis, here is a better response.

Sentence Tokenization

This should get you a DataFrame with one row for each ID & sentence:

import pandas

sentences = []
for row in df.itertuples():
    # row[0] is the index, row[1] is ID, row[2] is TEXT
    for sentence in row[2].split('.'):
        if sentence != '':
            sentences.append((row[1], sentence))
new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])

Its output looks like this (note that each second sentence keeps a leading space left over from the split):

ID      SENTENCE
265     The farmer plants grain
265     The fisher catches tuna
456     The sky is blue
434     The sun is bright
921     I own a phone
921     I own a book

split('.') will quickly break strings up into sentences if the sentences really are separated by periods and periods are not used for anything else (e.g. denoting abbreviations), and it removes the periods in the process. It will fail if periods serve multiple purposes and/or not all sentence endings are marked by a period. A slower but much more robust approach is to use sent_tokenize, as you asked, to split the rows up by sentence:

from nltk.tokenize import sent_tokenize  # requires the punkt model: nltk.download('punkt')

sentences = []
for row in df.itertuples():
    # row[1] is ID, row[2] is TEXT, as before
    for sentence in sent_tokenize(row[2]):
        sentences.append((row[1], sentence))
new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])

This produces the following output:

ID      SENTENCE
265     The farmer plants grain.
265     The fisher catches tuna.
456     The sky is blue.
434     The sun is bright.
921     I own a phone.
921     I own a book.
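To illustrate the robustness difference, consider a hypothetical sentence containing an abbreviation (the text below is made up; the punkt model is typically able to recognize common abbreviations such as "Dr."):

from nltk.tokenize import sent_tokenize

text = "Dr. Smith owns a phone. He also owns a book."
print(text.split('.'))
# ['Dr', ' Smith owns a phone', ' He also owns a book', '']
print(sent_tokenize(text))
# ['Dr. Smith owns a phone.', 'He also owns a book.']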

If you want to quickly remove the periods from these sentences, you could do something like:

new_df['SENTENCE_noperiods'] = new_df.SENTENCE.apply(lambda x: x.strip('.'))

Which would yield:

ID      SENTENCE                    SENTENCE_noperiods
265     The farmer plants grain.    The farmer plants grain
265     The fisher catches tuna.    The fisher catches tuna
456     The sky is blue.            The sky is blue
434     The sun is bright.          The sun is bright
921     I own a phone.              I own a phone
921     I own a book.               I own a book
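As an aside, pandas' string accessor does the same thing without a lambda, operating on the whole column at once:

new_df['SENTENCE_noperiods'] = new_df.SENTENCE.str.strip('.')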

You can also take the apply -> map approach (df is your original table):

# tokenize each TEXT value into a list of sentences, then join that Series
# back onto df as a new SENTENCES column (indices are unchanged)
df = df.join(df.TEXT.apply(sent_tokenize).rename('SENTENCES'))

Yielding:

ID      TEXT                                                SENTENCES
265     The farmer plants grain. The fisher catches tuna.   [The farmer plants grain., The fisher catches tuna.]
456     The sky is blue.                                    [The sky is blue.]
434     The sun is bright.                                  [The sun is bright.]
921     I own a phone. I own a book.                        [I own a phone., I own a book.]
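As a minimal sketch of an alternative, assuming pandas 0.25 or newer, DataFrame.explode can flatten that list column straight into the long one-row-per-sentence format you asked for, keeping the ID association:

# one row per (ID, sentence); explode requires pandas >= 0.25
long_df = (df[['ID', 'SENTENCES']]
           .explode('SENTENCES')
           .rename(columns={'SENTENCES': 'SENTENCE'})
           .reset_index(drop=True))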

Continuing:

# expand each list of sentences into its own numbered columns (0, 1, ...)
sentences = df.SENTENCES.apply(pandas.Series)
# relabel the numeric columns as 'sentence 1', 'sentence 2', ...
sentences.columns = ['sentence {}'.format(n + 1) for n in sentences.columns]

This yields:

sentence 1                  sentence 2
The farmer plants grain.    The fisher catches tuna.
The sky is blue.            NaN
The sun is bright.          NaN
I own a phone.              I own a book.

As our indices have not changed, we can join this back into our original table:

df = df.join(sentences)

df now holds the original ID, TEXT, and SENTENCES columns alongside the new sentence 1 and sentence 2 columns.

Word Tokenization

Continuing with df from above, we can extract the tokens in a given sentence as follows:

from nltk.tokenize import word_tokenize

df['sent_1_words'] = df['sentence 1'].apply(word_tokenize)

This adds a sent_1_words column of token lists, e.g. [The, farmer, plants, grain, .] for the first row (note that word_tokenize keeps the trailing period as its own token).
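Finally, to get all the way to the long format in your question (one row per sentence, with cleaned word lists tied back to ID), here is a minimal sketch; the stopword set below is a toy list chosen to reproduce your example output exactly, and in practice you would likely use nltk.corpus.stopwords instead:

import pandas
from nltk.tokenize import sent_tokenize, word_tokenize

# toy stopword list matching the example output; swap in nltk.corpus.stopwords for real use
STOPWORDS = {'the', 'a', 'is'}

rows = []
for row in df.itertuples():
    for sentence in sent_tokenize(row.TEXT):
        # keep alphabetic tokens that are not stopwords (this also drops the '.' token)
        words = [w for w in word_tokenize(sentence)
                 if w.isalpha() and w.lower() not in STOPWORDS]
        rows.append((row.ID, words))
tokens_df = pandas.DataFrame(rows, columns=['ID', 'TEXT'])

tokens_df then matches the desired output in the question, e.g. the row (921, ['I', 'own', 'phone']).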

Answered Mar 19 '23 by abe