I have a dataframe that consists of two columns: ID and TEXT. Pretend data is below:
ID TEXT
265 The farmer plants grain. The fisher catches tuna.
456 The sky is blue.
434 The sun is bright.
921 I own a phone. I own a book.
I know that not all nltk functions work directly on dataframes. How could sent_tokenize be applied to the above dataframe?
When I try:
df.TEXT.apply(nltk.sent_tokenize)
The output is unchanged from the original dataframe. My desired output is:
TEXT
The farmer plants grain.
The fisher catches tuna.
The sky is blue.
The sun is bright.
I own a phone.
I own a book.
In addition, I would like to tie this new (desired) dataframe back to the original ID
numbers like this (after further text cleansing):
ID TEXT
265 'farmer', 'plants', 'grain'
265 'fisher', 'catches', 'tuna'
456 'sky', 'blue'
434 'sun', 'bright'
921 'I', 'own', 'phone'
921 'I', 'own', 'book'
This question is related to another of my questions here. Please let me know if I can provide anything to help clarify my question!
Edit: as a result of warranted prodding by @alexis, here is a better response.
Sentence Tokenization
This should get you a DataFrame with one row for each ID & sentence:
import pandas

sentences = []
for row in df.itertuples():
    # row[0] is the index; row[1] is ID and row[2] is TEXT
    for sentence in row[2].split('.'):
        if sentence != '':
            sentences.append((row[1], sentence))
new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])
Whose output looks like this (periods are stripped, and the second sentence of a multi-sentence row keeps a stray leading space from the split):

    ID                   SENTENCE
0  265    The farmer plants grain
1  265    The fisher catches tuna
2  456            The sky is blue
3  434          The sun is bright
4  921             I own a phone
5  921              I own a book
split('.') will quickly break strings up into sentences, provided sentences really are separated by periods and periods are not used for anything else (e.g. denoting abbreviations); it also removes the periods in the process. It will fail if periods serve more than one purpose or if not every sentence ends with one. A slower but much more robust approach is to use, as you asked, sent_tokenize to split each row into sentences:
from nltk import sent_tokenize  # you may need nltk.download('punkt') once beforehand

sentences = []
for row in df.itertuples():
    # sent_tokenize keeps each sentence's trailing period
    for sentence in sent_tokenize(row[2]):
        sentences.append((row[1], sentence))
new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])
This produces the following output:

    ID                  SENTENCE
0  265  The farmer plants grain.
1  265  The fisher catches tuna.
2  456          The sky is blue.
3  434        The sun is bright.
4  921            I own a phone.
5  921             I own a book.
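To make the robustness difference concrete, here is a small comparison; the 'Dr.' abbreviation example is my own, and punkt's exact splits can vary slightly by NLTK version:

text = 'Dr. Smith plants grain. The fisher catches tuna.'
print(text.split('.'))
# ['Dr', ' Smith plants grain', ' The fisher catches tuna', '']
print(sent_tokenize(text))
# ['Dr. Smith plants grain.', 'The fisher catches tuna.']
# punkt usually recognizes common abbreviations like 'Dr.', where a bare
# split('.') cannot.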
If you want to quickly remove periods from these lines you could do something like:
new_df['SENTENCE_noperiods'] = new_df.SENTENCE.apply(lambda x: x.strip('.'))
Which would yield the same table with an added SENTENCE_noperiods column holding each sentence minus its trailing period.
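As a side note, the same cleanup can be done with pandas' vectorized string methods instead of a Python-level lambda (a minor stylistic alternative, not from the original answer):

# Strip leading/trailing periods from every sentence at once
new_df['SENTENCE_noperiods'] = new_df.SENTENCE.str.strip('.')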
You can also take the apply -> map approach (df is your original table):
df = df.join(df.TEXT.apply(sent_tokenize).rename('SENTENCES'))
Yielding a copy of df with a new SENTENCES column, where each cell holds that row's list of sentences.
Continuing:
# Expand each row's sentence list into its own set of columns (0, 1, ...)
sentences = df.SENTENCES.apply(pandas.Series)
# Rename the integer columns to 'sentence 1', 'sentence 2', ...
sentences.columns = ['sentence {}'.format(n + 1) for n in sentences.columns]
This yields a table with columns sentence 1 and sentence 2, one column per sentence position, padded with NaN where a row has only one sentence.
As our indices have not changed, we can join this back into our original table:
df = df.join(sentences)
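If you ever need to get from this wide layout back to the one-row-per-sentence format, and your pandas is 0.25 or newer, DataFrame.explode does it in one step; a sketch, assuming the SENTENCES column built above:

# Explode each list in SENTENCES into its own row, repeating the ID
long_df = (
    df[['ID', 'SENTENCES']]
    .explode('SENTENCES')
    .rename(columns={'SENTENCES': 'SENTENCE'})
    .reset_index(drop=True)
)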
Word Tokenization
Continuing with df from above, we can extract the tokens in a given sentence as follows:
from nltk import word_tokenize

df['sent_1_words'] = df['sentence 1'].apply(word_tokenize)
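Finally, to approximate the cleansed output from the question, you can tokenize the long-format new_df built earlier and filter the tokens. The stop list below is hand-picked to reproduce the question's example; for real use you would likely substitute nltk.corpus.stopwords:

from nltk import word_tokenize

# Hand-picked words to drop for this example; 'I' is deliberately kept,
# matching the desired output in the question.
stops = {'the', 'a', 'is'}

new_df['TOKENS'] = new_df.SENTENCE.apply(
    lambda s: [w for w in word_tokenize(s)
               if w.isalpha() and w.lower() not in stops]
)
# new_df[['ID', 'TOKENS']]:
#     ID                   TOKENS
# 0  265  [farmer, plants, grain]
# 1  265  [fisher, catches, tuna]
# 2  456              [sky, blue]
# 3  434            [sun, bright]
# 4  921          [I, own, phone]
# 5  921           [I, own, book]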