
How to separate a sentence into words [duplicate]

Tags: r, words, sentence

In R, I'm working with datasets of conversations. The data currently looks like this:

Mike, "Hello how are you"
Sally, "Good you"

I plan to eventually create a word cloud from this data, so I need one word per row, like this:

Mike, Hello
Mike, how
Mike, are
Mike, you
Sally, good
Sally, you
Bradley Erickson asked Dec 09 '25 17:12


2 Answers

Perhaps something like this using reshape2::melt?

# Sample data
df <- read.csv(text =
    'Mike, "Hello how are you"
    Sally, "Good you"', header = FALSE)

# Split each sentence on whitespace; name each list entry after its speaker
lst <- strsplit(trimws(as.character(df[, 2])), "\\s")
names(lst) <- trimws(df[, 1])

# Reshape into a long data.frame, then reorder columns to (speaker, word)
library(reshape2)
df.long <- melt(lst)[2:1]
#     L1 value
#1  Mike Hello
#2  Mike   how
#3  Mike   are
#4  Mike   you
#5 Sally  Good
#6 Sally   you

Explanation: Trim leading and trailing whitespace from the entries in the second column with trimws, split them on whitespace (\\s), and store the result in a list. Name the list entries after the first column, then reshape the list into a long data.frame using reshape2::melt.

I leave turning this into a comma-separated data.frame up to you...
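For that last step, here is a minimal base R sketch that needs no extra packages. It is an illustration, not part of the original answer: the column names `speaker` and `sentence` and the directly constructed input data.frame are assumptions chosen to mirror the question's data.

```r
# Hypothetical input mirroring the question's data (column names are assumed)
df <- data.frame(
  speaker  = c("Mike", "Sally"),
  sentence = c("Hello how are you", "Good you"),
  stringsAsFactors = FALSE
)

# Split each trimmed sentence on runs of whitespace; gives one character vector per row
words <- strsplit(trimws(df$sentence), "\\s+")

# Repeat each speaker once per word, then flatten the list of words
long <- data.frame(
  speaker = rep(df$speaker, lengths(words)),
  word    = unlist(words),
  stringsAsFactors = FALSE
)

# Build one "speaker, word" line per row and print them
lines <- paste(long$speaker, long$word, sep = ", ")
writeLines(lines)
```

If a file is wanted instead of printed lines, `write.csv(long, row.names = FALSE)` would probably be the more usual route.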

Maurits Evers answered Dec 11 '25 10:12


Use a tokenizer, e.g. via tidytext::unnest_tokens:

library(tidyverse)
library(tidytext)

dialogue <- read_csv(
    'Mike, "Hello how are you"
     Sally, "Good you"', 
    col_names = c('speaker', 'sentence')
)

dialogue %>% unnest_tokens(word, sentence)
#> # A tibble: 6 x 2
#>   speaker  word
#>     <chr> <chr>
#> 1    Mike hello
#> 2    Mike   how
#> 3    Mike   are
#> 4    Mike   you
#> 5   Sally  good
#> 6   Sally   you
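Note that the output above is lowercased, because `unnest_tokens` lowercases tokens by default. If the original capitalization matters, the `to_lower` argument can be switched off. The sketch below assumes the tidytext package is installed and builds the input directly as a data.frame (an assumption made here to keep the example self-contained):

```r
library(tidytext)

# Hypothetical input mirroring the question's data
dialogue <- data.frame(
  speaker  = c("Mike", "Sally"),
  sentence = c("Hello how are you", "Good you"),
  stringsAsFactors = FALSE
)

# to_lower = FALSE keeps each word's original case
out <- unnest_tokens(dialogue, word, sentence, to_lower = FALSE)

out$word
```

For a word cloud the default lowercasing is often what you want, since it stops "Good" and "good" from being counted as different words.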
alistaire answered Dec 11 '25 10:12


