I have an R data frame with hundreds of rows, like this:
word Freq
seed 4
seeds 3
contract 2
contracting 2
river 1
I would like to group the data by pattern (say, seed + seeds, and so on) so that the result looks like this:
word Freq
seed 7
contract 4
river 1
Here is potentially another way to go. The SnowballC package has a function, wordStem(), that reduces words to their stems. Using that, you can skip the string manipulation, I think. Once the words are stemmed, all you need to do is sum the word frequencies by stem.
library(SnowballC)
library(dplyr)
mydf <- read.table(text = "word Freq
seed 4
seeds 3
contract 2
contracting 2
river 1", header = TRUE, stringsAsFactors = FALSE)
mutate(mydf, word = wordStem(word)) %>%
group_by(word) %>%
summarise(total = sum(Freq))
# word total
# (chr) (int)
#1 contract 4
#2 river 1
#3 seed 7
One option would be to create a grouping variable 'gr' by extracting a substring of 'word' whose length is the minimum number of characters in 'word'. Do this once more on 'word' within each 'gr' group, so that every word in a group is cut down to the shortest word in that group, and then get the sum of 'Freq' by 'word'.
library(dplyr)
df1 %>%
group_by(gr= substr(word, 1, min(nchar(word)))) %>%
group_by(word= substr(word, 1, min(nchar(word)))) %>%
summarise(Freq= sum(Freq))
# word Freq
# (chr) (int)
#1 contract 4
#2 river 1
#3 seed 7
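To make the two truncation steps above easier to follow, here is a minimal base R sketch of the same idea on the sample data from the question. It is not the answer's dplyr code; the intermediate names gr, root and totals are only illustrative, and ave() and tapply() stand in for the two group_by() calls and summarise().

```r
# Sample data from the question
df1 <- data.frame(
  word = c("seed", "seeds", "contract", "contracting", "river"),
  Freq = c(4, 3, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Step 1: coarse key = first n characters, where n is the length of the
# shortest word in the whole column (4 here, from "seed")
gr <- substr(df1$word, 1, min(nchar(df1$word)))

# Step 2: within each coarse group, cut every word down to the length of
# the shortest word in that group
root <- ave(df1$word, gr, FUN = function(w) substr(w, 1, min(nchar(w))))

# Step 3: sum the frequencies per root
totals <- tapply(df1$Freq, root, sum)
totals
```

Note that the coarse key depends on the shortest word in the whole column, so one very short word in the data changes how all the other words are grouped.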
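You can also do this with a cross join, which is a little safer than the method above. The data frame is joined against its own list of words, each word is matched to the shortest word it contains, and 'Freq' is summed by that shortest word.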
library(dplyr)
library(stringi)
df %>%
merge(df %>% select(short_word = word) ) %>%
filter(short_word %>%
stri_detect_regex(word, .) ) %>%
group_by(word) %>%
slice(short_word %>% stri_length %>% which.min) %>%
group_by(short_word) %>%
summarise(Freq= sum(Freq))
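For readers without dplyr or stringi, a rough base R variation of the same matching idea might look like the sketch below. It differs from the pipeline above in one respect: it uses startsWith(), so a shorter word only counts when it is a prefix, whereas the regex match accepts it anywhere inside the longer word. The helper shortest_prefix and the column root are hypothetical names introduced for this example.

```r
# Sample data from the question
df <- data.frame(
  word = c("seed", "seeds", "contract", "contracting", "river"),
  Freq = c(4, 3, 2, 2, 1),
  stringsAsFactors = FALSE
)

# For one word, return the shortest candidate that is a prefix of it.
# Every word is a prefix of itself, so there is always at least one hit.
shortest_prefix <- function(w, candidates) {
  hits <- candidates[startsWith(w, candidates)]
  hits[which.min(nchar(hits))]
}

# Map each word to its shortest prefix found in the same column
df$root <- vapply(df$word, shortest_prefix, character(1),
                  candidates = df$word, USE.NAMES = FALSE)

# Sum the frequencies per root
result <- aggregate(Freq ~ root, data = df, FUN = sum)
result
```

Because the root is always an actual word from the data, unrelated words that merely share their first few letters are kept apart unless one of them is a prefix of the other.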