
Group data frame by pattern in R

I have an R data frame with hundreds of rows, like this:

word        Freq
seed         4
seeds        3
contract     2
contracting  2
river        1

I would like to group the data by pattern, say seed + seeds ..., so that the result looks like this:

word     Freq
seed      7
contract  4
river     1
Samuel Shamiri asked Oct 26 '15 02:10



3 Answers

Here is another possible approach. The SnowballC package has a function, wordStem(), that reduces words to their stems. Using it, you can skip the manual string manipulation, I think. Once the words are stemmed, all that remains is to sum the word frequencies for each stem.

library(SnowballC)
library(dplyr)

mydf <- read.table(text = "word        Freq
seed         4
seeds        3
contract     2
contracting  2
river        1", header = TRUE, stringsAsFactors = FALSE)

# Replace each word with its stem, then sum the frequencies per stem
mutate(mydf, word = wordStem(word)) %>%
    group_by(word) %>%
    summarise(total = sum(Freq))

#      word total
#     (chr) (int)
#1 contract     4
#2    river     1
#3     seed     7
jazzurro answered Nov 04 '22 00:11



One option would be to create a grouping variable 'gr' by extracting a substring whose length is the minimum number of characters in 'word'. Do this once more on 'word' within each 'gr' group, so that every word is cut to the shortest word of its group, and then get the sum of 'Freq' by 'word'.

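library(dplyr)

# df1 is the data frame from the question; 'word' must be character, not factor
df1 %>%
    # coarse key: first n characters, n = length of the shortest word overall
    group_by(gr = substr(word, 1, min(nchar(word)))) %>%
    # within each 'gr' group, cut every word to that group's shortest word
    # (mutate() is used because current dplyr evaluates expressions inside
    # group_by() on the ungrouped data)
    mutate(word = substr(word, 1, min(nchar(word)))) %>%
    group_by(word) %>%
    summarise(Freq = sum(Freq))
#       word  Freq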
#      (chr) (int)
#1 contract     4
#2    river     1
#3     seed     7
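
For readers who want to follow the two substring steps one at a time, here is a minimal base R sketch of the same idea. It is not the code from this answer: the names n, gr, len, stem and result are made up for illustration, and it assumes 'word' is a character column in which related words share a common prefix.

```r
# Sample data from the question
df1 <- data.frame(
  word = c("seed", "seeds", "contract", "contracting", "river"),
  Freq = c(4, 3, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Step 1: coarse key = first n characters of each word,
# where n is the length of the shortest word in the whole column
n  <- min(nchar(df1$word))
gr <- substr(df1$word, 1, n)

# Step 2: within each coarse group, find the length of the shortest word
# and cut every word in that group down to that length
len <- ave(nchar(df1$word), gr, FUN = min)
df1$stem <- substr(df1$word, 1, len)

# Step 3: sum the frequencies per stem
result <- aggregate(Freq ~ stem, data = df1, FUN = sum)
print(result)
```

This is only a heuristic. If the data also contained "car" and "cart", for example, the coarse key would shrink to three characters and "cart" would be folded into "car".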
akrun answered Nov 04 '22 00:11



You can also do this with a cross join, which is a little bit safer than the method above.

library(dplyr)
library(stringi)

df %>%
  merge(df %>% select(short_word = word) ) %>%
  filter(short_word %>%
           stri_detect_regex(word, .) ) %>%
  group_by(word) %>%
  slice(short_word %>% stri_length %>% which.min) %>%
  group_by(short_word) %>%
  summarise(Freq= sum(Freq)) 
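
The same cross-join idea can be sketched in base R without extra packages. This is an illustrative variant rather than the code above: it swaps the regex test for a literal prefix test with startsWith() (R 3.3.0 or later), assumes each word appears only once in the data, and uses the made-up names pairs, best and result.

```r
# Sample data from the question
df <- data.frame(
  word = c("seed", "seeds", "contract", "contracting", "river"),
  Freq = c(4, 3, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Cross join: pair every word with every candidate short_word
# (merge() with by = NULL returns the Cartesian product)
pairs <- merge(
  df,
  data.frame(short_word = df$word, stringsAsFactors = FALSE),
  by = NULL
)

# Keep only the pairs where word begins with short_word
pairs <- pairs[startsWith(pairs$word, pairs$short_word), ]

# For each word, keep the shortest short_word that matches it
pairs <- pairs[order(pairs$word, nchar(pairs$short_word)), ]
best  <- pairs[!duplicated(pairs$word), ]

# Sum the frequencies per short_word
result <- aggregate(Freq ~ short_word, data = best, FUN = sum)
print(result)
```

Matching on a prefix instead of a regex means a short word such as "river" is not matched inside an unrelated longer word such as "driver".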
bramtayl answered Nov 03 '22 23:11
