 

Split strings into smaller ones to create new rows in a data frame (in R)

Tags:

string

r

dplyr

I'm a newish R user and I'm currently struggling with how to split the string in each row of a data frame and then create a new row with the modified string (along with modifying the original). An example is below, but the actual data set is much bigger.

library(dplyr)
library(stringr)
library(tidyverse)
library(utils)

posts_sentences <- data.frame("element_id" = c(1, 1, 2, 2, 2), "sentence_id" = c(1, 2, 1, 2, 3), 
                "sentence" = c("You know, when I grew up, I grew up in a very religious family, I had the same sought of troubles people have, I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.", "I keep saying omg!it's too much"), 
                "sentence_wc" = c(60, 30, 7, 20, 7), stringsAsFactors=FALSE)

I want to break up sentences that are above a certain word count (15 for this data set) and create new sentences from the longer ones using regex: first I try to break on periods (or other symbols); if the word count is still too long, I try commas followed by an "I" (or another capital letter); then I try "and" followed by a capital letter, and so on. Every time I create a new sentence, the old row's sentence needs to change to just the first part, along with its word count (I have a function for this), and a new row is created with the same element_id, the next sentence_id in the sequence (if sentence_id was 1, the new sentence is 2), and the new sentence's word count; all subsequent sentences for that element are then renumbered to the next sentence_id.

I have been working on this for a few days and can't figure out how to do it. I've tried unnest_tokens, str_split/str_extract, and various dplyr combinations of filter, mutate, etc., along with Google/SO searches. Does anyone know the best way to accomplish this? dplyr is preferred but I'm open to anything that works. Feel free to ask questions if you need any clarification!

Edit to add the expected output data frame:

expected_output <- data.frame("element_id" = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), "sentence_id" = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6), 
                                   "sentence" = c("You know, when I grew up", "I grew up in a very religious family", "I had the same sought of troubles people have", "I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.", "I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.", "I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and", "I don't know who to tell and", "I was going to tell my friend about it but I'm not sure.", "I keep saying omg!", "it's too much"), 
                                   "sentence_wc" = c(6, 8, 8, 21, 4, 27, 6, 7, 9, 7, 13, 4, 3), stringsAsFactors=FALSE)
asked Jan 29 '23 by Ashley

1 Answer

Here is a tidyverse approach that lets you specify your own heuristics, which I think is the best fit for your situation. The key is using pmap to turn each row into a list element that you can then split where necessary with map_if. In my opinion this is a situation that is hard to handle with dplyr alone, because we are adding rows in our operation, so rowwise is hard to use.

The structure of split_too_long() is basically:

  1. Use dplyr::mutate and tokenizers::count_words to get the word count of each sentence.
  2. Make each row an element of a list with purrr::pmap, which accepts the data frame (a list of columns) as input.
  3. Use purrr::map_if to check whether the word count is greater than our desired limit.
  4. Use tidyr::separate_rows to split the sentence into multiple rows if the above condition is met.
  5. Replace the word count with the new word count and drop any empty rows (created by doubled-up separators) with filter.
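The core of steps 2–4 can be illustrated on a toy data frame (this is only a minimal sketch of the idiom, not the full function): each row becomes its own one-row tibble via pmap(), map_if() splits only the rows that satisfy a predicate, and bind_rows() reassembles the result.

```r
library(tidyverse)

df <- tibble(id = 1:2, txt = c("a.b.c", "x"))

out <- df %>%
  pmap(function(...) tibble(...)) %>%          # list of one-row tibbles
  map_if(
    .p = ~ nchar(.$txt) > 1,                   # only the "long" rows...
    .f = ~ separate_rows(., txt, sep = "\\.")  # ...get split into new rows
  ) %>%
  bind_rows()
# id is carried along: row 1 becomes three rows (a, b, c), row 2 stays put
```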

We can then apply this for different separators as we realise that the elements need to be split further. Here I use these patterns corresponding to the heuristics you mention:

  • "[\\.\\?\\!] ?" which matches any of .!? and an optional space
  • ", ?(?=[:upper:])" which matches ,, optional space, preceding an uppercase letter
  • "and ?(?=[:upper:])" which matches and optional space, preceding an uppercase letter.

It correctly returns the same split sentences as in your expected output. The sentence_id is easy to add back in at the end with row_number, and errant leading/trailing whitespace can be removed with stringr::str_trim.

Caveats:

  • I wrote this for readability in exploratory analysis, hence splitting into the lists and binding back together each time. If you decide in advance what separators you want you can put it into one map step which would probably make it faster, though I haven't profiled this on a large dataset.
  • As per comments, there are still sentences with more than 15 words after these splits. You will have to decide what additional symbols/regular expressions you want to split on to get the lengths down more.
  • The column names are hardcoded into split_too_long at present. I recommend you look into the programming with dplyr vignette if being able to specify column names in the call to the function is important to you (it should only take a few tweaks to achieve).

posts_sentences <- data.frame(
  "element_id" = c(1, 1, 2, 2, 2), "sentence_id" = c(1, 2, 1, 2, 3),
  "sentence" = c("You know, when I grew up, I grew up in a very religious family, I had the same sought of troubles people have, I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.", "I keep saying omg!it's too much"),
  "sentence_wc" = c(60, 30, 7, 20, 7), stringsAsFactors = FALSE
)

library(tidyverse)
library(tokenizers)
split_too_long <- function(df, regexp, max_length) {
  df %>%
    mutate(wc = count_words(sentence)) %>%
    pmap(function(...) tibble(...)) %>%
    map_if(
      .p = ~ .$wc > max_length,
      .f = ~ separate_rows(., sentence, sep = regexp)
      ) %>%
    bind_rows() %>%
    mutate(wc = count_words(sentence)) %>%
    filter(wc != 0)
}

posts_sentences %>%
  group_by(element_id) %>%
  summarise(sentence = str_c(sentence, collapse = ".")) %>%
  ungroup() %>%
  split_too_long("[\\.\\?\\!] ?", 15) %>%
  split_too_long(", ?(?=[:upper:])", 15) %>%
  split_too_long("and ?(?=[:upper:])", 15) %>%
  group_by(element_id) %>%
  mutate(
    sentence = str_trim(sentence),
    sentence_id = row_number()
  ) %>%
  select(element_id, sentence_id, sentence, wc)
#> # A tibble: 13 x 4
#> # Groups:   element_id [2]
#>    element_id sentence_id sentence                                      wc
#>         <dbl>       <int> <chr>                                      <int>
#>  1          1           1 You know, when I grew up                       6
#>  2          1           2 I grew up in a very religious family           8
#>  3          1           3 I had the same sought of troubles people ~     9
#>  4          1           4 I was excelling in alot of ways, but beca~    21
#>  5          1           5 Im at breaking point                           4
#>  6          1           6 I have no one to talk to about this and i~    29
#>  7          1           7 I dont know what to do                         6
#>  8          2           1 I feel like I’m going to explode               7
#>  9          2           2 I have so many thoughts and feelings insi~     8
#> 10          2           3 I don't know who to tell                       6
#> 11          2           4 I was going to tell my friend about it bu~    13
#> 12          2           5 I keep saying omg                              4
#> 13          2           6 it's too much                                  3

Created on 2018-05-21 by the reprex package (v0.2.0).
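To address the last caveat above, here is a hypothetical sketch of a generalised version where the target column is passed as a string, so the function is no longer hardcoded to a column named sentence. A simple whitespace word count stands in for tokenizers::count_words to keep the sketch self-contained; the name split_too_long_at and the helper wc_of are my own, not from the answer.

```r
library(tidyverse)

# Naive whitespace word count (placeholder for tokenizers::count_words)
wc_of <- function(x) lengths(strsplit(x, "\\s+"))

# Same logic as split_too_long(), but the column is chosen by name:
# .data[[col]] for mutate, all_of(col) for separate_rows
split_too_long_at <- function(df, col, regexp, max_length) {
  df %>%
    mutate(wc = wc_of(.data[[col]])) %>%
    pmap(function(...) tibble(...)) %>%
    map_if(
      .p = ~ .$wc > max_length,
      .f = ~ separate_rows(., all_of(col), sep = regexp)
    ) %>%
    bind_rows() %>%
    mutate(wc = wc_of(.data[[col]])) %>%
    filter(wc != 0)
}
```

Usage would then be e.g. `split_too_long_at(posts_sentences, "sentence", "[\\.\\?\\!] ?", 15)`.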

answered Feb 02 '23 by Calum You