I'm a newish R user and I'm currently struggling with how to split the string in each row of a data frame and then create a new row with part of that string (along with modifying the original row). Below is an example, but the actual data set is much bigger.
library(dplyr)
library(stringr)
library(tidyverse)
library(utils)
posts_sentences <- data.frame("element_id" = c(1, 1, 2, 2, 2), "sentence_id" = c(1, 2, 1, 2, 3),
"sentence" = c("You know, when I grew up, I grew up in a very religious family, I had the same sought of troubles people have, I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.", "I keep saying omg!it's too much"),
"sentence_wc" = c(60, 30, 7, 20, 7), stringsAsFactors=FALSE)
I want to break up the sentences that are above a certain word count (15 for this data set) and create new sentences from within the longer ones using regex: first I try to break on periods (or other symbols); if the word count is still too long, I try commas followed by an "I" (or a capital letter); then I try "and" followed by a capital letter, and so on. Every time I create a new sentence, it needs to change the old row's sentence to just the first part of the split and update its word count (I have a function for this), create a new row with the same element_id, the next sentence_id in the sequence (if sentence_id was 1, the new sentence is 2), and the new sentence's word count, and then renumber all the sentences below to the next sentence_id.
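For reference, a minimal version of the word-count function I mention (a hypothetical sketch; it just counts whitespace-delimited tokens with stringr) would be something like:

library(stringr)

# count whitespace-delimited tokens in a character vector
count_wc <- function(x) str_count(x, "\\S+")
count_wc("I feel like I'm going to explode.")
#> [1] 7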
I have been working on this for a few days and can't figure out how to do it. I've tried unnest_tokens, str_split/str_extract, and various dplyr combinations of filter, mutate, etc., along with Google/SO searches. Does anyone know the best way to accomplish this? dplyr is preferred, but I'm open to anything that works. Feel free to ask questions if you need any clarification!
Edit to add the expected output data frame:
expected_output <- data.frame("element_id" = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), "sentence_id" = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6),
"sentence" = c("You know, when I grew up", "I grew up in a very religious family", "I had the same sought of troubles people have", "I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.", "I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.", "I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and", "I don't know who to tell and", "I was going to tell my friend about it but I'm not sure.", "I keep saying omg!", "it's too much"),
"sentence_wc" = c(6, 8, 8, 21, 4, 27, 6, 7, 9, 7, 13, 4, 3), stringsAsFactors=FALSE)
Here is a tidyverse approach that allows you to specify your own heuristics, which I think should be best for your situation. The key is the use of pmap to create a list of one-row data frames that you can then split if necessary with map_if. This is a situation that is hard to handle with dplyr alone, in my opinion, because we're adding rows in our operation, so rowwise is hard to use.
The structure of split_too_long() is basically:

- use dplyr::mutate and tokenizers::count_words to get the word count of each sentence
- turn each row into a one-row tibble with purrr::pmap, which accepts the dataframe as a list of columns as input
- use purrr::map_if to check if the word count is greater than our desired limit
- use tidyr::separate_rows to split the sentence into multiple rows if the above condition is met
- then recount the words and remove any empty rows with filter (created by doubled up separators)
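To make the pmap step concrete, here is a toy illustration (the toy tibble is just for demonstration, separate from the pipeline below) of how it turns a data frame into a list of one-row tibbles that map_if can then dispatch on:

library(tidyverse)

toy <- tibble(id = 1:2, wc = c(3, 20))

# pmap() receives the columns of `toy` as arguments; tibble(...) rebuilds
# each row as a one-row tibble
rows <- toy %>% pmap(function(...) tibble(...))

# map_if() transforms only the elements where the predicate is TRUE,
# here flagging the row whose word count exceeds 15
map_if(rows, ~ .$wc > 15, ~ mutate(., flagged = TRUE)) %>% bind_rows()
#> # A tibble: 2 x 3
#>      id    wc flagged
#>   <int> <dbl> <lgl>
#> 1     1     3 NA
#> 2     2    20 TRUE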
"[\\.\\?\\!] ?"
which matches any of .!?
and an optional space", ?(?=[:upper:])"
which matches ,
, optional space, preceding an uppercase letter"and ?(?=[:upper:])"
which matches and
optional space, preceding an uppercase letter.It correctly returns the same split sentences as in your expected output. The sentence_id
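For example, the comma pattern only splits where the lookahead sees an uppercase letter, so ordinary mid-clause commas are left untouched. A quick check on a made-up string:

library(tidyverse)

tibble(sentence = "I grew up, I had troubles, but we moved around") %>%
  separate_rows(sentence, sep = ", ?(?=[:upper:])")
#> # A tibble: 2 x 1
#>   sentence
#>   <chr>
#> 1 I grew up
#> 2 I had troubles, but we moved around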
It correctly returns the same split sentences as in your expected output. The sentence_id is easy to add back in at the end with row_number, and errant leading/trailing whitespace can be removed with stringr::str_trim.
Caveats:

- Going row by row through pmap and map_if could likely be replaced with a vectorised approach that avoids the map step, which would probably make it faster, though I haven't profiled this on a large dataset.
- The column names sentence and wc are hardcoded into split_too_long at present. I recommend you look into the programming with dplyr vignette if being able to specify column names in the call to the function is important to you (it should only be a few tweaks to achieve it); a sketch of such a tweak appears after the output below.

posts_sentences <- data.frame(
"element_id" = c(1, 1, 2, 2, 2), "sentence_id" = c(1, 2, 1, 2, 3),
"sentence" = c("You know, when I grew up, I grew up in a very religious family, I had the same sought of troubles people have, I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.", "I keep saying omg!it's too much"),
"sentence_wc" = c(60, 30, 7, 20, 7), stringsAsFactors = FALSE
)
library(tidyverse)
library(tokenizers)
split_too_long <- function(df, regexp, max_length) {
  df %>%
    mutate(wc = count_words(sentence)) %>%            # word count per sentence
    pmap(function(...) tibble(...)) %>%               # one-row tibble per input row
    map_if(
      .p = ~ .$wc > max_length,                       # only rows over the limit...
      .f = ~ separate_rows(., sentence, sep = regexp) # ...get split on the separator
    ) %>%
    bind_rows() %>%
    mutate(wc = count_words(sentence)) %>%            # recount after splitting
    filter(wc != 0)                                   # drop empties from doubled-up separators
}
posts_sentences %>%
  group_by(element_id) %>%
  summarise(sentence = str_c(sentence, collapse = ".")) %>% # rejoin each element's full text
  ungroup() %>%
  split_too_long("[\\.\\?\\!] ?", 15) %>%      # pass 1: sentence-ending punctuation
  split_too_long(", ?(?=[:upper:])", 15) %>%   # pass 2: comma before a capital letter
  split_too_long("and ?(?=[:upper:])", 15) %>% # pass 3: "and" before a capital letter
  group_by(element_id) %>%
  mutate(
    sentence = str_trim(sentence),
    sentence_id = row_number()                 # renumber sentences within each element
  ) %>%
  select(element_id, sentence_id, sentence, wc)
#> # A tibble: 13 x 4
#> # Groups: element_id [2]
#> element_id sentence_id sentence wc
#> <dbl> <int> <chr> <int>
#> 1 1 1 You know, when I grew up 6
#> 2 1 2 I grew up in a very religious family 8
#> 3 1 3 I had the same sought of troubles people ~ 9
#> 4 1 4 I was excelling in alot of ways, but beca~ 21
#> 5 1 5 Im at breaking point 4
#> 6 1 6 I have no one to talk to about this and i~ 29
#> 7 1 7 I dont know what to do 6
#> 8 2 1 I feel like I’m going to explode 7
#> 9 2 2 I have so many thoughts and feelings insi~ 8
#> 10 2 3 I don't know who to tell 6
#> 11 2 4 I was going to tell my friend about it bu~ 13
#> 12 2 5 I keep saying omg 4
#> 13 2 6 it's too much 3
Created on 2018-05-21 by the reprex package (v0.2.0).
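As mentioned in the second caveat, here is a minimal sketch of how split_too_long could accept the column name as an argument, using the enquo()/!! pattern from the programming with dplyr vignette (my suggested tweak; I haven't run it against the data above):

library(tidyverse)
library(tokenizers)

split_too_long2 <- function(df, col, regexp, max_length) {
  col <- enquo(col) # capture the unquoted column name once
  df %>%
    mutate(wc = count_words(!!col)) %>%
    pmap(function(...) tibble(...)) %>%
    map_if(
      .p = ~ .$wc > max_length,
      .f = ~ separate_rows(., !!col, sep = regexp)
    ) %>%
    bind_rows() %>%
    mutate(wc = count_words(!!col)) %>%
    filter(wc != 0)
}

# usage: pass the column unquoted
# posts_sentences %>% split_too_long2(sentence, "[\\.\\?\\!] ?", 15)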