I am new to R and have a very large irregular column in a data frame like this:
x <- data.frame(section = c("BOOK I: Introduction", "Page one: presentation", "Page two: acknowledgments", "MAGAZINE II: Considerations", "Page one: characters", "Page two: index", "BOOK III: General Principles", "BOOK III: General Principles", "Page one: invitation"))
section
BOOK I: Introduction
Page one: presentation
Page two: acknowledgments
MAGAZINE II: Considerations
Page one: characters
Page two: index
BOOK III: General principles
BOOK III: General principles
Page one: invitation
I need to concatenate this column to look like this:
section
BOOK I: Introduction
BOOK I: Introduction / Page one: presentation
BOOK I: Introduction / Page two: acknowledgments
MAGAZINE II: Considerations
MAGAZINE II: Considerations / Page one: characters
MAGAZINE II: Considerations / Page two: index
BOOK III: General Principles
BOOK III: General Principles
BOOK III: General Principles / Page one: invitation
Basically the goal is to extract the value of the upper string based in a condition and then concatenate with the lower actualizing the value with a regex expression, but I really don't know how to do it.
Thanks in advance.
Here is one method:
x <- data.frame(section = c("BOOK I: Introduction", "Page one: presentation", "Page two: acknowledgments", "MAGAZINE II: Considerations", "Page one: characters", "Page two: index", "BOOK III: General Principles", "BOOK III: General Principles", "Page one: invitation"))
x <- dplyr::mutate(x,
isSection = stringr::str_starts(section, "Page", negate = TRUE),
sectionNum = cumsum(isSection)
) |>
dplyr::group_by(sectionNum) |>
dplyr::mutate(newSection = dplyr::if_else(
condition = isSection,
true = section,
false = paste(dplyr::first(section), section, sep = " / ")
)) |>
ungroup()
x
#> # A tibble: 9 × 4
#> section isSection sectionNum newSection
#> <chr> <lgl> <int> <chr>
#> 1 BOOK I: Introduction TRUE 1 BOOK I: Introduction
#> 2 Page one: presentation FALSE 1 BOOK I: Introduction / Page…
#> 3 Page two: acknowledgments FALSE 1 BOOK I: Introduction / Page…
#> 4 MAGAZINE II: Considerations TRUE 2 MAGAZINE II: Considerations
#> 5 Page one: characters FALSE 2 MAGAZINE II: Considerations…
#> 6 Page two: index FALSE 2 MAGAZINE II: Considerations…
#> 7 BOOK III: General Principles TRUE 3 BOOK III: General Principles
#> 8 BOOK III: General Principles TRUE 4 BOOK III: General Principles
#> 9 Page one: invitation FALSE 4 BOOK III: General Principle…
Created on 2022-03-25 by the reprex package (v2.0.1)
Here, we first determine if the section is a section title or a page title and save that as TRUE or FALSE.
Then, we label the pages belonging to a section by using cumsum() (cumulative sum). When we add up TRUE and FALSE values, TRUE (here, sections) become 1 and increment the cumulative sum, but FALSE (here, pages) become 0 and don't increment the cumulative sum, so all of the pages within a specific section receive the same value.
Lastly, we make a new section variable, this time using group_by() and if_else() to conditionally set the value. If isSection is TRUE, we just keep the existing value of section (the section title). If isSection is FALSE, we concatenate the first value of section from the group with the existing value of section, separated by " / ".
using data.table:
library(data.table)
setDT(x)[grepl("^Page.",section)==F, header:=section] %>%
.[,header:=zoo::na.locf(header)] %>%
.[section!=header,header:=paste0(header, " / ",section)] %>%
.[,.(section = header)] %>%
.[]
1: BOOK I: Introduction
2: BOOK I: Introduction / Page one: presentation
3: BOOK I: Introduction / Page two: acknowledgments
4: MAGAZINE II: Considerations
5: MAGAZINE II: Considerations / Page one: characters
6: MAGAZINE II: Considerations / Page two: index
7: BOOK III: General Principles
8: BOOK III: General Principles
9: BOOK III: General Principles / Page one: invitation
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With