Tidying datasets with multiple sections/headers at variable positions

Question

Context

I am trying to read in and tidy an Excel file with multiple headers/sections placed at variable positions. The content of these headers needs to be added as a variable. The input files are relatively large Excel files that were formatted for human readability and little else.

Input:

Let's say the data set contains the distribution of car types (by the fuel they use) for a number of cities. As you will see, in the original file the name of the city is used as a header (or divider, if you will). We need this header as a variable. Unfortunately, not all types are listed for every city and some values are missing. Here's a fictional example set:

 df <- data.frame(
   col1 = c("Seattle", "Diesel", "Gasoline", "LPG", "Electric",
            "Boston", "Diesel", "Gasoline", "Electric"),
   col2 = c(NA, 80, NA, 10, 10, NA, 65, 25, 10)
 )

      col1 col2
1  Seattle   NA
2   Diesel   80
3 Gasoline   NA
4      LPG   10
5 Electric   10
6   Boston   NA
7   Diesel   65
8 Gasoline   25
9 Electric   10

Desired result:

     city     type value
1 Seattle   Diesel    80
2 Seattle Gasoline    NA
3 Seattle      LPG    10
4 Seattle Electric    10
5  Boston   Diesel    65
6  Boston Gasoline    25
7  Boston Electric    10

My attempt:

The closest I got was using dplyr's dense_rank() and lag(), but this was not an ideal solution.

Any input is greatly appreciated!

camille · Accepted Answer

Assuming you have a finite list of measures (diesel, electric, etc.), you can make a vector to check against. Any value of col1 that is not in that set of measures is presumably a city. Copy those values into a new city column (note that col1 is a factor under R versions before 4.0, where data.frame() converts strings to factors by default, so I used as.character), fill down, and remove the heading rows.

library(dplyr)

meas <- c("Diesel", "Gasoline", "LPG", "Electric")

df %>%
  mutate(city = ifelse(!col1 %in% meas, as.character(col1), NA)) %>%
  tidyr::fill(city) %>%
  filter(col1 != city)
#>       col1 col2    city
#> 1   Diesel   80 Seattle
#> 2 Gasoline   NA Seattle
#> 3      LPG   10 Seattle
#> 4 Electric   10 Seattle
#> 5   Diesel   65  Boston
#> 6 Gasoline   25  Boston
#> 7 Electric   10  Boston
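
The output above still carries the original column names and order. As a minimal sketch, assuming the same list of measures, the pipeline can be extended with a renaming select() to reach the layout from the question. The column names city, type and value are taken from the desired result; the object name tidy and the use of if_else() and filter(col1 %in% meas) are illustrative choices on my part, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Example data from the question; stringsAsFactors = FALSE keeps col1 as
# character on any R version, so no as.character() conversion is needed.
df <- data.frame(
  col1 = c("Seattle", "Diesel", "Gasoline", "LPG", "Electric",
           "Boston", "Diesel", "Gasoline", "Electric"),
  col2 = c(NA, 80, NA, 10, 10, NA, 65, 25, 10),
  stringsAsFactors = FALSE
)

# Known fuel types; anything else in col1 is assumed to be a city header.
meas <- c("Diesel", "Gasoline", "LPG", "Electric")

tidy <- df %>%
  # Header rows get the city name, measure rows get NA.
  mutate(city = if_else(col1 %in% meas, NA_character_, col1)) %>%
  # Carry each city name down to the rows beneath it.
  fill(city) %>%
  # Drop the header rows themselves.
  filter(col1 %in% meas) %>%
  # Reorder and rename to match the desired result.
  select(city, type = col1, value = col2)

tidy
```

Filtering on col1 %in% meas rather than col1 != city gives the same rows here. Detecting headers by a missing col2 would not work for this data, because the Seattle Gasoline row is also NA.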

Tags: r, dplyr

Fheylen
