How to read csv file with column containing unqouted newlines?

Question

library(tidyverse)

I have this stupid csv file where someone forgot to quote the string in the last column which may contain newlines. (The 2nd row in this example doesn’t)

csv_file <- str_c(
  "a,b,c
",
  "1,1,first
row
",
  "1,1,second row
",
  "1,1,third
row
",
  collapse = ""
)

Trying to read the file with read_csv() gives rise to problems.

read_csv(csv_file)
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> Rows: 5 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): a, c
#> dbl (1): b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 5 × 3
#>   a         b c         
#>   <chr> <dbl> <chr>     
#> 1 1         1 first     
#> 2 row      NA <NA>      
#> 3 1         1 second row
#> 4 1         1 third     
#> 5 row      NA <NA>

How can one parse the file to get the expected results?

tibble(
  a = c(1, 1, 1),
  b = c(1, 1, 1),
  c = c("first
row", "second row", "third
row")
)
#> # A tibble: 3 × 3
#>       a     b c           
#>   <dbl> <dbl> <chr>       
#> 1     1     1 "first
row"
#> 2     1     1 "second row"
#> 3     1     1 "third
row"

^{Created on 2023-01-06 with reprex v2.0.2}

Allan Cameron · Accepted Answer

As long as the first column doesn't contain commas, one method would be to replace all newlines with a tab character, and then use gsub to replace any tab that isn't followed by another tab before the next comma. Once you have read in the data, you can easily gsub the tabs in the final column back into newlines:

read.csv(text = gsub('	([[:alnum:]]+),', '
\1,', 
                     gsub('
', '	', csv_file))) %>%
  as_tibble() %>%
  mutate(c = gsub('	', '
', c))
#> # A tibble: 3 x 3
#>       a     b c             
#>   <int> <int> <chr>         
#> 1     1     1 "first
row"  
#> 2     1     1 "second row"  
#> 3     1     1 "third
row
"

Once you have read the data in you can easily gsub the tabs in the final column into newlines

Or alternatively with stringr

csv_file |> 
  str_replace_all(
    regex("
(\w+)(?=
)", multiline = TRUE), "|\1"
  ) |> 
  read_csv() |> 
  mutate(
    c = str_replace(c, "[|]", "
")
  )

Ritchie Sacramento · Answer

Another option is to use readLines() to read in the data, build an index of fields where there are no commas and use that to concatenate the fields together.

txt <- readLines(textConnection(csv_file))
idx <- cumsum(grepl(",", txt))
readr::read_csv(I(tapply(txt, idx, paste, collapse = " ")), show_col_types = FALSE)

# A tibble: 3 × 3                                                                                                                                  
      a     b c         
  <dbl> <dbl> <chr>     
1     1     1 first row 
2     1     1 second row
3     1     1 third row

How to read csv file with column containing unqouted newlines?

Tags:

r

readr

Peter H.

2 Answers

Allan Cameron

Ritchie Sacramento

Recent Activity

Donate For Us

How to read csv file with column containing unqouted newlines?

Tags:

r

readr

Peter H.

2 Answers

Allan Cameron

Ritchie Sacramento

Related questions

Recent Activity

Donate For Us