Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to read csv file with column containing unqouted newlines?

Tags:

r

readr

library(tidyverse)

I have this stupid csv file where someone forgot to quote the string in the last column which may contain newlines. (The 2nd row in this example doesn’t)

csv_file <- str_c(
  "a,b,c\n",
  "1,1,first\nrow\n",
  "1,1,second row\n",
  "1,1,third\nrow\n",
  collapse = ""
)

Trying to read the file with read_csv() gives rise to problems.

read_csv(csv_file)
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> Rows: 5 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): a, c
#> dbl (1): b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 5 × 3
#>   a         b c         
#>   <chr> <dbl> <chr>     
#> 1 1         1 first     
#> 2 row      NA <NA>      
#> 3 1         1 second row
#> 4 1         1 third     
#> 5 row      NA <NA>

How can one parse the file to get the expected results?

tibble(
  a = c(1, 1, 1),
  b = c(1, 1, 1),
  c = c("first\nrow", "second row", "third\nrow")
)
#> # A tibble: 3 × 3
#>       a     b c           
#>   <dbl> <dbl> <chr>       
#> 1     1     1 "first\nrow"
#> 2     1     1 "second row"
#> 3     1     1 "third\nrow"

Created on 2023-01-06 with reprex v2.0.2

like image 842
Peter H. Avatar asked Oct 12 '25 00:10

Peter H.


2 Answers

As long as the first column doesn't contain commas, one method would be to replace all newlines with a tab character, and then use gsub to replace any tab that isn't followed by another tab before the next comma. Once you have read in the data, you can easily gsub the tabs in the final column back into newlines:

read.csv(text = gsub('\t([[:alnum:]]+),', '\n\\1,', 
                     gsub('\n', '\t', csv_file))) %>%
  as_tibble() %>%
  mutate(c = gsub('\t', '\n', c))
#> # A tibble: 3 x 3
#>       a     b c             
#>   <int> <int> <chr>         
#> 1     1     1 "first\nrow"  
#> 2     1     1 "second row"  
#> 3     1     1 "third\nrow\n"

Once you have read the data in you can easily gsub the tabs in the final column into newlines


Or alternatively with stringr

csv_file |> 
  str_replace_all(
    regex("\n(\\w+)(?=\n)", multiline = TRUE), "|\\1"
  ) |> 
  read_csv() |> 
  mutate(
    c = str_replace(c, "[|]", "\n")
  )
like image 152
Allan Cameron Avatar answered Oct 14 '25 12:10

Allan Cameron


Another option is to use readLines() to read in the data, build an index of fields where there are no commas and use that to concatenate the fields together.

txt <- readLines(textConnection(csv_file))
idx <- cumsum(grepl(",", txt))
readr::read_csv(I(tapply(txt, idx, paste, collapse = " ")), show_col_types = FALSE)

# A tibble: 3 × 3                                                                                                                                  
      a     b c         
  <dbl> <dbl> <chr>     
1     1     1 first row 
2     1     1 second row
3     1     1 third row 
like image 42
Ritchie Sacramento Avatar answered Oct 14 '25 14:10

Ritchie Sacramento



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!