read_csv() parsing error message, how to interpret?

I am in the middle of parsing a large amount of CSV data. The data is rather "dirty": it has inconsistent delimiters, spurious characters, and format issues that cause problems for read_csv().

My problem here, however, is not the dirtiness of the data, but just trying to understand the parsing errors that read_csv() is giving me. If I can better understand the error messages, I can then do some janitorial work to fix the problem with scripts. The size of the data makes a manual approach intractable.

Here's a minimal example. Suppose I have a csv file like this:

"col_a","col_b","col_c"
"1","a quick","10"
"2","a quick "brown" fox","20"
"3","quick, brown fox","30"

Note that there are spurious quotes around "brown" in the 2nd row. This content goes into a file called "my_data.csv".

When I try to read that file, I get some parsing failures.

> library(tidyverse)
> df <- read_csv("./my_data.csv", col_types = cols(.default = "c"))
Warning: 2 parsing failures.
row # A tibble: 2 x 5 col     row   col           expected actual            file expected   <int> <chr>              <chr>  <chr>           <chr> actual 1     2 col_b delimiter or quote      b './my_data.csv' file 2     2 col_b delimiter or quote        './my_data.csv'

As you can see, the parsing failure has not been "pretty printed". It is ONE LONG LINE of 271 characters.

I can't figure out where to even put linebreaks in the failure message to see where the problem is and what the message is trying to tell me. Moreover, it refers to a "2x5 tibble". What tibble? My data frame is 3x3.

Can someone show me how to format or put linebreaks in the message from read_csv() so I can see how it is detecting the problem?

Yes, I know what the problem is in this particular minimal example. In my actual data I am dealing with large amounts of CSV (~1M rows), peppered with inconsistencies that shower me with hundreds of parsing failures. I'd like to set up a workflow for categorizing these and dealing with them programmatically. The first step, I think, is just understanding how to "parse" the parsing failure message.

522

asked Oct 16 '17 20:10

Angelo

1 Answer

After taking a breath and looking at the actual documentation, I see there is a way to get the parsing failures from read_csv() in a form that is very usable.

All I had to do to get the parsing failures was to use problems().

> library(tidyverse)
> df <- read_csv("./my_data.csv", col_types = cols(.default = "c"))
Warning: 2 parsing failures.
row # A tibble: 2 x 5 col     row   col           expected actual            file expected   <int> <chr>              <chr>  <chr>           <chr> actual 1     2 col_b delimiter or quote      b './my_data.csv' file 2     2 col_b delimiter or quote        './my_data.csv'

> parsing_failures <- problems(df)
> parsing_failures
# A tibble: 2 x 5
    row   col           expected actual            file
  <int> <chr>              <chr>  <chr>           <chr>
1     2 col_b delimiter or quote      b './my_data.csv'
2     2 col_b delimiter or quote        './my_data.csv'

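Apparently read_csv() attaches a tibble of parsing-failure details to the data frame it returns, and that tibble is accessible by passing the result of read_csv() to problems(). This is also the "2 x 5 tibble" the warning refers to: two failures, each described by five columns (row, col, expected, actual, file).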
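Since problems() returns an ordinary tibble, the failures can be summarized like any other data. Below is a minimal sketch of that idea, not a definitive workflow. It recreates the sample file in a temporary location and counts failures with dplyr; the names path, failures, failure_summary, and bad_rows are my own, chosen for illustration. The number of failures reported, and how the row index is counted, may differ between readr versions, so check the output against your own file.

```r
library(readr)
library(dplyr)

# Recreate the sample file from the question in a temporary location
# (the temp path is an assumption; the question uses "./my_data.csv").
path <- tempfile(fileext = ".csv")
writeLines(c(
  '"col_a","col_b","col_c"',
  '"1","a quick","10"',
  '"2","a quick "brown" fox","20"',
  '"3","quick, brown fox","30"'
), path)

# Read every column as character, as in the question.
df <- read_csv(path, col_types = cols(.default = "c"))

# problems() returns the failure details as a regular tibble.
failures <- problems(df)

# Count failures by column and by what the parser expected,
# most frequent combination first.
failure_summary <- failures %>%
  count(col, expected, sort = TRUE)

# Distinct row indices that have at least one reported failure.
bad_rows <- sort(unique(failures$row))

print(failure_summary)
print(bad_rows)
```

With hundreds of failures, failure_summary gives a short list of failure categories to write cleanup scripts against, and bad_rows points at the records to inspect first.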

193

answered Oct 14 '22 14:10

Angelo


Tags: parsing, r, csv, tidyverse, readr
