read_csv() parsing error message, how to interpret?

I am in the middle of parsing a large amount of csv data. The data is rather "dirty": it has inconsistent delimiters, spurious characters, and format issues that cause problems for read_csv().

My problem here, however, is not the dirtiness of the data, but just trying to understand the parsing errors that read_csv() is giving me. If I can better understand the error messages, I can then do some janitorial work to fix the problem with scripts. The size of the data makes a manual approach intractable.

Here's a minimal example. Suppose I have a csv file like this:

"col_a","col_b","col_c"
"1","a quick","10"
"2","a quick "brown" fox","20"
"3","quick, brown fox","30"

Note that there are spurious quotes around "brown" in the 2nd row. This content goes into a file called "my_data.csv".
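For anyone who wants to reproduce this, here is a minimal sketch that writes the sample file from R. It uses only base R; writing into the current working directory is an assumption made so the path matches the read_csv() call in this question.

```r
# Write the sample data, including the stray quotes around "brown" on the
# second data row, to my_data.csv in the current working directory.
csv_lines <- c(
  '"col_a","col_b","col_c"',
  '"1","a quick","10"',
  '"2","a quick "brown" fox","20"',
  '"3","quick, brown fox","30"'
)
writeLines(csv_lines, "my_data.csv")

# Sanity check: one header line plus three data lines.
n_lines <- length(readLines("my_data.csv"))
print(n_lines)
```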

When I try to read that file, I get some parsing failures.

> library(tidyverse)
> df <- read_csv("./my_data.csv", col_types = cols(.default = "c"))
Warning: 2 parsing failures.
row # A tibble: 2 x 5 col     row   col           expected actual            file expected   <int> <chr>              <chr>  <chr>           <chr> actual 1     2 col_b delimiter or quote      b './my_data.csv' file 2     2 col_b delimiter or quote        './my_data.csv'

As you can see, the parsing failure has not been "pretty printed". It is ONE LONG LINE of 271 characters.

I can't figure out where to even put linebreaks in the failure message to see where the problem is and what the message is trying to tell me. Moreover, it refers to a "2x5 tibble". What tibble? My data frame is 3x3.

Can someone show me how to format or put linebreaks in the message from read_csv() so I can see how it is detecting the problem?

Yes, I know what the problem is in this particular minimal example. In my actual data I am dealing with large amounts of csv (~1M rows), peppered with inconsistencies that shower me with hundreds of parsing failures. I'd like to set up a workflow for categorizing these and dealing with them programmatically. The first step, I think, is just understanding how to "parse" the parsing failure message.

Angelo asked Oct 16 '17 20:10


1 Answer

After taking a breath and looking at the actual documentation, I see there is a way to get the parsing failures from read_csv() in a form that is very usable.

All I had to do to get the parsing failures was to use problems().

> library(tidyverse)
> df <- read_csv("./my_data.csv", col_types = cols(.default = "c"))
Warning: 2 parsing failures.
row # A tibble: 2 x 5 col     row   col           expected actual            file expected   <int> <chr>              <chr>  <chr>           <chr> actual 1     2 col_b delimiter or quote      b './my_data.csv' file 2     2 col_b delimiter or quote        './my_data.csv'

> parsing_failures <- problems(df)
> parsing_failures
# A tibble: 2 x 5
    row   col           expected actual            file
  <int> <chr>              <chr>  <chr>           <chr>
1     2 col_b delimiter or quote      b './my_data.csv'
2     2 col_b delimiter or quote        './my_data.csv'

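Apparently read_csv() attaches a tibble of parsing failure details to the data frame it returns, and that tibble can be retrieved by passing the result of read_csv() to problems(). This is the "2 x 5 tibble" mentioned in the warning: one row per failure, with the columns row, col, expected, actual and file.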
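Since the failures come back as an ordinary tibble, they can be summarized like any other data frame. Below is a rough sketch of one way to start categorizing them. It assumes the readr and dplyr packages are installed; the names failure_summary and bad_rows are made up for illustration, and the sample file is recreated in a temporary location so the snippet runs on its own. The number and wording of the failures reported may differ between readr versions, so no expected output is shown.

```r
library(readr)
library(dplyr)

# Recreate the sample file from the question in a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c(
  '"col_a","col_b","col_c"',
  '"1","a quick","10"',
  '"2","a quick "brown" fox","20"',
  '"3","quick, brown fox","30"'
), path)

# Read everything as character, as in the question; the warning is
# suppressed because the details are pulled out with problems() below.
df <- suppressWarnings(read_csv(path, col_types = cols(.default = "c")))

# One row per parsing failure, with columns row, col, expected, actual, file.
failures <- problems(df)

# Tally failures by column and by what the parser expected,
# most frequent combination first.
failure_summary <- count(failures, col, expected, sort = TRUE)
print(failure_summary)

# Row numbers that have at least one failure, for targeted cleanup.
bad_rows <- sort(unique(failures$row))
print(bad_rows)
```

With hundreds of failures, the summary gives a short list of failure types to write cleanup scripts for, and bad_rows points at the rows to inspect.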

Angelo answered Oct 14 '22 14:10