Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Read CSV file with embedded double quotes and commas

Tags:

r

csv

data.table

I am trying to read a dirty CSV file using the fread() function from the data.table package but have a problem with embedded double quotes and commas in the string values, that is, unescaped double quotes present in a quoted field. The following example data illustrates my problem. It consists of 3 lines/rows and 6 columns, with the first line containing the column names:

"SA","SU","CC","CN","POC","PAC"
"NE","R","000","H "B", O","1","8"
"A","A","000","P","E,5","8"

The first problem is in line 2 which has an embedded pair of double quotes and a comma: "H "B", O". The second problem is in line 3 which has a comma within the double quotes: "E,5". I have tried the following:

Attempt 1

library(data.table)
x1 <- fread(file = "example.csv", quote = "\"")

Output:

> x1
     V1 "SA" "SU"   "CC" "CN" "POC" "PAC"
1: "NE"  "R"    0 "H "B"   O"   "1"     8
2:  "A"  "A"    0    "P"   "E    5"     8

Message:

Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.Detected 6 column names but the data has 7 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.

Conclusion: the result is incorrect because it adds a new column V1.

Attempt 2

x2 <- fread(file = "example.csv", quote = "")

Output:

> x2
     V1 "SA"  "SU"   "CC" "CN" "POC" "PAC"
1: "NE"  "R" "000" "H "B"   O"   "1"   "8"
2:  "A"  "A" "000"    "P"   "E    5"   "8"

Message:

Detected 6 column names but the data has 7 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.

Conclusion: the result is incorrect because it adds a new column V1..

Solution?

What I am looking for is a way to get an output similar to

> x3
   SA SU CC       CN POC PAC
1: NE  R  0 H 'B', O   1   8
2:  A  A  0        P E,5   8

preferably using fread(), but other suggestions are welcome.

like image 500
Samuel Avatar asked Oct 23 '18 20:10

Samuel


People also ask

How do you handle double quotes and commas in a CSV file?

Since CSV files use the comma character "," to separate columns, values that contain commas must be handled as a special case. These fields are wrapped within double quotation marks. The first double quote signifies the beginning of the column data, and the last double quote marks the end.

Why are there double quotes in my CSV file?

CSV files use double-quote marks to delimit field values that have spaces, so a value like Santa Claus gets saved as “Santa Claus” to preserve the space. If a field value contains double-quotes, then the double-quotes get doubled-up so the parser can tell the difference.

How do you load a CSV file with commas in the data?

Using the "From Text" feature in Excel Click the Data tab, then From Text. Select the CSV file that has the data clustered into one column. Select Delimited, then make sure the File Origin is Unicode UTF-8. Select Comma (this is Affinity's default list separator).


1 Answers

You could try cleaning your data beforehand and replace the double quotes with single quotes.

x = readLines('my_file.csv')
y = gsub('","', "','", x) # replace double quotes for each field
y = gsub('^"|"$', "'", y) # replace trailing and leading double quotes
z = paste(y, collapse='\n') # turn it back into a table for fread to read
df = fread(z, quote="'")
df

   SA SU CC       CN POC PAC
1: NE  R  0 H "B", O   1   8
2:  A  A  0        P E,5   8

I can't confirm that this is efficient since I don't know how big your file is, but it might be a worthwhile approach.

like image 103
tblznbits Avatar answered Sep 22 '22 08:09

tblznbits