Using fread to read data with double quotes and incorrect escape characters

Tags:

It seems that the problem in my file occurs when records contain the string \\" (litteraly, not regular expression). Apparently, there's one backslash too many, causing fread() to misinterpret a double quote as the end of a string, where it should have been taken litteraly.

My best solutions so far is to do this:

m1 <- readLines("data.csv", encoding="UTF-8")
m2 <- gsub("\\\\\"", "\\\"", m1)    
writeLines(m2, "data_new.csv", useBytes = TRUE)
m3 <- fread("data_new.csv", encoding="UTF-8", sep=",")

That seems to work.

I don't understand this 100% though, so any clarifications are more than welcome.

330

asked Feb 25 '16 12:02

Lasse Hjorth Madsen

1 Answers

Not a data.table solution, but you could try:

# read the file with 'readLines'
tmp <- readLines("trl.txt")

# create a column name vector of the first line
nms <- trimws(strsplit(tmp[1],',')[[1]])

# convert 'tmp' to a dataframe except the first line
tmp <- as.data.frame(tmp[-1])

# use 'separate' from 'tidyr' to split into two columns
library(tidyr)
df1 <- separate(tmp, "tmp[-1]", nms, sep=",", extra = "merge")

which gives:

> df1
  id                             text
1  1 """Oops"",\\""The"",""Georgia"""

Update for edit 1: With the new example data fread seems to be reading the data normally:

> s1 <- fread("sample1.txt", sep=",")
> s1
               materiale_id      dk5                      description         creator subject-phrase                                         title  type
1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin             NA Koncert i Copenhagen Jazz House den 26.1.1995 music


> s2 <- fread("sample2.txt", sep=",")
> s2
               materiale_id      dk5
1: 125030-katalog:000000003 [78.793]
2:   150012-leksikon:100019         
                                                                                                                                                                                                           description
1:                                                                                                                                                                                    Privatoptagelse. - Liveoptagelse
2: Databehandling vedrÃ¸rer rutiner og procedurer for datareprÃ¦sentation, lagring af data, overfÃ¸rsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udfÃ¸rt med...
           creator                                                                                                                         subject-phrase
1: Frederik Lundin                                                                                                                                       
2:                 [""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""]
                                           title    type
1: Koncert i Copenhagen Jazz House den 26.1.1995   music
2:               it - elementer i databehandling article

Update for edit 2 & 3:

When you look at the error-message:

Error in fread("sample3.txt", sep = ",") : Expecting 7 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.

and then when you look at the second line of sample3.txt you will see that the fourth column contains comma's as well. You can solve this in three steps:

1: Read the file with readLines and replace the opening and closing character of the fourth column with another quote-character:

r3 <- readLines("sample3.txt")
r3 <- gsub('\"[',"'",r3,fixed=TRUE)
r3 <- gsub(']\"',"'",r3,fixed=TRUE)

2: Write it back to a text-file:

 writeLines(r3, "sample3-1.txt")

3: Now you can read it with fread (or read.table/read.csv). Because the number of column-titles is not the same as the number of columns, you will have to use header = FALSE. Also explicitely set the quote-character to the new quote-character as inserted in step 2:

s3 <- fread("sample3-1.txt", quote = "\'", header = FALSE, skip = 1)

which gives:

> s3
                         V1 V2 V3                   V4                                                        V5           V6                               V7    V8
1: 125030-katalog:000236595 NA NA Red Tampa Solist prf                         ""Tom"",""Georgia"",""1929-1930"" "Georgia Tom                       1929-1930" music
2: 125030-katalog:000236596 NA NA  Jane Lucas (Solist) ""1928-1931"",""Tom,\\""The"",""Georgia"",""Accompanist"" "Georgia Tom ""The Accompanist"" (1928-1931)" music

After that you can assign column names as follows:

names(s3) <- c("character","vector","with","eight","column","names")

_{NOTE: I used a pretty recent version (two weeks old) of v1.9.7 for this}

198

answered Oct 25 '22 08:10

Jaap

Related questions
                            
                                Plot timeline in R with only date variable
                            
                                Performing loops on list of lists of rasters
                            
                                Contour plots on a sphere surface
                            
                                rmarkdown html embed pdf/eps
                            
                                Plotting shp file in leaflet, works in ggplot
                            
                                How to get layer to top in Shiny Leaflet map
                            
                                R Lubridate Returns Unwanted Century When Given Two Digit Year
                            
                                Weird behaviour of lag function of dplyr inside mutate
                            
                                R ggplot geom_hex alpha transparency
                            
                                runapp shiny from other shiny application with actionbutton
                            
                                Manually creating a legend when you can't supply a color aesthetic
                            
                                convert factor to original numeric value
                            
                                R Shinyapps advanced settings
                            
                                R - Color or shade area between lines
                            
                                How to randomize a vector without repeating specific elements in predefined triples?
                            
                                Adding two rows of column names in Stargazer regression table
                            
                                shift() in data.table v1.9.6 is slow for many groups
                            
                                How to use tidyr (or similar data wrangling) on "big" data in postgreSQL (Redshift) database
                            
                                Using `:=` from data.table when data.table package is not loaded?
                            
                                ggplot2 geom_smooth with variable as factor

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Using fread to read data with double quotes and incorrect escape characters

Tags:

r

data.table

Lasse Hjorth Madsen

People also ask

1 Answers

Jaap

Recent Activity

Donate For Us