Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using fread to read data with double quotes and incorrect escape characters

Tags:

r

data.table

I try to load a large datafile (some 20 million rows) using fread() from the data.table package. Some rows are causing great trouble, however.

Minimal example:

text.csv contains:

id, text
1,"""Oops"",\""The"",""Georgia"""        

fread("text.csv", sep=",")

Error in fread("text.csv", sep = ",") : 
  Not positioned correctly after testing format of header row. ch=','
In addition: Warning message:
In fread("text.csv", sep = ",") :
  Starting data input on line 2 and discarding line 1 because it has too few or too many items to be column names or data: id, text

read.table() works somewhat better but is too slow and too memory inefficient.

> read.table("text.csv", header = TRUE, sep=",")
  id                     text
1  1 "Oops",\\"The","Georgia"

I realize that my text file is not properly formatted, but it is too large to edit in a practical manner.

Any help much appreciated.

EDIT:

A small sample of actual data records:

sample1.txt, a good record:

materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music

> fread("sample1.txt", sep=",")
               materiale_id      dk5                      description         creator subject-phrase
1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin             NA
                                           title  type
1: Koncert i Copenhagen Jazz House den 26.1.1995 music


sample2.txt, a good and a bad record:

materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music
150012-leksikon:100019,,"Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...",,"[""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""]",it - elementer i databehandling,article

> fread("sample2.txt", sep=",")
Empty data.table (0 rows) of 11 cols: 150012-leksikon:100019,V2,Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...,V4,[""Informatik","it"...

EDIT 2:

Updating to R version 3.2.3 and data.table 1.9.6. helps on the above, but creates issues with other records:

sample3.txt, a good and a bad record:

materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000236595,,,Red Tampa Solist prf,"[""Tom"",""Georgia"",""1929-1930""]","Georgia Tom, 1929-1930",music
125030-katalog:000236596,,,Jane Lucas (Solist),"[""1928-1931"",""Tom,\""The"",""Georgia"",""Accompanist""]","Georgia Tom,""The Accompanist"" (1928-1931)",music

> s3 <- fread("sample3.txt", sep=",")
Error in fread("sample3.txt", sep = ",") : 
  Expecting 7 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.

EDIT 3:

Updating to the development version 1.9.7 of data tables breaks fread() altogether:

> s3 <- fread("sample3.txt", sep=",")
Error in fread("sample3.txt", sep = ",") : 
  showProgress is not type integer but type 'logical'. Please report.

EDIT 4:

It seems that the problem in my file occurs when records contain the string \\" (litteraly, not regular expression). Apparently, there's one backslash too many, causing fread() to misinterpret a double quote as the end of a string, where it should have been taken litteraly.

My best solutions so far is to do this:

m1 <- readLines("data.csv", encoding="UTF-8")
m2 <- gsub("\\\\\"", "\\\"", m1)    
writeLines(m2, "data_new.csv", useBytes = TRUE)
m3 <- fread("data_new.csv", encoding="UTF-8", sep=",")

That seems to work.

I don't understand this 100% though, so any clarifications are more than welcome.

like image 330
Lasse Hjorth Madsen Avatar asked Feb 25 '16 12:02

Lasse Hjorth Madsen


People also ask

How do you escape a quote from a string?

Single quotes need to be escaped by backslash in single-quoted strings, and double quotes in double-quoted strings. Alternative forms for the last two are '⁠\u{nnnn}⁠' and '⁠\U{nnnnnnnn}⁠'.

How do you know if a string has a double quote?

To check if the string has double quotes you can use: text_line. Contains("\""); Here \" will escape the double-quote.


1 Answers

Not a data.table solution, but you could try:

# read the file with 'readLines'
tmp <- readLines("trl.txt")

# create a column name vector of the first line
nms <- trimws(strsplit(tmp[1],',')[[1]])

# convert 'tmp' to a dataframe except the first line
tmp <- as.data.frame(tmp[-1])

# use 'separate' from 'tidyr' to split into two columns
library(tidyr)
df1 <- separate(tmp, "tmp[-1]", nms, sep=",", extra = "merge")

which gives:

> df1
  id                             text
1  1 """Oops"",\\""The"",""Georgia"""

Update for edit 1: With the new example data fread seems to be reading the data normally:

> s1 <- fread("sample1.txt", sep=",")
> s1
               materiale_id      dk5                      description         creator subject-phrase                                         title  type
1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin             NA Koncert i Copenhagen Jazz House den 26.1.1995 music


> s2 <- fread("sample2.txt", sep=",")
> s2
               materiale_id      dk5
1: 125030-katalog:000000003 [78.793]
2:   150012-leksikon:100019         
                                                                                                                                                                                                           description
1:                                                                                                                                                                                    Privatoptagelse. - Liveoptagelse
2: Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...
           creator                                                                                                                         subject-phrase
1: Frederik Lundin                                                                                                                                       
2:                 [""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""]
                                           title    type
1: Koncert i Copenhagen Jazz House den 26.1.1995   music
2:               it - elementer i databehandling article

Update for edit 2 & 3:

When you look at the error-message:

Error in fread("sample3.txt", sep = ",") : Expecting 7 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.

and then when you look at the second line of sample3.txt you will see that the fourth column contains comma's as well. You can solve this in three steps:

1: Read the file with readLines and replace the opening and closing character of the fourth column with another quote-character:

r3 <- readLines("sample3.txt")
r3 <- gsub('\"[',"'",r3,fixed=TRUE)
r3 <- gsub(']\"',"'",r3,fixed=TRUE)

2: Write it back to a text-file:

 writeLines(r3, "sample3-1.txt")

3: Now you can read it with fread (or read.table/read.csv). Because the number of column-titles is not the same as the number of columns, you will have to use header = FALSE. Also explicitely set the quote-character to the new quote-character as inserted in step 2:

s3 <- fread("sample3-1.txt", quote = "\'", header = FALSE, skip = 1)

which gives:

> s3
                         V1 V2 V3                   V4                                                        V5           V6                               V7    V8
1: 125030-katalog:000236595 NA NA Red Tampa Solist prf                         ""Tom"",""Georgia"",""1929-1930"" "Georgia Tom                       1929-1930" music
2: 125030-katalog:000236596 NA NA  Jane Lucas (Solist) ""1928-1931"",""Tom,\\""The"",""Georgia"",""Accompanist"" "Georgia Tom ""The Accompanist"" (1928-1931)" music

After that you can assign column names as follows:

names(s3) <- c("character","vector","with","eight","column","names")

NOTE: I used a pretty recent version (two weeks old) of v1.9.7 for this

like image 198
Jaap Avatar answered Oct 25 '22 08:10

Jaap