I have a comma-separated dataset of around 10,000 rows. When I read it with read.csv, R created a data frame with fewer rows than the original file: it excluded/rejected about 200 rows. When I open the CSV file in Excel, it looks fine; both the line delimiters and the field delimiters are well formed (as far as Excel's parsing goes).
I have identified the row numbers in my file that are getting rejected, but I can't identify the cause by glancing over them.
Is there any way to look at logs, or anything similar, that records why R rejected these rows?
The OP indicates that the problem is caused by quotes in the CSV file.
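One way to locate the offending lines without any log is to count the fields R sees on each line with base R's count.fields(); with the default quote handling, lines swallowed by an unclosed quote come back as NA (filename here is the path to your CSV):
n_fields <- count.fields(filename, sep = ",")
table(n_fields, useNA = "ifany")        # most lines should share one field count
which(is.na(n_fields) | n_fields != median(n_fields, na.rm = TRUE))   # suspect line numbers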
When the records in the CSV file are themselves not quoted, but a few of them contain quote characters, the file can be read using the quote="" option of read.csv, which disables quote handling:
data <- read.csv(filename, quote="")
Another solution is to remove all quotes from the file, but this also modifies the data (your strings no longer contain any quotes) and will cause problems if your fields contain commas.
lines <- readLines(filename)
lines <- gsub('"', '', lines, fixed=TRUE)
data <- read.csv(textConnection(lines))
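As a quick sanity check (assuming a single header row and no blank lines), the number of parsed rows should now match the number of lines in the file:
length(lines) - 1 == nrow(data)   # TRUE if no rows were dropped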
A slightly safer solution is to delete quotes only when they are not directly before or after a comma, so that quotes delimiting whole fields are kept:
lines <- readLines(filename)
lines <- gsub('([^,])"([^,])', '\\1\\2', lines)
data <- read.csv(textConnection(lines))
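For example, on a made-up line with a stray inch mark inside an unquoted field, the interior quote is dropped while the quotes around the comma-containing field survive:
gsub('([^,])"([^,])', '\\1\\2', 'John,5 ft 10" tall,"New York, NY"')
# [1] "John,5 ft 10 tall,\"New York, NY\""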
I had the same issue: the number of rows present in the CSV file and the number of rows read by read.csv() differed significantly. I used fread() from the data.table package in place of read.csv and it solved the problem.
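For reference, a minimal sketch of that approach (data.table must be installed; filename is the path to your CSV as above):
library(data.table)
data <- fread(filename)       # fread detects the separator and tends to be more tolerant of irregular quoting
data <- as.data.frame(data)   # optional: convert back to a plain data frame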