
read.csv in R doesn't import all rows from csv file

Tags:

r

I have a comma-separated dataset of around 10,000 rows. When I read it with read.csv, R creates a data frame with fewer rows than the original file: it excludes/rejects about 200 rows. When I open the csv file in Excel, the file looks fine, and it is well formed for both line delimiters and field delimiters (as far as Excel's parsing goes).

I have identified which row numbers in my file are being rejected, but I can't spot the cause just by looking at those rows.

Is there any way to see a log, or some other diagnostic, that explains why R rejected these records?
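
One way to narrow this down (a minimal diagnostic sketch, not part of the original question; filename stands for your file path) is to compare the physical line count with what read.csv produces, and to check the parsed field count of every line:

lines <- readLines(filename)
length(lines)                     # physical lines in the file
nrow(read.csv(filename))          # rows that actually survive parsing
# count.fields() parses each line the way read.csv would; a line whose field
# count differs from the norm, or is NA (an opened quote spanning lines),
# is a likely culprit
fields <- count.fields(filename, sep = ",")
expected <- as.integer(names(which.max(table(fields))))
which(is.na(fields) | fields != expected)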

asked Nov 30 '22 by user3422637

2 Answers

As the OP discovered, the problem is caused by quotes in the CSV file: an unbalanced quote character makes read.csv treat everything up to the next quote, including newlines, as part of one field, so several physical lines collapse into a single row (often accompanied by a warning such as "EOF within quoted string", which is the closest thing to a log).

When the records in the CSV file are not quoted, but a few of them contain quote characters, the file can be read with the quote="" option of read.csv, which disables quote processing entirely:

# treat quote characters as ordinary data rather than as field delimiters
data <- read.csv(filename, quote="")
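
To see why this helps, consider a hypothetical three-row example (made-up data, not the OP's): the unbalanced quote in row 2 swallows the following newline, so the default settings return two rows (and usually warn about an EOF within a quoted string), while quote="" recovers all three:

csv <- 'id,name\n1,Alice\n2,"he said hello\n3,Bob'
nrow(read.csv(text = csv))                # 2 -- rows 2 and 3 were merged
nrow(read.csv(text = csv, quote = ""))    # 3 -- every physical line is a row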

Another solution is to remove all quotes from the file before reading it. However, this also modifies the data (your strings no longer contain any quotes) and will cause problems if any of your fields contain commas, since those fields are no longer protected by quoting:

lines <- readLines(filename)
# strip every double quote; fixed=TRUE treats the pattern as a literal string
lines <- gsub('"', '', lines, fixed=TRUE)
data <- read.csv(textConnection(lines))

A slightly safer solution deletes a quote only when it is not directly preceded or followed by a comma, so the quotes that delimit whole fields (and protect embedded commas) are left intact:

lines <- readLines(filename)
# drop a quote only when both neighbours are non-comma characters, i.e. when
# it sits inside a field rather than at a field boundary
lines <- gsub('([^,])"([^,])', '\\1\\2', lines)
data <- read.csv(textConnection(lines))
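
For instance, on a made-up line the stray quotes around hi are removed while the quotes protecting the comma inside "a, b" survive:

gsub('([^,])"([^,])', '\\1\\2', 'x,he said "hi" there,"a, b",y')
# [1] "x,he said hi there,\"a, b\",y"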
answered Dec 05 '22 by Jan van der Laan


I had the same issue: there was a significant difference between the number of rows in the csv file and the number of rows read by the read.csv() command. I used the fread() command from the data.table package in place of read.csv and it solved the problem.
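
A minimal sketch of that approach (assuming the data.table package is installed; filename is the same placeholder as above):

library(data.table)
dt <- fread(filename)      # fread detects the format and copes better with irregular quoting
df <- as.data.frame(dt)    # fread returns a data.table; convert if a plain data.frame is needed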

answered Dec 05 '22 by Adnan Khalid