I have a file with 22268 rows by 2521 columns. I try to read it in using this line of code:
file <- read.table(textfile, skip=2, header=TRUE, sep="\t", fill=TRUE, blank.lines.skip=FALSE)
but I only get 13024 rows by 2521 columns and the following warning:
Warning message: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : number of items read is not a multiple of the number of columns
I also used this command to see which rows had an incorrect number of columns:
x <-count.fields(textfile, sep="\t", skip=2)
incorrect <- which(x != 2521)
and got back a list of about 20 rows that were incorrect.
Is there a way to fill these rows with NA values?
I thought that was what the fill argument to read.table did, but apparently not.
OR
Is there a way to skip the rows identified in the incorrect variable?
You can use readLines() to pull the file in, then find and drop the offending rows yourself. (read.table's fill=TRUE only pads rows that come up short against the column count it infers from the first few lines of the file; it doesn't help when lines have too many fields.)
con <- file(textfile, "rb")
rawContent <- readLines(con) # read every line of the file as one character string
close(con) # close the connection to the file, to keep things tidy
Then take a look at rawContent.
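For instance, a quick look at what came back:
length(rawContent) # total number of lines read, skipped lines and header included
head(rawContent, 3) # the first few raw lines, tabs and all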
To find the rows with an incorrect number of columns, for example:
expectedColumns <- 2521
delim <- "\t"
indxToOffenders <-
  which(sapply(rawContent, function(x) # for each line in rawContent
    length(gregexpr(delim, x)[[1]]) != expectedColumns - 1 # a line with 2521 fields contains 2520 delimiters
  ))
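As a sanity check, the number of flagged lines should be in the same ballpark as the roughly 20 rows count.fields reported (plus the skipped lines and header, depending on their field counts):
length(indxToOffenders) # how many lines were flagged
rawContent[head(indxToOffenders)] # eyeball a few of the offenders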
Then to read in your data. Note that read.csv needs the cleaned lines passed through its text= argument, and that rawContent still contains the 2 leading lines you were skipping with skip=2, so drop those too if the check above didn't already flag them:
myDataFrame <- read.csv(text=rawContent[-indxToOffenders], header=TRUE, sep=delim)
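Alternatively, since count.fields already gave you the offending row numbers, you can drop those lines directly. A sketch along those lines, assuming (as your read.table call implies) 2 lines to skip followed by a header row that has the full 2521 fields:
allLines <- readLines(textfile)
x <- count.fields(textfile, sep="\t", skip=2)
incorrect <- which(x != 2521) # these indices start counting from the line after the skip
badLines <- incorrect + 2 # shift back to absolute line numbers in the file
myDataFrame <- read.csv(text=allLines[-c(1:2, badLines)], header=TRUE, sep="\t")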