The order of my data is important. If I load a CSV into R using read.csv
, is the order of the rows in the dataframe guaranteed to match that of the CSV?
How about if I load a bunch of CSVs and rbind
them together and then use subset
to get at the data I'm interested in?
For example:
1.csv
foo,bar
a,123
a,456
c,789
2.csv
foo,bar
d,987
a,999
b,654
a,321
Will the following:
data1<-read.csv("1.csv", header=T)
data2<-read.csv("2.csv", header=T)
all_data<-rbind(data1, data2)
filtered<-subset(all_data, foo=="a")
...always produce a filtered
as:
foo bar
1 a 123
2 a 456
3 a 999
4 a 321
...and does this behaviour hold for arbitrary CSV inputs and filters?
The CSV file is stored in the current working directory with the name specified in the function write.csv (data frame, output CSV name) in R. The column X contains the row numbers of the original CSV file.
The CSV file is stored in the current working directory with the name specified in the function write.csv (data frame, output CSV name) in R. The column X contains the row numbers of the original CSV file. In order to remove it, we can specify an additional argument in the write.csv () function that set row names to FALSE.
R can create csv file form existing data frame. The write.csv () function is used to create the csv file. This file gets created in the working directory. Here the column X comes from the data set newper. This can be dropped using additional parameters while writing the file.
Moreover, you could also order the vector x by the index vector of the vector y. When working with a matrix or a data frame in R you could want to order the data by row or by column values.
Have a read through the source code for read.table
. It uses the scan
base function, which itself uses the file
and textConnection
functions. All of these appear to point toward you being able to read in data sequentially ("line" by "line" based on delimiter) and feeding it in.
function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",
numerals = c("allow.loss", "warn.loss", "no.loss"), row.names,
col.names, as.is = !stringsAsFactors, na.strings = "NA",
colClasses = NA, nrows = -1, skip = 0, check.names = TRUE,
fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#", allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(), fileEncoding = "",
encoding = "unknown", text, skipNul = FALSE)
{
if (missing(file) && !missing(text)) {
file <- textConnection(text, encoding = "UTF-8")
encoding <- "UTF-8"
on.exit(close(file))
}
if (is.character(file)) {
file <- if (nzchar(fileEncoding))
file(file, "rt", encoding = fileEncoding)
else file(file, "rt")
on.exit(close(file))
}
if (!inherits(file, "connection"))
stop("'file' must be a character string or connection")
if (!isOpen(file, "rt")) {
open(file, "rt")
on.exit(close(file))
}
pbEncoding <- if (encoding %in% c("", "bytes", "UTF-8"))
encoding
else "bytes"
numerals <- match.arg(numerals)
if (skip > 0L)
readLines(file, skip)
nlines <- n0lines <- if (nrows < 0L)
5
else min(5L, (header + nrows))
lines <- .External(C_readtablehead, file, nlines, comment.char,
blank.lines.skip, quote, sep, skipNul)
if (encoding %in% c("UTF-8", "latin1"))
Encoding(lines) <- encoding
nlines <- length(lines)
if (!nlines) {
if (missing(col.names))
stop("no lines available in input")
rlabp <- FALSE
cols <- length(col.names)
}
else {
if (all(!nzchar(lines)))
stop("empty beginning of file")
if (nlines < n0lines && file == 0L) {
pushBack(c(lines, lines, ""), file, encoding = pbEncoding)
on.exit((clearPushBack(stdin())))
}
else pushBack(c(lines, lines), file, encoding = pbEncoding)
first <- scan(file, what = "", sep = sep, quote = quote,
nlines = 1, quiet = TRUE, skip = 0, strip.white = TRUE,
blank.lines.skip = blank.lines.skip, comment.char = comment.char,
allowEscapes = allowEscapes, encoding = encoding,
skipNul = skipNul)
col1 <- if (missing(col.names))
length(first)
else length(col.names)
col <- numeric(nlines - 1L)
if (nlines > 1L)
for (i in seq_along(col)) col[i] <- length(scan(file,
what = "", sep = sep, quote = quote, nlines = 1,
quiet = TRUE, skip = 0, strip.white = strip.white,
blank.lines.skip = blank.lines.skip, comment.char = comment.char,
allowEscapes = allowEscapes, encoding = encoding,
skipNul = skipNul))
cols <- max(col1, col)
rlabp <- (cols - col1) == 1L
if (rlabp && missing(header))
header <- TRUE
if (!header)
rlabp <- FALSE
if (header) {
.External(C_readtablehead, file, 1L, comment.char,
blank.lines.skip, quote, sep, skipNul)
if (missing(col.names))
col.names <- first
else if (length(first) != length(col.names))
warning("header and 'col.names' are of different lengths")
}
else if (missing(col.names))
col.names <- paste0("V", 1L:cols)
if (length(col.names) + rlabp < cols)
stop("more columns than column names")
if (fill && length(col.names) > cols)
cols <- length(col.names)
if (!fill && cols > 0L && length(col.names) > cols)
stop("more column names than columns")
if (cols == 0L)
stop("first five rows are empty: giving up")
}
if (check.names)
col.names <- make.names(col.names, unique = TRUE)
if (rlabp)
col.names <- c("row.names", col.names)
nmColClasses <- names(colClasses)
if (is.null(nmColClasses)) {
if (length(colClasses) < cols)
colClasses <- rep_len(colClasses, cols)
}
else {
tmp <- rep_len(NA_character_, cols)
names(tmp) <- col.names
i <- match(nmColClasses, col.names, 0L)
if (any(i <= 0L))
warning("not all columns named in 'colClasses' exist")
tmp[i[i > 0L]] <- colClasses[i > 0L]
colClasses <- tmp
}
what <- rep.int(list(""), cols)
names(what) <- col.names
colClasses[colClasses %in% c("real", "double")] <- "numeric"
known <- colClasses %in% c("logical", "integer", "numeric",
"complex", "character", "raw")
what[known] <- sapply(colClasses[known], do.call, list(0))
what[colClasses %in% "NULL"] <- list(NULL)
keep <- !sapply(what, is.null)
data <- scan(file = file, what = what, sep = sep, quote = quote,
dec = dec, nmax = nrows, skip = 0, na.strings = na.strings,
quiet = TRUE, fill = fill, strip.white = strip.white,
blank.lines.skip = blank.lines.skip, multi.line = FALSE,
comment.char = comment.char, allowEscapes = allowEscapes,
flush = flush, encoding = encoding, skipNul = skipNul)
nlines <- length(data[[which.max(keep)]])
if (cols != length(data)) {
warning("cols = ", cols, " != length(data) = ", length(data),
domain = NA)
cols <- length(data)
}
if (is.logical(as.is)) {
as.is <- rep_len(as.is, cols)
}
else if (is.numeric(as.is)) {
if (any(as.is < 1 | as.is > cols))
stop("invalid numeric 'as.is' expression")
i <- rep.int(FALSE, cols)
i[as.is] <- TRUE
as.is <- i
}
else if (is.character(as.is)) {
i <- match(as.is, col.names, 0L)
if (any(i <= 0L))
warning("not all columns named in 'as.is' exist")
i <- i[i > 0L]
as.is <- rep.int(FALSE, cols)
as.is[i] <- TRUE
}
else if (length(as.is) != cols)
stop(gettextf("'as.is' has the wrong length %d != cols = %d",
length(as.is), cols), domain = NA)
do <- keep & !known
if (rlabp)
do[1L] <- FALSE
for (i in (1L:cols)[do]) {
data[[i]] <- if (is.na(colClasses[i]))
type.convert(data[[i]], as.is = as.is[i], dec = dec,
numerals = numerals, na.strings = character(0L))
else if (colClasses[i] == "factor")
as.factor(data[[i]])
else if (colClasses[i] == "Date")
as.Date(data[[i]])
else if (colClasses[i] == "POSIXct")
as.POSIXct(data[[i]])
else methods::as(data[[i]], colClasses[i])
}
compactRN <- TRUE
if (missing(row.names)) {
if (rlabp) {
row.names <- data[[1L]]
data <- data[-1L]
keep <- keep[-1L]
compactRN <- FALSE
}
else row.names <- .set_row_names(as.integer(nlines))
}
else if (is.null(row.names)) {
row.names <- .set_row_names(as.integer(nlines))
}
else if (is.character(row.names)) {
compactRN <- FALSE
if (length(row.names) == 1L) {
rowvar <- (1L:cols)[match(col.names, row.names, 0L) ==
1L]
row.names <- data[[rowvar]]
data <- data[-rowvar]
keep <- keep[-rowvar]
}
}
else if (is.numeric(row.names) && length(row.names) == 1L) {
compactRN <- FALSE
rlabp <- row.names
row.names <- data[[rlabp]]
data <- data[-rlabp]
keep <- keep[-rlabp]
}
else stop("invalid 'row.names' specification")
data <- data[keep]
if (is.object(row.names) || !(is.integer(row.names)))
row.names <- as.character(row.names)
if (!compactRN) {
if (length(row.names) != nlines)
stop("invalid 'row.names' length")
if (anyDuplicated(row.names))
stop("duplicate 'row.names' are not allowed")
if (anyNA(row.names))
stop("missing values in 'row.names' are not allowed")
}
class(data) <- "data.frame"
attr(data, "row.names") <- row.names
data
}
This is a basic code that you can use to double check results coming from read.csv
and subset
:
Compare read.csv
with readLines
Here you have a code that compare the result coming from read.csv
with readLines
(function reading line by line a file)
library("readr" )
library("rlist")
file1<-file.choose() #Select your csv file1
file2<-file.choose() #Select your csv file2
#readLines
input_list<-strsplit(readLines(file1),",")
db_readLines<-data.frame(list.rbind(input_list[2:length(input_list)]))
names(db_readLines)<-input_list[[1]]
#readd.csv
db_readcsv<-read.csv(file1,header = T,sep = ",")
#Comparison
if ((sum(db_readcsv==db_readLines)/(nrow(db_readcsv)*ncol(db_readcsv)))==1)
{
cat("Same data.frame")
} else
{
cat("Data.frames are differents")
}
You can use it with your csv file to compare results and verify that read.csv
preserves lines order as readLines
.
Compare subset
with rbind
+ basic filtering
About the second part of the question another easy test:
data1<-read.csv(file1, header=T,sep=",")
data2<-read.csv(file2, header=T,sep=",")
all_data<-rbind(data1, data2)
filtered1<-subset(all_data, foo=="a")
filtered2<-rbind(data1[data1$foo=="a",],data2[data2$foo=="a",])
#Comparison
if ((sum(filtered1==filtered2)/(nrow(filtered2)*ncol(filtered2)))==1)
{
cat("Same data.frame")
} else
{
cat("Data.frames are differents")
}
You can include this kind of tests in your code, but obviously this is inefficient and redundant.
It is safe to assume that all of those functions (read.csv
, rbind
, and subset
) are guaranteed to preserve the order of your data as in the original csv.
Personally, I prefer using dplyr::filter
over base::subset
. As explained in this answer the two work almost identically. The main difference is that subset
comes with a warning in ?subset
: "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [
, and in particular the non-standard evaluation of argument subset can have unanticipated consequences." filter
is designed to work robustly with the rest of dplyr
and the tidyverse
, both interactively and programmatically, and has a separate standard evaluation version filter_
for when necessary.. So perhaps filter
is a safer bet, especially if you're already using the dplyr
framework. The only disadvantage to filter
that I've encountered is that it does not keep rownames, while subset
does.
Either way, I really don't think you need to worry about rows being reshuffled. In my experience, all of these functions have always produced R objects ordered in terms of the original data. If you want to be ultra-careful, it wouldn't hurt to go with @user127649's suggestion and add a unique ID column as a back up. I'm always in favor of lazier options, but it might be worth peace of mind!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With