CSV order preserved in R

Tags: dataframe, r, csv, subset

The order of my data is important. If I load a CSV into R using read.csv, is the order of the rows in the dataframe guaranteed to match that of the CSV?

How about if I load a bunch of CSVs and rbind them together and then use subset to get at the data I'm interested in?

For example:

1.csv

foo,bar
a,123
a,456
c,789

2.csv

foo,bar
d,987
a,999
b,654
a,321

Will the following:

data1<-read.csv("1.csv", header=T)
data2<-read.csv("2.csv", header=T)
all_data<-rbind(data1, data2)
filtered<-subset(all_data, foo=="a")

...always produce a filtered as:

   foo  bar
1    a  123
2    a  456
3    a  999
4    a  321

...and does this behaviour hold for arbitrary CSV inputs and filters?


asked Jan 18 '17 18:01

Xophmeister

3 Answers

Have a read through the source code for read.table (read.csv is a thin wrapper around it). It opens the input with file or textConnection and then hands the connection to the base function scan. All of this points toward the data being read sequentially, record by record ("line" by "line", split on the delimiter), and stored in the order it was read.

function (file, header = FALSE, sep = "", quote = "\"'", dec = ".", 
    numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, 
    col.names, as.is = !stringsAsFactors, na.strings = "NA", 
    colClasses = NA, nrows = -1, skip = 0, check.names = TRUE, 
    fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE, 
    comment.char = "#", allowEscapes = FALSE, flush = FALSE, 
    stringsAsFactors = default.stringsAsFactors(), fileEncoding = "", 
    encoding = "unknown", text, skipNul = FALSE) 
{
    if (missing(file) && !missing(text)) {
        file <- textConnection(text, encoding = "UTF-8")
        encoding <- "UTF-8"
        on.exit(close(file))
    }
    if (is.character(file)) {
        file <- if (nzchar(fileEncoding)) 
            file(file, "rt", encoding = fileEncoding)
        else file(file, "rt")
        on.exit(close(file))
    }
    if (!inherits(file, "connection")) 
        stop("'file' must be a character string or connection")
    if (!isOpen(file, "rt")) {
        open(file, "rt")
        on.exit(close(file))
    }
    pbEncoding <- if (encoding %in% c("", "bytes", "UTF-8")) 
        encoding
    else "bytes"
    numerals <- match.arg(numerals)
    if (skip > 0L) 
        readLines(file, skip)
    nlines <- n0lines <- if (nrows < 0L) 
        5
    else min(5L, (header + nrows))
    lines <- .External(C_readtablehead, file, nlines, comment.char, 
        blank.lines.skip, quote, sep, skipNul)
    if (encoding %in% c("UTF-8", "latin1")) 
        Encoding(lines) <- encoding
    nlines <- length(lines)
    if (!nlines) {
        if (missing(col.names)) 
            stop("no lines available in input")
        rlabp <- FALSE
        cols <- length(col.names)
    }
    else {
        if (all(!nzchar(lines))) 
            stop("empty beginning of file")
        if (nlines < n0lines && file == 0L) {
            pushBack(c(lines, lines, ""), file, encoding = pbEncoding)
            on.exit((clearPushBack(stdin())))
        }
        else pushBack(c(lines, lines), file, encoding = pbEncoding)
        first <- scan(file, what = "", sep = sep, quote = quote, 
            nlines = 1, quiet = TRUE, skip = 0, strip.white = TRUE, 
            blank.lines.skip = blank.lines.skip, comment.char = comment.char, 
            allowEscapes = allowEscapes, encoding = encoding, 
            skipNul = skipNul)
        col1 <- if (missing(col.names)) 
            length(first)
        else length(col.names)
        col <- numeric(nlines - 1L)
        if (nlines > 1L) 
            for (i in seq_along(col)) col[i] <- length(scan(file, 
                what = "", sep = sep, quote = quote, nlines = 1, 
                quiet = TRUE, skip = 0, strip.white = strip.white, 
                blank.lines.skip = blank.lines.skip, comment.char = comment.char, 
                allowEscapes = allowEscapes, encoding = encoding, 
                skipNul = skipNul))
        cols <- max(col1, col)
        rlabp <- (cols - col1) == 1L
        if (rlabp && missing(header)) 
            header <- TRUE
        if (!header) 
            rlabp <- FALSE
        if (header) {
            .External(C_readtablehead, file, 1L, comment.char, 
                blank.lines.skip, quote, sep, skipNul)
            if (missing(col.names)) 
                col.names <- first
            else if (length(first) != length(col.names)) 
                warning("header and 'col.names' are of different lengths")
        }
        else if (missing(col.names)) 
            col.names <- paste0("V", 1L:cols)
        if (length(col.names) + rlabp < cols) 
            stop("more columns than column names")
        if (fill && length(col.names) > cols) 
            cols <- length(col.names)
        if (!fill && cols > 0L && length(col.names) > cols) 
            stop("more column names than columns")
        if (cols == 0L) 
            stop("first five rows are empty: giving up")
    }
    if (check.names) 
        col.names <- make.names(col.names, unique = TRUE)
    if (rlabp) 
        col.names <- c("row.names", col.names)
    nmColClasses <- names(colClasses)
    if (is.null(nmColClasses)) {
        if (length(colClasses) < cols) 
            colClasses <- rep_len(colClasses, cols)
    }
    else {
        tmp <- rep_len(NA_character_, cols)
        names(tmp) <- col.names
        i <- match(nmColClasses, col.names, 0L)
        if (any(i <= 0L)) 
            warning("not all columns named in 'colClasses' exist")
        tmp[i[i > 0L]] <- colClasses[i > 0L]
        colClasses <- tmp
    }
    what <- rep.int(list(""), cols)
    names(what) <- col.names
    colClasses[colClasses %in% c("real", "double")] <- "numeric"
    known <- colClasses %in% c("logical", "integer", "numeric", 
        "complex", "character", "raw")
    what[known] <- sapply(colClasses[known], do.call, list(0))
    what[colClasses %in% "NULL"] <- list(NULL)
    keep <- !sapply(what, is.null)
    data <- scan(file = file, what = what, sep = sep, quote = quote, 
        dec = dec, nmax = nrows, skip = 0, na.strings = na.strings, 
        quiet = TRUE, fill = fill, strip.white = strip.white, 
        blank.lines.skip = blank.lines.skip, multi.line = FALSE, 
        comment.char = comment.char, allowEscapes = allowEscapes, 
        flush = flush, encoding = encoding, skipNul = skipNul)
    nlines <- length(data[[which.max(keep)]])
    if (cols != length(data)) {
        warning("cols = ", cols, " != length(data) = ", length(data), 
            domain = NA)
        cols <- length(data)
    }
    if (is.logical(as.is)) {
        as.is <- rep_len(as.is, cols)
    }
    else if (is.numeric(as.is)) {
        if (any(as.is < 1 | as.is > cols)) 
            stop("invalid numeric 'as.is' expression")
        i <- rep.int(FALSE, cols)
        i[as.is] <- TRUE
        as.is <- i
    }
    else if (is.character(as.is)) {
        i <- match(as.is, col.names, 0L)
        if (any(i <= 0L)) 
            warning("not all columns named in 'as.is' exist")
        i <- i[i > 0L]
        as.is <- rep.int(FALSE, cols)
        as.is[i] <- TRUE
    }
    else if (length(as.is) != cols) 
        stop(gettextf("'as.is' has the wrong length %d  != cols = %d", 
            length(as.is), cols), domain = NA)
    do <- keep & !known
    if (rlabp) 
        do[1L] <- FALSE
    for (i in (1L:cols)[do]) {
        data[[i]] <- if (is.na(colClasses[i])) 
            type.convert(data[[i]], as.is = as.is[i], dec = dec, 
                numerals = numerals, na.strings = character(0L))
        else if (colClasses[i] == "factor") 
            as.factor(data[[i]])
        else if (colClasses[i] == "Date") 
            as.Date(data[[i]])
        else if (colClasses[i] == "POSIXct") 
            as.POSIXct(data[[i]])
        else methods::as(data[[i]], colClasses[i])
    }
    compactRN <- TRUE
    if (missing(row.names)) {
        if (rlabp) {
            row.names <- data[[1L]]
            data <- data[-1L]
            keep <- keep[-1L]
            compactRN <- FALSE
        }
        else row.names <- .set_row_names(as.integer(nlines))
    }
    else if (is.null(row.names)) {
        row.names <- .set_row_names(as.integer(nlines))
    }
    else if (is.character(row.names)) {
        compactRN <- FALSE
        if (length(row.names) == 1L) {
            rowvar <- (1L:cols)[match(col.names, row.names, 0L) == 
                1L]
            row.names <- data[[rowvar]]
            data <- data[-rowvar]
            keep <- keep[-rowvar]
        }
    }
    else if (is.numeric(row.names) && length(row.names) == 1L) {
        compactRN <- FALSE
        rlabp <- row.names
        row.names <- data[[rlabp]]
        data <- data[-rlabp]
        keep <- keep[-rlabp]
    }
    else stop("invalid 'row.names' specification")
    data <- data[keep]
    if (is.object(row.names) || !(is.integer(row.names))) 
        row.names <- as.character(row.names)
    if (!compactRN) {
        if (length(row.names) != nlines) 
            stop("invalid 'row.names' length")
        if (anyDuplicated(row.names)) 
            stop("duplicate 'row.names' are not allowed")
        if (anyNA(row.names)) 
            stop("missing values in 'row.names' are not allowed")
    }
    class(data) <- "data.frame"
    attr(data, "row.names") <- row.names
    data
}
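
A quick way to see this sequential behaviour without tracing the source is to parse a small in-memory CSV and rebuild each line from the resulting data frame. This is only a sketch: the `csv_text` sample is made up for illustration (it mirrors 1.csv from the question), and it checks a single input rather than proving a guarantee.

```r
# Sample data invented for illustration (same content as 1.csv in the question)
csv_text <- "foo,bar\na,123\na,456\nc,789"

# read.csv() accepts the data directly through its `text` argument
df <- read.csv(text = csv_text, header = TRUE)

# Raw data lines in file order, with the header dropped
raw_lines <- strsplit(csv_text, "\n", fixed = TRUE)[[1]][-1]

# Rebuild each line from the data frame, row by row
rebuilt <- paste(df$foo, df$bar, sep = ",")

# TRUE when row i of the data frame came from data line i of the input
identical(rebuilt, raw_lines)
```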


answered Oct 16 '22 03:10

Kamil

Here is some basic code that you can use to double-check the results coming from read.csv and subset:

Compare read.csv with readLines

This code compares the result of read.csv with that of readLines (a function that reads a file line by line):

  file1 <- file.choose()  # select your first CSV file
  file2 <- file.choose()  # select your second CSV file

  # readLines: read the raw lines and split them on commas
  # (assumes no quoted fields that contain commas)
  input_list <- strsplit(readLines(file1), ",")
  db_readLines <- data.frame(do.call(rbind, input_list[-1]), stringsAsFactors = FALSE)
  names(db_readLines) <- input_list[[1]]

  # read.csv
  db_readcsv <- read.csv(file1, header = TRUE, sep = ",", stringsAsFactors = FALSE)

  # Comparison: every cell must match, position by position
  if (all(db_readcsv == db_readLines))
  {
    cat("Same data.frame")
  } else
  {
    cat("Data frames are different")
  }

You can run it on your own CSV file to compare the results and verify that read.csv preserves the line order, just as readLines does.

Compare subset with rbind + basic filtering

For the second part of the question, here is another easy test:

  data1 <- read.csv(file1, header = TRUE, sep = ",")
  data2 <- read.csv(file2, header = TRUE, sep = ",")
  all_data <- rbind(data1, data2)

  # Filter after combining...
  filtered1 <- subset(all_data, foo == "a")

  # ...versus filtering each file first, then combining
  filtered2 <- rbind(data1[data1$foo == "a", ], data2[data2$foo == "a", ])

  # Comparison: every cell must match, position by position
  if (all(filtered1 == filtered2))
  {
    cat("Same data.frame")
  } else
  {
    cat("Data frames are different")
  }

You can include this kind of test in your code, but obviously it is inefficient and redundant.
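
For the "load a bunch of CSVs" case, the order in which the files are stacked matters as much as the order within each file. The sketch below recreates the question's two files in a temporary directory (the directory and the `files` vector are assumptions made for illustration) and stacks them in an explicitly stated order. One thing to watch for: list.files() returns names sorted alphabetically, so a file called "10.csv" would be read before "2.csv" unless you order the names yourself.

```r
# Write the two example files from the question to a temporary directory
tmp <- tempfile("csv_order_")
dir.create(tmp)
writeLines(c("foo,bar", "a,123", "a,456", "c,789"), file.path(tmp, "1.csv"))
writeLines(c("foo,bar", "d,987", "a,999", "b,654", "a,321"), file.path(tmp, "2.csv"))

# List the files explicitly, in the order they should be stacked
files <- file.path(tmp, c("1.csv", "2.csv"))

# Read each file and stack the results in the order given by `files`
all_data <- do.call(rbind, lapply(files, read.csv, header = TRUE))

# Standard subsetting keeps the surviving rows in their existing order
filtered <- all_data[all_data$foo == "a", ]
filtered$bar  # 123 456 999 321

# Clean up the temporary files
unlink(tmp, recursive = TRUE)
```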

answered Oct 16 '22 03:10

Terru_theTerror

It is safe to assume that all of those functions (read.csv, rbind, and subset) are guaranteed to preserve the order of your data as in the original csv.

Personally, I prefer using dplyr::filter over base::subset. As explained in this answer, the two work almost identically. The main difference is that subset comes with a warning in ?subset: "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences." filter is designed to work robustly with the rest of dplyr and the tidyverse, both interactively and programmatically, and had a separate standard evaluation version, filter_, for when necessary (since deprecated in favour of tidy evaluation). So perhaps filter is a safer bet, especially if you're already using the dplyr framework. The only disadvantage to filter that I've encountered is that it does not keep row names, while subset does.
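
To make the row-name point concrete, here is a small base R sketch comparing subset with the standard `[` subsetting that ?subset recommends. The data frame is typed in by hand to match the question's combined files. Both approaches keep the rows in their original order, and both keep the original row names, so the result is labelled 1, 2, 5, 7 rather than 1 to 4 as shown in the question.

```r
# Combined contents of the question's 1.csv and 2.csv, entered by hand
all_data <- data.frame(
  foo = c("a", "a", "c", "d", "a", "b", "a"),
  bar = c(123, 456, 789, 987, 999, 654, 321)
)

by_subset  <- subset(all_data, foo == "a")
by_bracket <- all_data[all_data$foo == "a", ]

# Both keep the matching rows in their original order
by_subset$bar
by_bracket$bar

# Both keep the original row names instead of renumbering
rownames(by_bracket)

# Reset the row names if a clean 1..n sequence is wanted
rownames(by_subset) <- NULL
```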

Either way, I really don't think you need to worry about rows being reshuffled. In my experience, all of these functions have always produced R objects in the same order as the original data. If you want to be ultra-careful, it wouldn't hurt to go with @user127649's suggestion and add a unique ID column as a backup. I'm always in favor of lazier options, but it might be worth it for peace of mind!
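
If you do go the ID-column route, it might look something like the sketch below. The column names `src` and `line` are hypothetical, chosen only for this example, and the data frames stand in for the result of read.csv on the question's two files.

```r
# Stand-ins for read.csv("1.csv") and read.csv("2.csv")
data1 <- data.frame(foo = c("a", "a", "c"), bar = c(123, 456, 789))
data2 <- data.frame(foo = c("d", "a", "b", "a"), bar = c(987, 999, 654, 321))

# Hypothetical helper columns: source file number and position within the file
data1$src <- 1L
data1$line <- seq_len(nrow(data1))
data2$src <- 2L
data2$line <- seq_len(nrow(data2))

all_data <- rbind(data1, data2)
filtered <- subset(all_data, foo == "a")

# If order was preserved, sorting by the ID columns changes nothing
expected_order <- order(filtered$src, filtered$line)
identical(expected_order, seq_len(nrow(filtered)))

# The IDs can also restore the order later if some step does shuffle rows
restored <- filtered[order(filtered$src, filtered$line), ]
```

The extra columns cost little and give you a way to re-sort after operations that are not order-preserving, such as merge with its default sorting.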

answered Oct 16 '22 02:10

Joy
