Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

CSV order preserved in R

The order of my data is important. If I load a CSV into R using read.csv, is the order of the rows in the dataframe guaranteed to match that of the CSV?

How about if I load a bunch of CSVs and rbind them together and then use subset to get at the data I'm interested in?

For example:

1.csv

foo,bar
a,123
a,456
c,789

2.csv

foo,bar
d,987
a,999
b,654
a,321

Will the following:

data1<-read.csv("1.csv", header=T)
data2<-read.csv("2.csv", header=T)
all_data<-rbind(data1, data2)
filtered<-subset(all_data, foo=="a")

...always produce a filtered as:

   foo  bar
1    a  123
2    a  456
3    a  999
4    a  321

...and does this behaviour hold for arbitrary CSV inputs and filters?

like image 863
Xophmeister Avatar asked Jan 18 '17 18:01

Xophmeister


People also ask

Where is the CSV file stored in R?

The CSV file is stored in the current working directory with the name specified in the function write.csv (data frame, output CSV name) in R. The column X contains the row numbers of the original CSV file.

How to remove row numbers from a CSV file in R?

The CSV file is stored in the current working directory with the name specified in the function write.csv (data frame, output CSV name) in R. The column X contains the row numbers of the original CSV file. In order to remove it, we can specify an additional argument in the write.csv () function that set row names to FALSE.

How to create CSV file from existing data frame in R?

R can create csv file form existing data frame. The write.csv () function is used to create the csv file. This file gets created in the working directory. Here the column X comes from the data set newper. This can be dropped using additional parameters while writing the file.

How to order the data in R?

Moreover, you could also order the vector x by the index vector of the vector y. When working with a matrix or a data frame in R you could want to order the data by row or by column values.


3 Answers

Have a read through the source code for read.table. It uses the scan base function, which itself uses the file and textConnection functions. All of these appear to point toward you being able to read in data sequentially ("line" by "line" based on delimiter) and feeding it in.

function (file, header = FALSE, sep = "", quote = "\"'", dec = ".", 
    numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, 
    col.names, as.is = !stringsAsFactors, na.strings = "NA", 
    colClasses = NA, nrows = -1, skip = 0, check.names = TRUE, 
    fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE, 
    comment.char = "#", allowEscapes = FALSE, flush = FALSE, 
    stringsAsFactors = default.stringsAsFactors(), fileEncoding = "", 
    encoding = "unknown", text, skipNul = FALSE) 
{
    if (missing(file) && !missing(text)) {
        file <- textConnection(text, encoding = "UTF-8")
        encoding <- "UTF-8"
        on.exit(close(file))
    }
    if (is.character(file)) {
        file <- if (nzchar(fileEncoding)) 
            file(file, "rt", encoding = fileEncoding)
        else file(file, "rt")
        on.exit(close(file))
    }
    if (!inherits(file, "connection")) 
        stop("'file' must be a character string or connection")
    if (!isOpen(file, "rt")) {
        open(file, "rt")
        on.exit(close(file))
    }
    pbEncoding <- if (encoding %in% c("", "bytes", "UTF-8")) 
        encoding
    else "bytes"
    numerals <- match.arg(numerals)
    if (skip > 0L) 
        readLines(file, skip)
    nlines <- n0lines <- if (nrows < 0L) 
        5
    else min(5L, (header + nrows))
    lines <- .External(C_readtablehead, file, nlines, comment.char, 
        blank.lines.skip, quote, sep, skipNul)
    if (encoding %in% c("UTF-8", "latin1")) 
        Encoding(lines) <- encoding
    nlines <- length(lines)
    if (!nlines) {
        if (missing(col.names)) 
            stop("no lines available in input")
        rlabp <- FALSE
        cols <- length(col.names)
    }
    else {
        if (all(!nzchar(lines))) 
            stop("empty beginning of file")
        if (nlines < n0lines && file == 0L) {
            pushBack(c(lines, lines, ""), file, encoding = pbEncoding)
            on.exit((clearPushBack(stdin())))
        }
        else pushBack(c(lines, lines), file, encoding = pbEncoding)
        first <- scan(file, what = "", sep = sep, quote = quote, 
            nlines = 1, quiet = TRUE, skip = 0, strip.white = TRUE, 
            blank.lines.skip = blank.lines.skip, comment.char = comment.char, 
            allowEscapes = allowEscapes, encoding = encoding, 
            skipNul = skipNul)
        col1 <- if (missing(col.names)) 
            length(first)
        else length(col.names)
        col <- numeric(nlines - 1L)
        if (nlines > 1L) 
            for (i in seq_along(col)) col[i] <- length(scan(file, 
                what = "", sep = sep, quote = quote, nlines = 1, 
                quiet = TRUE, skip = 0, strip.white = strip.white, 
                blank.lines.skip = blank.lines.skip, comment.char = comment.char, 
                allowEscapes = allowEscapes, encoding = encoding, 
                skipNul = skipNul))
        cols <- max(col1, col)
        rlabp <- (cols - col1) == 1L
        if (rlabp && missing(header)) 
            header <- TRUE
        if (!header) 
            rlabp <- FALSE
        if (header) {
            .External(C_readtablehead, file, 1L, comment.char, 
                blank.lines.skip, quote, sep, skipNul)
            if (missing(col.names)) 
                col.names <- first
            else if (length(first) != length(col.names)) 
                warning("header and 'col.names' are of different lengths")
        }
        else if (missing(col.names)) 
            col.names <- paste0("V", 1L:cols)
        if (length(col.names) + rlabp < cols) 
            stop("more columns than column names")
        if (fill && length(col.names) > cols) 
            cols <- length(col.names)
        if (!fill && cols > 0L && length(col.names) > cols) 
            stop("more column names than columns")
        if (cols == 0L) 
            stop("first five rows are empty: giving up")
    }
    if (check.names) 
        col.names <- make.names(col.names, unique = TRUE)
    if (rlabp) 
        col.names <- c("row.names", col.names)
    nmColClasses <- names(colClasses)
    if (is.null(nmColClasses)) {
        if (length(colClasses) < cols) 
            colClasses <- rep_len(colClasses, cols)
    }
    else {
        tmp <- rep_len(NA_character_, cols)
        names(tmp) <- col.names
        i <- match(nmColClasses, col.names, 0L)
        if (any(i <= 0L)) 
            warning("not all columns named in 'colClasses' exist")
        tmp[i[i > 0L]] <- colClasses[i > 0L]
        colClasses <- tmp
    }
    what <- rep.int(list(""), cols)
    names(what) <- col.names
    colClasses[colClasses %in% c("real", "double")] <- "numeric"
    known <- colClasses %in% c("logical", "integer", "numeric", 
        "complex", "character", "raw")
    what[known] <- sapply(colClasses[known], do.call, list(0))
    what[colClasses %in% "NULL"] <- list(NULL)
    keep <- !sapply(what, is.null)
    data <- scan(file = file, what = what, sep = sep, quote = quote, 
        dec = dec, nmax = nrows, skip = 0, na.strings = na.strings, 
        quiet = TRUE, fill = fill, strip.white = strip.white, 
        blank.lines.skip = blank.lines.skip, multi.line = FALSE, 
        comment.char = comment.char, allowEscapes = allowEscapes, 
        flush = flush, encoding = encoding, skipNul = skipNul)
    nlines <- length(data[[which.max(keep)]])
    if (cols != length(data)) {
        warning("cols = ", cols, " != length(data) = ", length(data), 
            domain = NA)
        cols <- length(data)
    }
    if (is.logical(as.is)) {
        as.is <- rep_len(as.is, cols)
    }
    else if (is.numeric(as.is)) {
        if (any(as.is < 1 | as.is > cols)) 
            stop("invalid numeric 'as.is' expression")
        i <- rep.int(FALSE, cols)
        i[as.is] <- TRUE
        as.is <- i
    }
    else if (is.character(as.is)) {
        i <- match(as.is, col.names, 0L)
        if (any(i <= 0L)) 
            warning("not all columns named in 'as.is' exist")
        i <- i[i > 0L]
        as.is <- rep.int(FALSE, cols)
        as.is[i] <- TRUE
    }
    else if (length(as.is) != cols) 
        stop(gettextf("'as.is' has the wrong length %d  != cols = %d", 
            length(as.is), cols), domain = NA)
    do <- keep & !known
    if (rlabp) 
        do[1L] <- FALSE
    for (i in (1L:cols)[do]) {
        data[[i]] <- if (is.na(colClasses[i])) 
            type.convert(data[[i]], as.is = as.is[i], dec = dec, 
                numerals = numerals, na.strings = character(0L))
        else if (colClasses[i] == "factor") 
            as.factor(data[[i]])
        else if (colClasses[i] == "Date") 
            as.Date(data[[i]])
        else if (colClasses[i] == "POSIXct") 
            as.POSIXct(data[[i]])
        else methods::as(data[[i]], colClasses[i])
    }
    compactRN <- TRUE
    if (missing(row.names)) {
        if (rlabp) {
            row.names <- data[[1L]]
            data <- data[-1L]
            keep <- keep[-1L]
            compactRN <- FALSE
        }
        else row.names <- .set_row_names(as.integer(nlines))
    }
    else if (is.null(row.names)) {
        row.names <- .set_row_names(as.integer(nlines))
    }
    else if (is.character(row.names)) {
        compactRN <- FALSE
        if (length(row.names) == 1L) {
            rowvar <- (1L:cols)[match(col.names, row.names, 0L) == 
                1L]
            row.names <- data[[rowvar]]
            data <- data[-rowvar]
            keep <- keep[-rowvar]
        }
    }
    else if (is.numeric(row.names) && length(row.names) == 1L) {
        compactRN <- FALSE
        rlabp <- row.names
        row.names <- data[[rlabp]]
        data <- data[-rlabp]
        keep <- keep[-rlabp]
    }
    else stop("invalid 'row.names' specification")
    data <- data[keep]
    if (is.object(row.names) || !(is.integer(row.names))) 
        row.names <- as.character(row.names)
    if (!compactRN) {
        if (length(row.names) != nlines) 
            stop("invalid 'row.names' length")
        if (anyDuplicated(row.names)) 
            stop("duplicate 'row.names' are not allowed")
        if (anyNA(row.names)) 
            stop("missing values in 'row.names' are not allowed")
    }
    class(data) <- "data.frame"
    attr(data, "row.names") <- row.names
    data
}
like image 186
Kamil Avatar answered Oct 16 '22 03:10

Kamil


This is a basic code that you can use to double check results coming from read.csv and subset:

Compare read.csv with readLines

Here you have a code that compare the result coming from read.csv with readLines (function reading line by line a file)

  library("readr" )
  library("rlist")
  file1<-file.choose() #Select your csv file1
  file2<-file.choose() #Select your csv file2

  #readLines
  input_list<-strsplit(readLines(file1),",")
  db_readLines<-data.frame(list.rbind(input_list[2:length(input_list)]))
  names(db_readLines)<-input_list[[1]]

  #readd.csv
  db_readcsv<-read.csv(file1,header = T,sep = ",")

  #Comparison
  if ((sum(db_readcsv==db_readLines)/(nrow(db_readcsv)*ncol(db_readcsv)))==1)
  {
    cat("Same data.frame")
  } else
  {
    cat("Data.frames are differents")
  }

You can use it with your csv file to compare results and verify that read.csv preserves lines order as readLines.

Compare subset with rbind + basic filtering

About the second part of the question another easy test:

data1<-read.csv(file1, header=T,sep=",")
  data2<-read.csv(file2, header=T,sep=",")
  all_data<-rbind(data1, data2)
  filtered1<-subset(all_data, foo=="a")

  filtered2<-rbind(data1[data1$foo=="a",],data2[data2$foo=="a",])

  #Comparison
  if ((sum(filtered1==filtered2)/(nrow(filtered2)*ncol(filtered2)))==1)
  {
    cat("Same data.frame")
  } else
  {
    cat("Data.frames are differents")
  }

You can include this kind of tests in your code, but obviously this is inefficient and redundant.

like image 23
Terru_theTerror Avatar answered Oct 16 '22 03:10

Terru_theTerror


It is safe to assume that all of those functions (read.csv, rbind, and subset) are guaranteed to preserve the order of your data as in the original csv.

Personally, I prefer using dplyr::filter over base::subset. As explained in this answer the two work almost identically. The main difference is that subset comes with a warning in ?subset: "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences." filter is designed to work robustly with the rest of dplyr and the tidyverse, both interactively and programmatically, and has a separate standard evaluation version filter_ for when necessary.. So perhaps filter is a safer bet, especially if you're already using the dplyr framework. The only disadvantage to filter that I've encountered is that it does not keep rownames, while subset does.

Either way, I really don't think you need to worry about rows being reshuffled. In my experience, all of these functions have always produced R objects ordered in terms of the original data. If you want to be ultra-careful, it wouldn't hurt to go with @user127649's suggestion and add a unique ID column as a back up. I'm always in favor of lazier options, but it might be worth peace of mind!

like image 23
Joy Avatar answered Oct 16 '22 02:10

Joy