Referencing a dataframe recursively

Tags: r, subset

Is there a way to have a dataframe refer to itself?

I find myself spending a lot of time writing things like

y$Category1[is.na(y$Category1)] <- NULL

which are hard to read and feel like a lot of slow, repetitive typing. I wondered if there was something along the lines of

y$Category1[is.na(self)] <- NULL

that I could use instead.

Thanks

Tahnoon Pasha asked Nov 28 '12 22:11


1 Answer

What a great question. Unfortunately, as @user295691 pointed out in the comments, the issue is that the vector is referenced twice: once as the object being indexed and once as the subject of the condition. It does appear impossible to avoid the double reference.

numericVector[cond(numericVector)] <- newVal
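As a side note, base R's replacement-function syntax can at least hide the second reference. This is a minimal sketch and not the approach taken in the rest of this answer; the name `whereNA<-` is made up for illustration. R still expands the call into an assignment that mentions the vector twice, it just writes that expansion for you.

```r
# Hypothetical replacement function; `whereNA<-` is an invented name.
# R rewrites  whereNA(x) <- v  as  x <- `whereNA<-`(x, value = v).
`whereNA<-` <- function(x, value) {
  x[is.na(x)] <- value   # the double reference lives here, once
  x
}

y <- data.frame(Category1 = c(1, NA, 3, NA))
whereNA(y$Category1) <- 0   # the column is typed only once
y$Category1
```

The last argument of a replacement function has to be called value, which is why the signature looks the way it does.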

What I think we can do is have a nice and neat function so that instead of

 # this  
 y$Category1[is.na(y$Category1)] <- list(NULL)

 # we can have this: 
 NAtoNULL(y$Category1)

For example, the following functions wrap selfAssign() (below):

NAtoNULL(obj)      # Replaces NA values in obj with NULL.
NAtoVal(obj, val)  # Replaces NA values in obj with val.
selfReplace(obj, toReplace, val)  # Replaces toReplace values in obj with val

# and selfAssign can be called directly, but I'm not sure there would be a good reason to
selfAssign(obj, ind, val)  # equivalent to obj[ind] <- val

Example:

# sample df
df <- structure(list(subj=c("A",NA,"C","D","E",NA,"G"),temp=c(111L,112L,NA,114L,115L,116L,NA),size=c(0.7133,NA,0.7457,NA,0.0487,NA,0.8481)),.Names=c("subj","temp","size"),row.names=c(NA,-7L),class="data.frame")

df
  subj temp   size
1    A  111 0.7133
2 <NA>  112     NA
3    C   NA 0.7457
4    D  114     NA
5    E  115 0.0487
6 <NA>  116     NA
7    G   NA 0.8481

# Make some replacements
NAtoNULL(df$size)    # Replace all NA's in df$size with NULL's
NAtoVal(df$temp, 0)  # Replace all NA's in df$temp with 0's
NAtoVal(df$subj, c("B", "E"))   # Replace all NA's in df$subj with alternating "B" and "E" 

# the modified df is now:  
df

  subj temp   size
1    A  111 0.7133
2    B  112   NULL
3    C    0 0.7457
4    D  114   NULL
5    E  115 0.0487
6    E  116   NULL
7    G    0 0.8481


# replace the 0's in temp with NA
selfReplace(df$temp, 0, NA)

# replace NULL's in size with 1's
selfReplace(df$size, NULL, 1)

# replace all "E"'s in subj with alternating c("E", "F")
selfReplace(df$subj, c("E"), c("E", "F"))

df

  subj temp   size
1    A  111 0.7133
2    B  112      1
3    C   NA 0.7457
4    D  114      1
5    E  115 0.0487
6    F  116      1
7    G   NA 0.8481

Right now this works for vectors, but will fail with *apply. I would love to get it working fully, especially with plyr.


FUNCTIONS

The code for the functions is below.

An important point: this does not (yet!) work with *apply / plyr.
I believe it can be made to work by modifying the value of n and adjusting sys.parent(.) in match.call(), but it still needs some fiddling. Any suggestions / modifications would be greatly appreciated.
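To see why n matters, here is a small sketch of how parent.frame(n) behaves when a wrapper sits between the caller and the function doing the assignment. All function names here (setMarker, wrapper, caller, callerNoBump) are hypothetical and exist only for this demonstration; it says nothing specific about how many frames *apply or plyr add.

```r
# Hypothetical demo: create a variable `marker` n frames up the call stack.
setMarker <- function(n = 1) {
  target <- parent.frame(n)          # environment n calls above this one
  assign("marker", TRUE, envir = target)
  invisible(NULL)
}

# A wrapper adds one frame, so it passes n + 1 (as NAtoNULL etc. do).
wrapper <- function(n = 1) setMarker(n = n + 1)

caller <- function() {
  wrapper()
  exists("marker", envir = environment(), inherits = FALSE)
}

callerNoBump <- function() {
  # n is not increased, so marker lands in the anonymous wrapper's frame
  (function() setMarker(n = 1))()
  exists("marker", envir = environment(), inherits = FALSE)
}

caller()        # TRUE: marker was created in caller's own frame
callerNoBump()  # FALSE: marker was created one frame too low
```

Every extra function layer between the user's code and the assignment shifts the target frame by one, which is the adjustment the wrappers make with n=n+1.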

selfAssign <- function(self, ind, val, n=1, silent=FALSE) {
## assigns val to self[ind] in environment parent.frame(n)
## self should be a vector.  Currently will not work for matrices or data frames

  ## GRAB THE CORRECT MATCH CALL
  #--------------------------------------
      # if nested function, match.call appropriately
      if (class(match.call()) == "call") {
        mc <- (match.call(call=sys.call(sys.parent(1))))
      } else {
        mc <- match.call()
      }

      # needed in case self is complex (ie df$name)
      mc2 <- paste(as.expression(mc[[2]]))


  ## CLEAN UP ARGUMENT VALUES
  #--------------------------------------
      # replace logical indices with numeric indices
      if (is.logical(ind))
        ind <- which(ind) 

      # if no indices will be selected, stop here
      if(identical(ind, integer(0)) || is.null(ind)) {
        if(!silent) warning("No indices selected")
        return()
      }

      # if val is a string, we need to wrap it in quotes
      if (is.character(val))
        val <- paste('"', val, '"', sep="")

      # val cannot directly be NULL, must be list(NULL)
      if(is.null(val))
        val <- "list(NULL)"


  ## CREATE EXPRESSIONS AND EVAL THEM
  #--------------------------------------
     # create expressions to evaluate
     ret <- paste0("'[['(", mc2, ", ", ind, ") <- ", val)

     # evaluate in parent.frame(n)
     eval(parse(text=ret), envir=parent.frame(n))
}
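
For comparison, the same "assign in the caller's environment" idea can be sketched without pasting text together, by capturing the unevaluated argument with substitute() and building the call with bquote(). This is an alternative sketch under my own assumptions, not a drop-in replacement for selfAssign: the name assignInCaller is hypothetical, it only looks one frame up, and it uses `[` where selfAssign uses `[[`.

```r
# Hypothetical helper: evaluates  target[ind] <- val  in the caller's frame.
assignInCaller <- function(target, ind, val) {
  targetExpr <- substitute(target)                      # e.g. quote(df$temp)
  assignment <- bquote(.(targetExpr)[.(ind)] <- .(val)) # build the call
  eval(assignment, envir = parent.frame())
  invisible(NULL)
}

df <- data.frame(temp = c(111L, NA, 113L))
assignInCaller(df$temp, is.na(df$temp), 0L)
df$temp   # the NA has been replaced by 0
```

Building a call object avoids having to quote character values by hand, but the vector is still named twice at the call site.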


NAtoNULL <- function(obj, n=1) {
  selfAssign(match.call()[[2]], is.na(obj), NULL, n=n+1)
}

NAtoVal <- function(obj, val, n=1) {
  selfAssign(match.call()[[2]], is.na(obj), val, n=n+1)  
}

selfReplace <- function(obj, toReplace, val, n=1) {
## replaces occurrences of toReplace within obj with val

  # determine ind based on value & length of toReplace
  # TODO:  this will not work properly for data frames, but neither will selfAssign, yet.
  if (is.null(toReplace)) {
    ind <- sapply(obj, function(x) is.null(x[[1]]))
  } else if (length(toReplace) == 1 && is.na(toReplace)) {  # length check: if() needs a single TRUE/FALSE
    ind <- is.na(obj)
  } else {
    if (length(obj) > 1) {    # note, this won't work for data frames
          ind <- obj %in% toReplace
    } else {
      ind <- obj == toReplace
    }
  } 

  selfAssign(match.call()[[2]], ind, val, n=n+1)  
}
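
A quick illustration of why selfReplace treats NA separately and why the choice between %in% and == matters when building ind. The vector x and the variable names are made up for this example.

```r
x <- c("A", NA, "E", "E")

naByEquals <- x == NA      # NA NA NA NA: comparing with NA never yields TRUE
naByIsNa   <- is.na(x)     # FALSE TRUE FALSE FALSE

byEquals <- x == "E"       # FALSE NA TRUE TRUE: the NA element propagates
byIn     <- x %in% "E"     # FALSE FALSE TRUE TRUE: %in% never returns NA

which(byEquals)            # 3 4: which() drops the NA, as selfAssign relies on
```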



  ## THIS SHOULD GO INSIDE NAtoNULL, NAtoVal etc. 
  ## (unfinished: x1 is not defined yet, so the check is commented out)

  # todo: modify for use with *apply
  # if(substr(paste(as.expression(x1)), 1, 10) == "FUN(obj = ") {
  #     # PASS.  This should identify when the call is coming from *apply. 
  #     # In such a case, n needs to increase by 1 for apply & lapply, and by 2 for sapply.
  #     # I'm not sure what increase is required for plyr functions.
  # }

Ricardo Saporta answered Sep 27 '22 00:09