
Closures as solution to data merging idiom

I'm trying to wrap my head around closures, and I think I've found a case where they might be helpful.

I have the following pieces to work with:

  • A set of regular expressions designed to clean state names, housed in a function
  • A data.frame with state names (of the standardized form that the function above creates) and state ID codes, to link the two (the "merge map")

The idea is, given some data.frame with sloppy state names (is the capital listed as "Washington, D.C.", "washington DC", "District of Columbia", etc.?), to have a single function return the same data.frame with the state name column removed and only the state ID codes remaining. Then subsequent merges can happen consistently.

I can do this in any number of ways, but one way that seems particularly elegant would be to house the merge map, the regular expressions, and the code that processes everything inside a closure (following the idea that a closure is a function with data).

Question 1: Is this a reasonable idea?

Question 2: If so, how do I do it in R?

Here's a stupid simple clean state names function that works on the example data:

cleanStateNames <- function(x) {
  x <- tolower(x)
  x[grepl("columbia",x)] <- "DC"
  x  # return the cleaned names
}

Here's some example data that the eventual function will be run on:

dat <- structure(list(state = c("Alabama", "Alaska", "Arizona", "Arkansas", 
"California", "Colorado", "Connecticut", "Delaware", "District of Columbia", 
"Florida"), pop08 = structure(c(29L, 44L, 40L, 18L, 25L, 30L, 
22L, 48L, 36L, 13L), .Label = c("1,050,788", "1,288,198", "1,315,809", 
"1,316,456", "1,523,816", "1,783,432", "1,814,468", "1,984,356", 
"10,003,422", "11,485,910", "12,448,279", "12,901,563", "18,328,340", 
"19,490,297", "2,600,167", "2,736,424", "2,802,134", "2,855,390", 
"2,938,618", "24,326,974", "3,002,555", "3,501,252", "3,642,361", 
"3,790,060", "36,756,666", "4,269,245", "4,410,796", "4,479,800", 
"4,661,900", "4,939,456", "5,220,393", "5,627,967", "5,633,597", 
"5,911,605", "532,668", "591,833", "6,214,888", "6,376,792", 
"6,497,967", "6,500,180", "6,549,224", "621,270", "641,481", 
"686,293", "7,769,089", "8,682,661", "804,194", "873,092", "9,222,414", 
"9,685,744", "967,440"), class = "factor")), .Names = c("state", 
"pop08"), row.names = c(NA, 10L), class = "data.frame")

And a sample merge map (the actual one links FIPS codes to states, so it can't be trivially generated):

merge_map <- data.frame(state=dat$state, id=seq(10) )

EDIT Building off of crippledlambda's answer below, here's an attempt at the function:

prepForMerge <- local({
  merge_map <- structure(list(state = c("alabama", "alaska", "arizona", "arkansas",  "california", "colorado", "connecticut", "delaware", "DC", "florida" ), id = 1:10), .Names = c("state", "id"), row.names = c(NA, -10L ), class = "data.frame")
  list(
    replace_merge_map=function(new_merge_map) {
      merge_map <<- new_merge_map
    },
    show_merge_map=function() {
      merge_map
    },
    return_prepped_data.frame=function(dat) {
      dat$state <- cleanStateNames(dat$state)
      dat <- merge(dat,merge_map)
      dat <- subset(dat,select=c(-state))
      dat
    }
  )
})

> prepForMerge$return_prepped_data.frame(dat)
        pop08 id
1   4,661,900  1
2     686,293  2
3   6,500,180  3
4   2,855,390  4
5  36,756,666  5
6   4,939,456  6
7   3,501,252  7
8     591,833  9
9     873,092  8
10 18,328,340 10

Two problems remain before I'd consider this question solved:

  1. Calling prepForMerge$return_prepped_data.frame(dat) is painful each time. Any way to have a default function such that I could just call prepForMerge(dat)? I'm guessing not given how it's implemented, but perhaps there's at least a convention for the default fxn....

  2. How do I avoid mixing the data and code in the merge_map definition? Ideally I'd clean merge_map elsewhere, then just grab it inside the closure and store that.

Ari B. Friedman asked Oct 17 '11 17:10


1 Answer

I may be missing the point of your question, but this is one way in which you can use a closure:

> replaceStateNames <- local({
+   statenames <- c("Alabama", "Alaska", "Arizona", "Arkansas", 
+                   "California", "Colorado", "Connecticut", "Delaware",
+                   "District of Columbia", "Florida")
+   function(patt,newtext) {
+     statenames <- tolower(statenames)
+     statenames[grepl(patt,statenames)] <- newtext
+     statenames
+   }
+ })
> replaceStateNames("columbia","DC")
 [1] "alabama"     "alaska"      "arizona"     "arkansas"    "california" 
 [6] "colorado"    "connecticut" "delaware"    "DC"          "florida"    
> replaceStateNames("alaska","palincountry")
 [1] "alabama"              "palincountry"         "arizona"             
 [4] "arkansas"             "california"           "colorado"            
 [7] "connecticut"          "delaware"             "district of columbia"
[10] "florida"             
> replaceStateNames("florida","jebbushland")
 [1] "alabama"              "alaska"               "arizona"             
 [4] "arkansas"             "california"           "colorado"            
 [7] "connecticut"          "delaware"             "district of columbia"
[10] "jebbushland"    

But to generalize, you can replace statenames with your data frame definition, and return a function (or list of functions) which uses this data frame without having to pass it as an argument to the function call. Example (but note I've used the ignore.case=TRUE argument in grepl):

> replaceStateNames <- local({
+   statenames <- c("Alabama", "Alaska", "Arizona", "Arkansas", 
+                   "California", "Colorado", "Connecticut", "Delaware",
+                   "District of Columbia", "Florida")
+   list(justreturn=function(patt,newtext) {
+     statenames[grepl(patt,statenames,ignore.case=TRUE)] <- newtext
+     statenames
+   },reassign=function(patt,newtext) {
+     statenames <<- replace(statenames,grepl(patt,statenames,ignore.case=TRUE),newtext)
+     statenames
+   })
+ })

Just like the first example:

> replaceStateNames$justreturn("columbia","DC")
 [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California" 
 [6] "Colorado"    "Connecticut" "Delaware"    "DC"          "Florida"    

Just returns the lexically-scoped value of statenames to check that the original values are unchanged:

> replaceStateNames$justreturn("shouldnotmatch","anythinghere")
 [1] "Alabama"              "Alaska"               "Arizona"             
 [4] "Arkansas"             "California"           "Colorado"            
 [7] "Connecticut"          "Delaware"             "District of Columbia"
[10] "Florida"             

Do the same thing, but make the change "permanent":

> replaceStateNames$reassign("columbia","DC")
 [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California" 
 [6] "Colorado"    "Connecticut" "Delaware"    "DC"          "Florida"    

And note that the value of statenames attached to these functions has changed.

> replaceStateNames$justreturn("shouldnotmatch","anythinghere")
 [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California" 
 [6] "Colorado"    "Connecticut" "Delaware"    "DC"          "Florida"    
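
To see where the "permanent" change lives, it can help to look inside the environment attached to the functions. The following is a minimal sketch, assuming a cut-down version of the example above; the name tracker and the two-element vector are made up for illustration.

```r
## Hypothetical, cut-down version of the example above.
tracker <- local({
  statenames <- c("Alabama", "District of Columbia")
  list(
    reassign = function(patt, newtext) {
      ## <<- assigns to statenames in the enclosing (local) environment
      statenames <<- replace(statenames,
                             grepl(patt, statenames, ignore.case = TRUE),
                             newtext)
      invisible(statenames)
    }
  )
})

## The environment created by local() is the enclosure of the function
env <- environment(tracker$reassign)
before <- get("statenames", envir = env)
tracker$reassign("columbia", "DC")
after <- get("statenames", envir = env)
```

Here before holds the original names and after holds the modified ones, so the state is kept in the closure's environment rather than in the calling workspace.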

In any case, you can replace statenames with a data frame, and these simple functions with a "merge map" or any other mapping you desire.
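
As a rough sketch of that generalization, one option is a function that takes the merge map as an argument and returns a single function, so the result can be called directly on a data frame and the map can be cleaned elsewhere. The names makePrepForMerge, clean, sloppy and the tiny two-row map below are assumptions for illustration, not anything from the question.

```r
## Hypothetical factory: builds a closure around a merge map.
makePrepForMerge <- function(merge_map, clean = tolower) {
  force(merge_map)  # evaluate now, so the closure keeps this value
  force(clean)
  function(dat) {
    dat$state <- clean(dat$state)               # standardize the names
    out <- merge(dat, merge_map, by = "state")  # attach the id codes
    out[, setdiff(names(out), "state"), drop = FALSE]  # drop the name column
  }
}

## Toy data, assumed for the example
merge_map <- data.frame(state = c("alabama", "alaska"), id = 1:2)
prepForMerge <- makePrepForMerge(merge_map)

sloppy <- data.frame(state = c("Alaska", "ALABAMA"), pop = c(10, 20))
prepped <- prepForMerge(sloppy)
```

A real cleaning function such as cleanStateNames from the question could be passed as the clean argument in place of tolower.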


Speaking of "merge", is this what you're looking for? An implementation of the first ?merge example using a closure:

> authors <- data.frame(surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
+                       nationality = c("US", "Australia", "US", "UK", "Australia"),
+                       deceased = c("yes", rep("no", 4)))
> books <- data.frame(name = I(c("Tukey", "Venables", "Tierney",
+                       "Ripley", "Ripley", "McNeil", "R Core")),
+                     title = c("Exploratory Data Analysis",
+                       "Modern Applied Statistics ...",
+                       "LISP-STAT",
+                       "Spatial Statistics", "Stochastic Simulation",
+                       "Interactive Data Analysis",
+                       "An Introduction to R"),
+                     other.author = c(NA, "Ripley", NA, NA, NA, NA,
+                       "Venables & Smith"))
> mergewithauthors <- with(list(authors=authors),function(books) 
+   merge(authors, books, by.x = "surname", by.y = "name"))
> mergewithauthors(books)
   surname nationality deceased                         title other.author
1   McNeil   Australia       no     Interactive Data Analysis         <NA>
2   Ripley          UK       no            Spatial Statistics         <NA>
3   Ripley          UK       no         Stochastic Simulation         <NA>
4  Tierney          US       no                     LISP-STAT         <NA>
5    Tukey          US      yes     Exploratory Data Analysis         <NA>
6 Venables   Australia       no Modern Applied Statistics ...       Ripley

Edit 2

To read a file into an object which will be lexically bound, you can either do

fn <- local({
  data <- read.csv("filename.csv")
  function(...) {
    ## body that uses data
  }
})

or

fn <- with(list(data=read.csv("filename.csv")),
     function(...) {
       ## body that uses data
     })

or even

fn <- with(local(data <- read.csv("filename.csv")),
     function(...) {
       ## body; here the columns of the file are bound by name
     })

and so on. (I assume the function(...) will have to do with your "merge_map"). You can also use evalq in place of local. To "bring in" objects residing in the global space (or enclosing environment), you can just do the following

globalobj <- value      ## could be from read.csv()
fn <- local({
  localobj <- globalobj ## if globalobj is not locally defined, 
                        ## R will look in enclosing environment
                        ## in this case, the globalenv()
  function(...) {
    ## body that uses localobj
  }
})
then modifying globalobj later will not change localobj attached to the function (since almost(?) everything in R follows pass-by-value semantics). You can also use with instead of local as shown in examples above.
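
A small sketch of that behaviour, assuming a toy named vector in place of real data (fn_live is a made-up name added only for contrast):

```r
globalobj <- c(a = 1, b = 2)

fn <- local({
  localobj <- globalobj        # value is captured when local() runs
  function(key) localobj[[key]]
})

## For contrast: this function looks up globalobj each time it is called
fn_live <- function(key) globalobj[[key]]

globalobj[["a"]] <- 99         # modify the global after fn was created

captured <- fn("a")            # uses the copy held in the closure
live <- fn_live("a")           # sees the modified global
```

After the modification, captured is still 1 while live is 99.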

hatmatrix answered Sep 30 '22 08:09
