How to create, structure, maintain and update data codebooks in R?

Tags: r, metadata, data-management

In the interest of replication, I like to keep a codebook with metadata for each data frame. A data codebook is:

a written or computerized list that provides a clear and comprehensive description of the variables that will be included in the database. Marczyk et al (2010)

I like to document the following attributes of a variable:

name

description (label, format, scale, etc)

source (e.g. World bank)

source media (url and date accessed, CD and ISBN, or whatever)

file name of the source data on disk (helps when merging codebooks)

notes

For example, this is what I am implementing to document the variables in data frame mydata1 with 8 variables:

code.book.mydata1 <- data.frame(
    variable.name = names(mydata1),
    label = c("Label 1",
              "State name",
              "Personal identifier",
              "Income per capita, thousands of US$, constant year 2000 prices",
              "Unique id",
              "Calendar year",
              "blah",
              "bah"),
    # length() of a data frame is its number of columns (8 here)
    source = rep("unknown", length(mydata1)),
    source_media = rep("unknown", length(mydata1)),
    filename = rep("unknown", length(mydata1)),
    notes = rep("unknown", length(mydata1))
)

I write a different codebook for each data set I read. When I merge data frames, I also merge the relevant parts of their associated codebooks to document the final database. I do this by essentially copying and pasting the code above and changing the arguments.
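One way to avoid the copy-and-paste step is to wrap the codebook construction in a small function. The sketch below is only an illustration: make_codebook and merge_codebooks are hypothetical helper names (not from any package), and the example data frames and file names are made up.

```r
# Hypothetical helper: build a codebook with one row per column of `df`.
# Each metadata argument is either a single value (recycled) or one value per column.
make_codebook <- function(df, label = NA_character_, source = "unknown",
                          source_media = "unknown", filename = "unknown",
                          notes = "") {
  stopifnot(is.data.frame(df), length(label) %in% c(1L, ncol(df)))
  data.frame(
    variable.name = names(df),
    label = label,
    source = source,
    source_media = source_media,
    filename = filename,
    notes = notes,
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: stack codebooks and keep the first entry per variable.
merge_codebooks <- function(...) {
  cb <- rbind(...)
  cb[!duplicated(cb$variable.name), ]
}

# Made-up example data
df1 <- data.frame(id = 1:3, income = c(10.5, 20.1, 15.3))
df2 <- data.frame(id = 1:3, year = c(2000L, 2001L, 2002L))

cb1 <- make_codebook(df1, label = c("Unique id", "Income per capita"),
                     filename = "income.csv")
cb2 <- make_codebook(df2, label = c("Unique id", "Calendar year"),
                     filename = "years.csv")

merged <- merge(df1, df2, by = "id")
cb_merged <- merge_codebooks(cb1, cb2)
# Keep only the variables that survived the merge
cb_merged <- cb_merged[cb_merged$variable.name %in% names(merged), ]
```

Keeping the first entry per variable is just one possible rule; if the two sources describe the key variable differently, you may prefer to resolve the conflict by hand.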


asked Mar 17 '11 00:03

Fred

3 Answers

You can add any special attribute to any R object with the attr function. E.g.:

x <- cars
attr(x,"source") <- "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."

And see the given attribute in the structure of the object:

> str(x)
'data.frame':   50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
 - attr(*, "source")= chr "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."

You can also retrieve the specified attribute with the same attr function:

> attr(x, "source")
[1] "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."

If you only add new cases to your data frame, the given attribute will not be affected (see: str(rbind(x,x))), while altering the structure will erase the given attributes (see: str(cbind(x,x))).
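Since structure-changing operations can drop custom attributes, one option is to copy them back by hand afterwards. This is a minimal sketch; copy_custom_attrs is a hypothetical helper name, and it treats names, row.names and class as the standard attributes of a data frame.

```r
# Hypothetical helper: copy every non-standard attribute of `from` onto `to`.
copy_custom_attrs <- function(from, to,
                              standard = c("names", "row.names", "class")) {
  custom <- setdiff(names(attributes(from)), standard)
  for (a in custom) {
    attr(to, a) <- attr(from, a, exact = TRUE)
  }
  to
}

x <- cars
attr(x, "source") <- "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."

y <- cbind(x, x)              # structure changed
y <- copy_custom_attrs(x, y)  # re-attach the metadata explicitly
attr(y, "source")
```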

UPDATE: based on comments

If you want to list all non-standard attributes, check the following:

setdiff(names(attributes(x)),c("names","row.names","class"))

This will list all non-standard attributes (the standard ones for data frames are names, row.names and class).

Based on that, you could write a short function to list all non-standard attributes together with their values. The following does work, though not in a neat way... You could improve it and turn it into a function :)

First, define the unique (= non-standard) attributes:

uniqueattrs <- setdiff(names(attributes(x)),c("names","row.names","class"))

And make a matrix which will hold the names and values:

attribs <- matrix(0,0,2)

Loop through the non-standard attributes and save in the matrix the names and values:

# seq_along() avoids looping over 1:0 when there are no custom attributes
for (i in seq_along(uniqueattrs)) {
    attribs <- rbind(attribs, c(uniqueattrs[i], attr(x,uniqueattrs[i])))
}

Convert the matrix to a data frame and name the columns:

attribs <- as.data.frame(attribs)
names(attribs) <- c('name', 'value')

And save in any format, e.g.:

write.csv(attribs, 'foo.csv')
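The steps above could be wrapped into a single function along these lines. This is only a sketch: custom_attrs_df is a hypothetical name, and collapsing multi-valued attributes into one string with "; " is an assumption made here so that every attribute fits in one row.

```r
# Hypothetical helper: return the non-standard attributes of `x`
# as a data frame with one row per attribute.
custom_attrs_df <- function(x, standard = c("names", "row.names", "class")) {
  custom <- setdiff(names(attributes(x)), standard)
  values <- vapply(custom, function(a) {
    # Collapse vector-valued attributes into a single string
    paste(as.character(attr(x, a, exact = TRUE)), collapse = "; ")
  }, character(1))
  data.frame(name = custom, value = unname(values), stringsAsFactors = FALSE)
}

x <- cars
attr(x, "source") <- "Ezekiel, M. (1930)"
attr(x, "notes") <- c("first note", "second note")

attribs <- custom_attrs_df(x)
attribs
```

An object without custom attributes simply gives a data frame with zero rows, so the result can always be passed on to write.csv.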

To your question about the variable labels: check the read.spss function from the foreign package, as it does exactly what you need: it saves the variable labels as an attribute of the imported data. The main idea is that an attribute can itself be a vector, a data frame or another object, so you do not need a separate attribute for every variable; make only one (e.g. named "variable.labels") and save all the information there. You can then call attr(x, "variable.labels")['foo'], where 'foo' stands for the required variable name. But check the function cited above, and the attributes of the imported data frames, for more details.

I hope this helps you write the required functions much more neatly than I did above! :)
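The same idea can be applied by hand to any data frame. A small sketch, where var_label is a hypothetical helper name and the labels are the variable descriptions given in ?cars:

```r
x <- cars

# One attribute holding a named character vector: one label per variable
attr(x, "variable.labels") <- c(speed = "Speed (mph)",
                                dist  = "Stopping distance (ft)")

# Hypothetical helper: look up the label of a single variable
var_label <- function(df, var) {
  unname(attr(df, "variable.labels")[var])
}

var_label(x, "dist")

# The labels can also be turned into a simple codebook
code.book <- data.frame(
  variable.name = names(x),
  label = unname(attr(x, "variable.labels")[names(x)]),
  stringsAsFactors = FALSE
)
```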

answered Nov 19 '22 06:11

daroczig

A more advanced version would be to use S4 classes. For example, in Bioconductor the ExpressionSet class is used to store microarray data together with its associated experimental metadata.

The MIAME object described in Section 4.4 looks very similar to what you are after:

experimentData <- new("MIAME", name = "Pierre Fermat",
          lab = "Francis Galton Lab", contact = "pfermat@lab.not.exist",
          title = "Smoking-Cancer Experiment", abstract = "An example ExpressionSet",
          url = "www.lab.not.exist", other = list(notes = "Created from text files"))
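If you do not want to depend on Bioconductor, a similar container can be sketched with the methods package that ships with R. The class name DocumentedData and its slots are hypothetical and only illustrate the idea; the validity check (every column needs a codebook entry) is an assumption about what you might want to enforce.

```r
library(methods)

# Hypothetical S4 class: a data frame bundled with its codebook and notes
setClass("DocumentedData",
  slots = c(data = "data.frame", codebook = "data.frame", notes = "character"),
  validity = function(object) {
    # Every column of the data must be described in the codebook
    missing_vars <- setdiff(names(object@data), object@codebook$variable.name)
    if (length(missing_vars) > 0) {
      paste("no codebook entry for:", paste(missing_vars, collapse = ", "))
    } else {
      TRUE
    }
  }
)

cb <- data.frame(
  variable.name = c("speed", "dist"),
  label = c("Speed (mph)", "Stopping distance (ft)"),
  stringsAsFactors = FALSE
)

dd <- new("DocumentedData", data = cars, codebook = cb,
          notes = "Created from the built-in cars data set")

dd@codebook
```

Because new() runs the validity function, a codebook that misses a variable is rejected at construction time rather than discovered later.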

answered Nov 19 '22 05:11

csgillespie

The comment() function might be useful here. It can set and query a comment attribute on an object, and has the advantage over normal attributes of not being printed.

dat <- data.frame(A = 1:5, B = 1:5, C = 1:5)
comment(dat$A) <- "Label 1"
comment(dat$B) <- "Label 2"
comment(dat$C) <- "Label 3"
comment(dat) <- "data source is, sampled on 1-Jan-2011"

which gives:

> dat
  A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
> dat$A
[1] 1 2 3 4 5
> comment(dat$A)
[1] "Label 1"
> comment(dat)
[1] "data source is, sampled on 1-Jan-2011"
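Comments set this way can be collected back into a codebook-style table. A minimal sketch, assuming that a column without a comment should get a missing label:

```r
dat <- data.frame(A = 1:5, B = 1:5, C = 1:5)
comment(dat$A) <- "Label 1"
comment(dat$B) <- "Label 2"
# Column C is deliberately left without a comment

# Pull the comment of every column; use NA where none was set
labels <- vapply(dat, function(col) {
  lab <- comment(col)
  if (is.null(lab)) NA_character_ else paste(lab, collapse = " ")
}, character(1))

code.book <- data.frame(
  variable.name = names(dat),
  label = unname(labels),
  stringsAsFactors = FALSE
)
code.book
```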

Example of merging:

> dat2 <- data.frame(D = 1:5)
> comment(dat2$D) <- "Label 4"
> dat3 <- cbind(dat, dat2)
> comment(dat3$D)
[1] "Label 4"

but that loses the comment on dat:

> comment(dat3)
NULL

so those sorts of operations would need to be handled explicitly. To truly do what you want, you'll probably need to write special versions of the functions you use, so that they maintain the comments/metadata during extraction/merge operations. Alternatively, you might want to look into producing your own class of objects, say a list with a data frame and other components holding the metadata, and then write methods for the functions you want that preserve the metadata.

An example along these lines is the zoo package, which creates a time-series object that carries its ordering and time/date information alongside the data, but still works like a normal object from the point of view of subsetting etc., because the authors have provided methods for functions like [.
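A "special version" of cbind could be as simple as the following sketch. cbind_keep_comment is a hypothetical name, and carrying over the comment of the first argument only is an assumption; you may want to combine the comments of all inputs instead.

```r
# Hypothetical wrapper: cbind() that carries over the comment of its first argument
cbind_keep_comment <- function(first, ...) {
  out <- cbind(first, ...)
  comment(out) <- comment(first)
  out
}

dat <- data.frame(A = 1:5, B = 1:5, C = 1:5)
comment(dat) <- "data source is, sampled on 1-Jan-2011"
dat2 <- data.frame(D = 1:5)

dat3 <- cbind_keep_comment(dat, dat2)
comment(dat3)
```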
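The list-based class idea could look roughly like this. It is a sketch under stated assumptions, not how zoo is implemented: the class name documented, its components, and the restriction to two-index subsetting (obj[i, j]) are all choices made here for illustration.

```r
# Hypothetical constructor: a list holding the data plus its metadata
documented <- function(data, labels, source = NA_character_) {
  stopifnot(is.data.frame(data), setequal(names(labels), names(data)))
  structure(
    list(data = data, labels = labels[names(data)], source = source),
    class = "documented"
  )
}

# Subsetting method for obj[i, j]: subset the data and keep the metadata in sync
`[.documented` <- function(x, i, j) {
  d <- x$data
  if (!missing(i)) d <- d[i, , drop = FALSE]
  if (!missing(j)) d <- d[, j, drop = FALSE]
  x$data <- d
  x$labels <- x$labels[names(d)]  # drop labels of removed columns
  x
}

obj <- documented(
  data.frame(A = 1:5, B = 1:5, C = 1:5),
  labels = c(A = "Label 1", B = "Label 2", C = "Label 3"),
  source = "sampled on 1-Jan-2011"
)

sub  <- obj[1:2, c("A", "C")]  # rows and columns
rows <- obj[4:5, ]             # rows only
sub$labels
```

The same pattern extends to other generics (print, merge-like helpers and so on): each method does the work on the data component and then updates or copies the metadata components.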

answered Nov 19 '22 05:11

Gavin Simpson
