I'm trying to figure out the arguments for gather in the tidyr package.
I looked at the documentation, and the syntax looks like:
gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
There is an example in the help files:
stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)
gather(stocks, stock, price, -time)
I'm curious about the last line: gather(stocks, stock, price, -time)
Here, stocks is clearly the data we want to modify, which is fine.
So I can read that stock and price are the names for the key-value pair, but how does this function decide which columns to use to create this key-value pair? The original dataframe looks like this:
      time           X          Y          Z
2009-01-01  1.10177950 -1.1926213 -7.4149618
2009-01-02  0.75578151 -4.3705737 -0.3117843
2009-01-03 -0.23823356 -1.3497319  3.8742654
2009-01-04  0.98744470 -4.2381224  0.7397038
2009-01-05  0.74139013 -2.5303960 -5.5197743
I don't see any indication that we should use any combination of X, Y or Z. When I'm using this function, I feel like I'm just choosing names for what I want the columns in my long-formatted dataframe to be, and praying that gather magically works. Come to think of it, I feel the same way when I use melt.
Does gather look at the column's type? How does it map from wide to long?
EDIT
Great answer and great discussion below. Anyone else wanting more info on the philosophy and use of the tidyr package should definitely read this paper, although the vignette doesn't explain the syntax.
gather() collects multiple columns and converts them into key-value pairs. The names of the gathered columns are repeated down the key column, once for each original row, and the corresponding cell values go into the value column.
spread() turns a pair of key:value columns into a set of tidy columns. To use spread(), pass it the name of a data frame, then the name of the key column in the data frame, and then the name of the value column.
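As a rough illustration of the argument order described above, here is a small sketch that gathers a toy data frame and then spreads it back. The data frame `wide` and its values are made up for the example; only gather() and spread() come from tidyr.

```r
library(tidyr)

# Hypothetical toy data: one id column (time) and two measured columns.
wide <- data.frame(
  time = as.Date("2009-01-01") + 0:2,
  X = c(1, 2, 3),
  Y = c(4, 5, 6)
)

# Wide to long: column names go into `stock`, cell values into `price`.
long <- gather(wide, stock, price, -time)

# Long to wide: data frame first, then the key column, then the value column.
back <- spread(long, stock, price)
back
```

In this sketch, each distinct value in the key column (stock) becomes a column of its own again.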
tidyr provides three main functions for tidying your messy data: gather(), separate() and spread(). Sometimes two variables are clumped together in one column. separate() allows you to tease them apart (extract() works similarly but uses regexp groups instead of a splitting pattern or position).
In "tidyr", you specify the measure variables for gather in the ... argument. This is a little bit different conceptually from melt, where many examples (even many answers here on SO) would show the use of the id.vars argument (with the assumption that anything that is not specified as an ID is a measurement).
The ... argument can also take a column name prefixed with -, as in the example you have shown. This basically says to "gather all of the columns except for this one". Another shorthand approach in gather is specifying a range of columns by using the colon, for example, gather(stocks, stock, price, X:Z).
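To make the selection rules concrete, here is a small sketch that gathers the same columns three ways: by exclusion, by range, and by naming each column. It reuses the stocks example from the question; the seed and the object names (by_exclusion, by_range, by_name) are just for illustration.

```r
library(tidyr)

set.seed(1)  # arbitrary seed so the random data is reproducible
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

# Three ways of telling gather() which columns to stack:
by_exclusion <- gather(stocks, stock, price, -time)    # everything except time
by_range     <- gather(stocks, stock, price, X:Z)      # the range X through Z
by_name      <- gather(stocks, stock, price, X, Y, Z)  # each column by name

# All three should give the same long data frame.
all.equal(by_exclusion, by_range)
all.equal(by_exclusion, by_name)
```

Column type plays no part in the selection here: the columns are chosen purely by what is passed in the ... argument.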
You can compare gather with melt by looking at the code for the function. Here are the first few lines:
> tidyr:::gather_.data.frame
function (data, key_col, value_col, gather_cols, na.rm = FALSE,
    convert = FALSE)
{
    data2 <- reshape2::melt(data, measure.vars = gather_cols,
        variable.name = key_col, value.name = value_col, na.rm = na.rm)
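For a side-by-side view of that correspondence, here is a minimal sketch that runs the same reshape through gather and through melt. It assumes the reshape2 package is installed; the seed and the object names (long_gather, long_melt) are illustrative only.

```r
library(tidyr)

set.seed(1)  # arbitrary seed so the random data is reproducible
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

# gather: name the key and value columns, then select what to gather.
long_gather <- gather(stocks, stock, price, -time)

# melt: name the id column instead; everything else is treated as measured.
long_melt <- reshape2::melt(
  stocks,
  id.vars = "time",
  variable.name = "stock",
  value.name = "price"
)

# melt stores the key as a factor; convert it before comparing.
long_melt$stock <- as.character(long_melt$stock)
all.equal(long_gather$price, long_melt$price)
```

The two calls describe the same operation from opposite directions: gather lists the columns to stack, while melt in this form lists the columns to keep.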