I have a dataframe with panel structure: 2 observations for each unit from two years:
library(tidyr)
mydf <- data.frame(
id = rep(1:3, rep(2,3)),
year = rep(c(2012, 2013), 3),
value = runif(6)
)
mydf
# id year value
#1 1 2012 0.09668064
#2 1 2013 0.62739399
#3 2 2012 0.45618433
#4 2 2013 0.60347152
#5 3 2012 0.84537624
#6 3 2013 0.33466030
I would like to reshape this data to wide format which can be done easily with tidyr::spread
. However, as the values of the year
variable are numbers, the names of my new variables become numbers as well which makes its further use harder.
spread(mydf, year, value)
# id 2012 2013
#1 1 0.09668064 0.6273940
#2 2 0.45618433 0.6034715
#3 3 0.84537624 0.3346603
I know I can easily rename the columns. However, if I would like to reshape within a chain with other operations, it becomes inconvenient. E.g. the following line obviously does not make sense.
library(dplyr)
mydf %>% spread(year, value) %>% filter(2012 > 0.5)
The following works but is not that concise:
tmp <- spread(mydf, year, value)
names(tmp) <- c("id", "y2012", "y2013")
filter(tmp, y2012 > 0.5)
Any idea how I can change the new variable names within spread
?
spread() returns a copy of your data set that has had the key and value columns removed. In their place, spread() adds a new column for each unique value of the key column. These unique values will form the column names of the new columns.
tidyr is new package that makes it easy to “tidy” your data. Tidy data is data that's easy to work with: it's easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R's hundreds of modelling packages). The two most important properties of tidy data are: Each column is a variable.
The spread() function from the tidyr library can be helpful to spread a key-value pair across different columns. This function also helps reshape the data from long format to wide format. This function works exactly opposite of gather(). So, it is used contrarily.
I know some years has passed since this question was originally asked, but for posterity I want to also highlight the sep
argument of spread
. When not NULL
, it will be used as separator between the key name and values:
mydf %>%
spread(key = year, value = value, sep = "")
# id year2012 year2013
#1 1 0.15608322 0.6886531
#2 2 0.04598124 0.0792947
#3 3 0.16835445 0.1744542
This is not exactly as wanted in the question, but sufficient for my purposes. See ?spread
.
Update with tidyr 1.0.0: tidyr 1.0.0 have now introduced pivot_wider
(and pivot_longer
) which allows for more control in this respect with the arguments names_sep
and names_prefix
. So now the call would be:
mydf %>%
pivot_wider(names_from = year, values_from = value,
names_prefix = "year")
# # A tibble: 3 x 3
# id year2012 year2013
# <int> <dbl> <dbl>
# 1 1 0.347 0.388
# 2 2 0.565 0.924
# 3 3 0.406 0.296
To get exactly what was originally wanted (prefixing "y" only) you can of course now get that directly by simply having names_prefix = "y"
.
The names_sep
is used in case you gather over multiple columns as demonstrated below where I have added quarters to the data:
# Add quarters to data
mydf2 <- data.frame(
id = rep(1:3, each = 8),
year = rep(rep(c(2012, 2013), each = 4), 3),
quarter = rep(c("Q1","Q2","Q3","Q4"), 3),
value = runif(24)
)
head(mydf2)
# id year quarter value
# 1 1 2012 Q1 0.8651470
# 2 1 2012 Q2 0.3944423
# 3 1 2012 Q3 0.4580580
# 4 1 2012 Q4 0.2902604
# 5 1 2013 Q1 0.4751588
# 6 1 2013 Q2 0.6851755
mydf2 %>%
pivot_wider(names_from = c(year, quarter), values_from = value,
names_sep = "_", names_prefix = "y")
# # A tibble: 3 x 9
# id y2012_Q1 y2012_Q2 y2012_Q3 y2012_Q4 y2013_Q1 y2013_Q2 y2013_Q3 y2013_Q4
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 0.865 0.394 0.458 0.290 0.475 0.685 0.213 0.920
# 2 2 0.566 0.614 0.509 0.0515 0.974 0.916 0.681 0.509
# 3 3 0.968 0.615 0.670 0.748 0.723 0.996 0.247 0.449
You can use backticks
for column names starting with numbers and filter
should work as expected
mydf %>%
spread(year, value) %>%
filter(`2012` > 0.5)
# id 2012 2013
#1 3 0.8453762 0.3346603
Or another option would be using unite
to join two columns to a single columnn after creating a second column 'year1' with string 'y'.
mydf %>%
mutate(year1='y') %>%
unite(yearN, year1, year) %>%
spread(yearN, value) %>%
filter(y_2012 > 0.5)
# id y_2012 y_2013
#1 3 0.8453762 0.3346603
Even we can change the 'year' column within mutate
by using paste
mydf %>%
mutate(year=paste('y', year, sep="_")) %>%
spread(year, value) %>%
filter(y_2012 > 0.5)
Another option is to use the setNames()
function as the next thing in the pipe:
mydf %>%
spread(mydf, year, value) %>%
setNames( c("id", "y2012", "y2013") ) %>%
filter(y2012 > 0.5)
The only problem using setNames is that you have to know exactly what your columns will be when you spread()
them. Most of the time, that's not a problem, particularly if you're working semi-interactively.
But if you're missing a key/value pair in your original data, there's a chance it won't show up as a column, and you can end up naming your columns incorrectly without even knowing it. Granted, setNames()
will throw an error if the number of names doesn't match the number of columns, so you've got a bit of error checking built in.
Still, the convenience of using setNames()
has outweighed the risk more often than not for me.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With