How to control new variables' names after tidyr's spread?

Tags: r, dplyr, tidyr

I have a data frame with a panel structure: two observations for each unit, one from each of two years:

library(tidyr)
mydf <- data.frame(
    id = rep(1:3, rep(2,3)), 
    year = rep(c(2012, 2013), 3), 
    value = runif(6)
)
mydf
#  id year      value
#1  1 2012 0.09668064
#2  1 2013 0.62739399
#3  2 2012 0.45618433
#4  2 2013 0.60347152
#5  3 2012 0.84537624
#6  3 2013 0.33466030

I would like to reshape this data to wide format, which can be done easily with tidyr::spread. However, as the values of the year variable are numbers, the names of my new variables become numbers as well, which makes them harder to use afterwards.

spread(mydf, year, value)
#  id       2012      2013
#1  1 0.09668064 0.6273940
#2  2 0.45618433 0.6034715
#3  3 0.84537624 0.3346603

I know I can easily rename the columns. However, if I would like to reshape within a chain with other operations, it becomes inconvenient. For example, the following line does not do what I want, because filter treats 2012 as a number rather than as a column name:

library(dplyr)
mydf %>% spread(year, value) %>% filter(2012 > 0.5)

The following works but is not that concise:

tmp <- spread(mydf, year, value)
names(tmp) <- c("id", "y2012", "y2013")
filter(tmp, y2012 > 0.5)

Any idea how I can change the new variable names within spread?

481

asked Aug 03 '15 13:08

janosdivenyi

3 Answers

I know some years have passed since this question was originally asked, but for posterity I also want to highlight the sep argument of spread. When it is not NULL, it is used as the separator between the key name and the key values:

mydf %>% 
 spread(key = year, value = value, sep = "")
#  id   year2012  year2013
#1  1 0.15608322 0.6886531
#2  2 0.04598124 0.0792947
#3  3 0.16835445 0.1744542

This is not exactly as wanted in the question, but sufficient for my purposes. See ?spread.

Update with tidyr 1.0.0: tidyr 1.0.0 has now introduced pivot_wider (and pivot_longer), which allows more control in this respect through the arguments names_sep and names_prefix. So now the call would be:

mydf %>% 
  pivot_wider(names_from = year, values_from = value,
              names_prefix = "year")
# # A tibble: 3 x 3
#        id year2012 year2013
#     <int>    <dbl>    <dbl>
#   1     1    0.347    0.388
#   2     2    0.565    0.924
#   3     3    0.406    0.296

To get exactly what was originally wanted (prefixing with "y" only), you can now simply set names_prefix = "y".
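As a minimal sketch of how this fits into the chain from the question (assuming tidyr 1.0.0 or later and dplyr are installed; the set.seed call is my addition, only there to make the random values reproducible):

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: fixed seed, not part of the original question
mydf <- data.frame(
  id = rep(1:3, each = 2),
  year = rep(c(2012, 2013), 3),
  value = runif(6)
)

# Reshape and filter in one pipe; the new columns are named y2012 and y2013
wide <- mydf %>%
  pivot_wider(names_from = year, values_from = value, names_prefix = "y") %>%
  filter(y2012 > 0.5)

names(wide)
```

Because the prefix is applied inside pivot_wider, the filter step can refer to y2012 without backticks or a separate renaming step.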

names_sep is used when you spread over multiple columns, as demonstrated below, where I have added quarters to the data:

# Add quarters to data
mydf2 <- data.frame(
  id = rep(1:3, each = 8), 
  year = rep(rep(c(2012, 2013), each = 4), 3), 
  quarter  = rep(c("Q1","Q2","Q3","Q4"), 3),
  value = runif(24)
)
head(mydf2)
#   id year quarter     value
# 1  1 2012      Q1 0.8651470
# 2  1 2012      Q2 0.3944423
# 3  1 2012      Q3 0.4580580
# 4  1 2012      Q4 0.2902604
# 5  1 2013      Q1 0.4751588
# 6  1 2013      Q2 0.6851755

mydf2 %>% 
  pivot_wider(names_from = c(year, quarter), values_from = value,
              names_sep = "_", names_prefix = "y")
# # A tibble: 3 x 9
#      id  y2012_Q1  y2012_Q2  y2012_Q3  y2012_Q4  y2013_Q1  y2013_Q2  y2013_Q3  y2013_Q4 
#   <int>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
# 1     1     0.865     0.394     0.458    0.290      0.475     0.685     0.213     0.920
# 2     2     0.566     0.614     0.509    0.0515     0.974     0.916     0.681     0.509
# 3     3     0.968     0.615     0.670    0.748      0.723     0.996     0.247     0.449
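If the combination of a prefix and a separator is not flexible enough, pivot_wider also accepts a names_glue template. The sketch below is an alternative I am adding, not part of the original answer, and it assumes tidyr 1.1.0 or later, which is where I believe names_glue was introduced. It should produce the same column names as the call above:

```r
library(dplyr)
library(tidyr)

# Same quarterly structure as above; times = 6 makes the recycling explicit
mydf2 <- data.frame(
  id = rep(1:3, each = 8),
  year = rep(rep(c(2012, 2013), each = 4), 3),
  quarter = rep(c("Q1", "Q2", "Q3", "Q4"), times = 6),
  value = runif(24)
)

# The glue template refers to the names_from columns by name
wide2 <- mydf2 %>%
  pivot_wider(names_from = c(year, quarter), values_from = value,
              names_glue = "y{year}_{quarter}")

names(wide2)
```

The template form is mostly useful when the pieces need to be reordered or wrapped in other text, which names_prefix and names_sep alone cannot express.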

140

answered Oct 18 '22 21:10

Anders Ellern Bilgrau

You can use backticks for column names that start with numbers, and filter should work as expected:

  mydf %>%
      spread(year, value) %>%
      filter(`2012` > 0.5)
  #  id      2012      2013
  #1  3 0.8453762 0.3346603

Another option would be to use unite to join two columns into a single column, after creating a second column 'year1' containing the string 'y'.

  mydf %>%
     mutate(year1='y') %>%
     unite(yearN, year1, year) %>%
     spread(yearN, value) %>%
     filter(y_2012 > 0.5)
 #   id    y_2012    y_2013
 #1  3 0.8453762 0.3346603

We can even change the 'year' column within mutate by using paste:

 mydf %>%
     mutate(year=paste('y', year, sep="_")) %>%
     spread(year, value) %>%
     filter(y_2012 > 0.5)

answered Oct 18 '22 19:10

akrun

Another option is to use the setNames() function as the next thing in the pipe:

mydf %>%
    spread(year, value) %>%
    setNames( c("id", "y2012", "y2013") ) %>%
    filter(y2012 > 0.5)

The only problem with using setNames() is that you have to know exactly what your columns will be when you spread() them. Most of the time, that's not a problem, particularly if you're working semi-interactively.

But if you're missing a key/value pair in your original data, there's a chance it won't show up as a column, and you can end up naming your columns incorrectly without even knowing it. Granted, setNames() will throw an error if you supply more names than there are columns, so you've got a bit of error checking built in. Note, though, that on a plain data frame too few names are padded with NA rather than rejected.

Still, the convenience of using setNames() has outweighed the risk more often than not for me.
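If hard-coding the names feels too fragile, one possible alternative is to derive them from whatever columns spread() actually produced. This is a sketch rather than part of the original answer: it assumes dplyr 1.0.0 or later for rename_with(), and the fixed values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Illustrative data with fixed values (assumption, not from the question)
mydf <- data.frame(
  id = rep(1:3, each = 2),
  year = rep(c(2012, 2013), 3),
  value = c(0.10, 0.63, 0.46, 0.60, 0.85, 0.33)
)

# Prefix every column except id with "y", whatever years are present
res <- mydf %>%
  spread(year, value) %>%
  rename_with(~ paste0("y", .x), -id) %>%
  filter(y2012 > 0.5)

res
```

Since the new names are computed from the existing ones, a missing year simply yields one column fewer instead of silently mislabelled columns.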

answered Oct 18 '22 20:10

crazybilly
