Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dcast renaming all variables to start with a number

So I've got data that looks like this:

           id year principal interest
 1: 011000600 2013      0.00     0.00
 2: 011000600 2014    544.03     0.00
 3: 011000700 2013      0.00     0.00
 4: 011000700 2014      0.01     0.00
 5: 011000800 2013    363.44    12.79
 6: 011000800 2014   2005.98     0.00
 7: 011000900 2013      0.00     0.00
 8: 011000900 2014      0.00     0.00
 9: 011001000 2013      0.00     0.00
10: 011001000 2014      0.00     0.00
11: 011001100 2013      0.00     0.00
12: 011001100 2014   1723.24     0.00
13: 011001560 2013      0.00     0.00
14: 011001560 2014      0.00     0.00
15: 011001650 2013      0.00     0.00
16: 011001650 2014      0.00     0.00

(basically a longitudinal sample of a bunch of variables)

The data is on the large side so I'm using data.table for everything. I reshape it to get each id unique by row:

datam<-melt(data,id=c("id","year"))
data1<-dcast.data.table(datam,id~...)

This yields:

          id 2013_principal 2013_interest 2014_principal 2014_interest
1: 011000600           0.00          0.00         544.03             0
2: 011000700           0.00          0.00           0.01             0
3: 011000800         363.44         12.79        2005.98             0
4: 011000900           0.00          0.00           0.00             0
5: 011001000           0.00          0.00           0.00             0
6: 011001100           0.00          0.00        1723.24             0

This is course the form of data that I want, but having column names start with numbers is a pain in the keester.

Any suggestions for how to deal with this? I'd much rather have:

          id principal_2013 interest_2013 principal_2014 interest_2014
1: 011000600           0.00          0.00         544.03             0
2: 011000700           0.00          0.00           0.01             0
3: 011000800         363.44         12.79        2005.98             0
4: 011000900           0.00          0.00           0.00             0
5: 011001000           0.00          0.00           0.00             0
6: 011001100           0.00          0.00        1723.24             0

(switching the year to be a suffix) I've tried being more explicit when casting, e.g.

data2<-dcast.data.table(datam,id~year+...)
data3<-dcast.data.table(datam,id~...+year)

To no avail:

data2
          id 2013_principal 2013_interest 2014_principal 2014_interest
1: 011000600           0.00          0.00         544.03             0
2: 011000700           0.00          0.00           0.01             0
3: 011000800         363.44         12.79        2005.98             0
4: 011000900           0.00          0.00           0.00             0
5: 011001000           0.00          0.00           0.00             0
6: 011001100           0.00          0.00        1723.24             0

data3
          id 2013_principal 2013_interest 2014_principal 2014_interest
1: 011000600           0.00          0.00         544.03             0
2: 011000700           0.00          0.00           0.01             0
3: 011000800         363.44         12.79        2005.98             0
4: 011000900           0.00          0.00           0.00             0
5: 011001000           0.00          0.00           0.00             0
6: 011001100           0.00          0.00        1723.24             0

Seems pretty silly for the naming convention of dcast to default to this style, given that I imagine this type of reshaping is ubiquitous.

I've also tried patching things up ex-post given some other posts I've found (e.g. here), but it runs unfathomably slow (there are ~400 variables to rename in the full data set)

names(data)<-ifelse(substr(names(data),1,2) %in% c("19","20"),    
                    paste(substr(names(data),6,nchar(data)),
                          substr(names(data),1,4),sep="_")   ,
                    names(copy))

(I'm trying to find all the variables starting with years--19xx or 20xx--and trying to swap the beginning and end)

like image 957
MichaelChirico Avatar asked Mar 19 '23 16:03

MichaelChirico


1 Answers

FR #5675 is now implemented in v1.9.3. From NEWS

o dcast.data.table(dt, a ~ ... + b) now generates the column names with values from 'b' coming last. Closes #5675.

That is, now you can do:

dcast.data.table(datam, id ~ ... + year)

#          id principal_2013 principal_2014 interest_2013 interest_2014
# 1: 11000600           0.00         544.03          0.00             0
# 2: 11000700           0.00           0.01          0.00             0
# 3: 11000800         363.44        2005.98         12.79             0
# 4: 11000900           0.00           0.00          0.00             0
# 5: 11001000           0.00           0.00          0.00             0
# 6: 11001100           0.00        1723.24          0.00             0
# 7: 11001560           0.00           0.00          0.00             0
# 8: 11001650           0.00           0.00          0.00             0

and the column names will have year values at the end, as expected.


Also added documentation - Doc #5676. From NEWS:

o ?dcast.data.table now explains how the names are generated for the columns that are being casted. Closes #5676.

Now ?dcast.data.table contains the added line:

Names for columns that are being cast are generated in the same order (separated by a _) from the (unique) values in each column mentioned in the formula RHS.

HTH

like image 160
Arun Avatar answered Apr 03 '23 18:04

Arun