
Reformat data frame with months spread and ordered by their calendar order in R [duplicate]

Tags:

dataframe

r

tidyr

I have the data.frame given below and am trying to move it from long format to wide format, with the dates as the spreading column. Using the spread function from the tidyr package presents a twofold problem:

  • The data is filled with NA
  • The months get ordered by alphabetic order

So how do I go from

30-Apr-2015 632.95
28-May-2015 532.95
25-Jun-2015 232.95

to

30-Apr-2015 28-May-2015 25-Jun-2015
632.95      532.95      232.95

Instead, I end up with

30-Apr-2015 25-Jun-2015 28-May-2015
632.95      NA      NA
NA          232.95  NA
NA          NA      532.95

The actual dates don't matter, but their relative ordering does, i.e. the nearest month's data should go in the first column, followed by the other two months' data in successive order. This is necessary because I'm using rbind on the result.

The code I've tried:

data = tidyr::spread(data, key = EXPIRY_DT, value = CHG_IN_OI)
colnames(data)[3:5] = c('Month1', 'Month2', 'Month3')

The data.frame is as given below:

data = structure(list(SYMBOL = c("A", "A", "A", "B", "B", "B", "C", 
"C", "C", "D", "D", "D"), EXPIRY_DT = c("30-Apr-2015", "28-May-2015", 
"25-Jun-2015", "30-Apr-2015", "28-May-2015", "25-Jun-2015", "30-Apr-2015", 
"28-May-2015", "25-Jun-2015", "30-Apr-2015", "28-May-2015", "25-Jun-2015"
), OPEN = c(1750, 1789, 0, 1627.5, 1653.3, 0, 632.95, 644.1, 
0, 317.8, 319.5, 0), HIGH = c(1788.05, 1795, 0, 1656.5, 1653.3, 
0, 646.4, 650.5, 0, 324.6, 326.65, 0), LOW = c(1746, 1760, 0, 
1627.5, 1645.45, 0, 629.65, 635, 0, 315.85, 318.4, 0), CLOSE = c(1782.3, 
1791.85, 1695.1, 1642.95, 1646.75, 1613.9, 640.85, 644.35, 614.6, 
320.55, 322.35, 310.85), SETTLE_PR = c(1782.3, 1791.85, 1804.8, 
1642.95, 1653.85, 1664.35, 640.85, 644.35, 649.1, 320.55, 322.35, 
325.35), CONTRACTS = c(1469L, 78L, 0L, 2638L, 14L, 0L, 4964L, 
181L, 0L, 3416L, 82L, 0L), VALUE = c(6496.96, 347.91, 0, 10830.05, 
57.68, 0, 15869.41, 583.38, 0, 10969.31, 264.93, 0), OPEN_INT = c(1353750L, 
8500L, 0L, 1377250L, 17000L, 0L, 6264000L, 98000L, 0L, 8228000L, 
216000L, 0L), CHG_IN_OI = c(15250L, 1250L, 0L, -21000L, 1500L, 
0L, 73500L, 6000L, 0L, -192000L, 13000L, 0L), TIMESTAMP = c("10-APR-2015", 
"10-APR-2015", "10-APR-2015", "10-APR-2015", "10-APR-2015", "10-APR-2015", 
"10-APR-2015", "10-APR-2015", "10-APR-2015", "10-APR-2015", "10-APR-2015", 
"10-APR-2015")), .Names = c("SYMBOL", "EXPIRY_DT", "OPEN", "HIGH", 
"LOW", "CLOSE", "SETTLE_PR", "CONTRACTS", "VALUE", "OPEN_INT", 
"CHG_IN_OI", "TIMESTAMP"), row.names = 40:51, class = "data.frame")

Thanks for reading.

Edit:

After comments from @akrun, I'm adding the expected output. Because the values for each date are different, I need the data for each month placed one after another, with the column names suffixed with the string 'Month1/2/3' instead of the actual date. Hope that helps.

output = structure(list(SYMBOL = c("A", "B", "C", "D"), TIMESTAMP = c("10-Apr-15", 
"10-Apr-15", "10-Apr-15", "10-Apr-15"), OPEN.Month1 = c(1750, 
1627.5, 632.95, 317.8), HIGH.Month1 = c(1788.05, 1656.5, 646.4, 
324.6), LOW.Month1 = c(1746, 1627.5, 629.65, 315.85), CLOSE.Month1 = c(1782.3, 
1642.95, 640.85, 320.55), SETTLE_PR.Month1 = c(1782.3, 1642.95, 
640.85, 320.55), CONTRACTS.Month1 = c(1469L, 2638L, 4964L, 3416L
), VALUE.Month1 = c(6496.96, 10830.05, 15869.41, 10969.31), OPEN_INT.Month1 = c(1353750L, 
1377250L, 6264000L, 8228000L), CHG_IN_OI.Month1 = c(15250L, -21000L, 
73500L, -192000L), OPEN.Month2 = c(1789, 1653.3, 644.1, 319.5
), HIGH.Month2 = c(1795, 1653.3, 650.5, 326.65), LOW.Month2 = c(1760, 
1645.45, 635, 318.4), CLOSE.Month2 = c(1791.85, 1646.75, 644.35, 
322.35), SETTLE_PR.Month2 = c(1791.85, 1653.85, 644.35, 322.35
), CONTRACTS.Month2 = c(78L, 14L, 181L, 82L), VALUE.Month2 = c(347.91, 
57.68, 583.38, 264.93), OPEN_INT.Month2 = c(8500L, 17000L, 98000L, 
216000L), CHG_IN_OI.Month2 = c(1250L, 1500L, 6000L, 13000L), 
    OPEN.Month3 = c(0L, 0L, 0L, 0L), HIGH.Month3 = c(0L, 0L, 
    0L, 0L), LOW.Month3 = c(0L, 0L, 0L, 0L), CLOSE.Month3 = c(1695.1, 
    1613.9, 614.6, 310.85), SETTLE_PR.Month3 = c(1804.8, 1664.35, 
    649.1, 325.35), CONTRACTS.Month3 = c(0L, 0L, 0L, 0L), VALUE.Month3 = c(0L, 
    0L, 0L, 0L), OPEN_INT.Month3 = c(0L, 0L, 0L, 0L), CHG_IN_OI.Month3 = c(0L, 
    0L, 0L, 0L)), .Names = c("SYMBOL", "TIMESTAMP", "OPEN.Month1", 
"HIGH.Month1", "LOW.Month1", "CLOSE.Month1", "SETTLE_PR.Month1", 
"CONTRACTS.Month1", "VALUE.Month1", "OPEN_INT.Month1", "CHG_IN_OI.Month1", 
"OPEN.Month2", "HIGH.Month2", "LOW.Month2", "CLOSE.Month2", "SETTLE_PR.Month2", 
"CONTRACTS.Month2", "VALUE.Month2", "OPEN_INT.Month2", "CHG_IN_OI.Month2", 
"OPEN.Month3", "HIGH.Month3", "LOW.Month3", "CLOSE.Month3", "SETTLE_PR.Month3", 
"CONTRACTS.Month3", "VALUE.Month3", "OPEN_INT.Month3", "CHG_IN_OI.Month3"
), class = "data.frame", row.names = c(NA, -4L))
Frash asked Apr 12 '15 04:04



2 Answers

We could use the devel version of data.table, i.e. 'v1.9.5', whose dcast can take multiple "value.var" columns. Instructions to install the devel version are here.

Convert the 'data.frame' to a 'data.table' with setDT(data). Create a "Month" column by pasting the string 'Month' onto the row number within each "SYMBOL". Then we can use dcast, passing columns 3:11 as the value.var.

library(data.table)
res <- dcast(setDT(data)[, Month:=paste0('Month', 1:.N), by=SYMBOL],
                 SYMBOL+TIMESTAMP~Month, value.var=names(data)[3:11])

If we need the column names in the particular format used in 'output', use setnames. I also rearranged the columns to match the order in the expected result ('output') and converted the data.table back to a data.frame (setDF).

setnames(res, sub('([^_]+)_(.*)', '\\2.\\1', colnames(res)))
res1 <- setDF(res[,names(output), with=FALSE])
res1
#  SYMBOL   TIMESTAMP OPEN.Month1 HIGH.Month1 LOW.Month1 CLOSE.Month1
#1      A 10-APR-2015     1750.00     1788.05    1746.00      1782.30
#2      B 10-APR-2015     1627.50     1656.50    1627.50      1642.95
#3      C 10-APR-2015      632.95      646.40     629.65       640.85
#4      D 10-APR-2015      317.80      324.60     315.85       320.55
#  SETTLE_PR.Month1 CONTRACTS.Month1 VALUE.Month1 OPEN_INT.Month1
#1          1782.30             1469      6496.96         1353750
#2          1642.95             2638     10830.05         1377250
#3           640.85             4964     15869.41         6264000
#4           320.55             3416     10969.31         8228000
#  CHG_IN_OI.Month1 OPEN.Month2 HIGH.Month2 LOW.Month2 CLOSE.Month2
#1            15250      1789.0     1795.00    1760.00      1791.85
#2           -21000      1653.3     1653.30    1645.45      1646.75
#3            73500       644.1      650.50     635.00       644.35
#4          -192000       319.5      326.65     318.40       322.35
#  SETTLE_PR.Month2 CONTRACTS.Month2 VALUE.Month2 OPEN_INT.Month2
#1          1791.85               78       347.91            8500
#2          1653.85               14        57.68           17000
#3           644.35              181       583.38           98000
#4           322.35               82       264.93          216000
#  CHG_IN_OI.Month2 OPEN.Month3 HIGH.Month3 LOW.Month3 CLOSE.Month3  
#1             1250           0           0          0      1695.10
#2             1500           0           0          0      1613.90
#3             6000           0           0          0       614.60
#4            13000           0           0          0       310.85
#  SETTLE_PR.Month3 CONTRACTS.Month3 VALUE.Month3 OPEN_INT.Month3
#1          1804.80                0            0               0
#2          1664.35                0            0               0
#3           649.10                0            0               0
#4           325.35                0            0               0
#  CHG_IN_OI.Month3
#1                0
#2                0
#3                0
#4                0

The TIMESTAMP column in 'output' is in a different format. After changing the format in 'res1', the result is the same as the expected output.

res1$TIMESTAMP <- format(as.Date(res1$TIMESTAMP, '%d-%b-%Y'), '%d-%b-%y')
all.equal(output, res1)
#[1] TRUE
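On a released data.table (1.9.6 or later, as far as I know), dcast should accept several value.var columns without the devel build. Below is a minimal sketch on a hypothetical two-column toy table (toy is made up for illustration, not the question's data). It assumes that passing sep = "." yields the Value.MonthN names directly, in which case the setnames() step above would not be needed; on these versions the cast columns appear to be named value-first, so the regex above may not apply unchanged.

```r
library(data.table)

# Hypothetical toy data: two symbols, two expiries each, already in expiry order
toy <- data.table(
  SYMBOL    = c("A", "A", "B", "B"),
  TIMESTAMP = "10-APR-2015",
  CLOSE     = c(1782.30, 1791.85, 1642.95, 1646.75),
  SETTLE_PR = c(1782.30, 1791.85, 1642.95, 1653.85)
)

# Month1, Month2, ... numbered within each SYMBOL
toy[, Month := paste0("Month", seq_len(.N)), by = SYMBOL]

# Cast both value columns at once; sep = "." joins value name and Month label
wide <- dcast(toy, SYMBOL + TIMESTAMP ~ Month,
              value.var = c("CLOSE", "SETTLE_PR"), sep = ".")
names(wide)
```

The cast columns come out grouped by value column (all CLOSE columns, then all SETTLE_PR columns), so a reordering step would still be needed to match 'output'.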

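Or we can use reshape from base R, which also accepts multiple value columns. Just as we created a sequence earlier, here we can use ave to create a 'MONTH' column and use that as the timevar within reshape. (Run this on the original data.frame: after setDT(data) above, data[-2] would drop a row rather than the EXPIRY_DT column.)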

 data$MONTH <- with(data, paste0('MONTH', ave(seq_along(SYMBOL), 
                    SYMBOL, FUN=seq_along)))
 res2 <- reshape(data[-2], idvar=c('SYMBOL', 'TIMESTAMP'), 
                          timevar='MONTH', direction='wide')
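To see what the reshape call produces, here is a minimal sketch on a hypothetical toy data.frame (toy and its two value columns are made up for illustration). It uses the same ave plus reshape pattern, only with a 'Month' prefix instead of 'MONTH':

```r
# Hypothetical toy data: two symbols, two expiries each, already in expiry order
toy <- data.frame(
  SYMBOL    = c("A", "A", "B", "B"),
  TIMESTAMP = "10-APR-2015",
  CLOSE     = c(1782.30, 1791.85, 1642.95, 1646.75),
  SETTLE_PR = c(1782.30, 1791.85, 1642.95, 1653.85),
  stringsAsFactors = FALSE
)

# Number the rows within each SYMBOL and turn that into a label
toy$MONTH <- paste0("Month", ave(seq_along(toy$SYMBOL), toy$SYMBOL, FUN = seq_along))

# Every column that is neither idvar nor timevar is spread across the months
wide <- reshape(toy, idvar = c("SYMBOL", "TIMESTAMP"),
                timevar = "MONTH", direction = "wide")
names(wide)
## [1] "SYMBOL"           "TIMESTAMP"        "CLOSE.Month1"     "SETTLE_PR.Month1"
## [5] "CLOSE.Month2"     "SETTLE_PR.Month2"
```

Here the columns come out grouped by month, which is the layout the expected 'output' uses.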
akrun answered Nov 15 '22 05:11


Extremely tough problem. I've devised a solution that comes very close to your sample output; you should be able to clean up the little discrepancies afterward (see the end of my answer for a summary of discrepancies).

Assumptions

First, let me start with my assumptions:

  • The input data.frame data is already properly ordered with respect to the EXPIRY_DT (independently for each SYMBOL). Your sample input satisfies this assumption. Now, as a general recommendation, you should try to always use ISO 8601 for date formats, which naturally sort lexicographically, and would naturally allow you to coerce to Date format in R. Given your input date formats, if you wanted to guarantee proper order, you would have to call as.Date() and pass the input format, and then make a call to order(). Instead of including this in my code, I've just made the assumption that the data is already ordered.
  • Because your sample output seems to have unified all values of TIMESTAMP for each SYMBOL, I've made the assumption that those two columns comprise a multicolumn primary key to the data. If this is incorrect, you can simply change the keys variable I define in my code to not include TIMESTAMP. But if that is the case, then you will get additional TIMESTAMP.Month{mnum} columns in the output (which you could remove afterward, if desired).
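If the rows were not already in expiry order, the as.Date() plus order() step mentioned in the first assumption could be sketched as follows. The unsorted toy frame d is hypothetical, and parsing %b depends on the locale (an English locale is assumed here):

```r
# Hypothetical unsorted input: expiries for one symbol in scrambled order
d <- data.frame(
  SYMBOL    = c("A", "A", "A"),
  EXPIRY_DT = c("25-Jun-2015", "30-Apr-2015", "28-May-2015"),
  CLOSE     = c(1695.10, 1782.30, 1791.85),
  stringsAsFactors = FALSE
)

# Parse the day-month-year strings into Date objects (assumes English month names)
expiry <- as.Date(d$EXPIRY_DT, format = "%d-%b-%Y")

# Sort by symbol first, then by the parsed expiry date
d_sorted <- d[order(d$SYMBOL, expiry), ]
d_sorted$EXPIRY_DT
## [1] "30-Apr-2015" "28-May-2015" "25-Jun-2015"
```

Running this before the code below would make the "already ordered" assumption hold by construction.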

Code

keys <- c('SYMBOL','TIMESTAMP');
mnum <- ave(1:nrow(data), data[,keys], FUN=seq_along );
mnum;
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3
mdata <- lapply(1:max(mnum), function(x) setNames(data[mnum==x,],ifelse(names(data)%in%keys,names(data),paste0(names(data),'.Month',x))) );
mdata;
## [[1]]
##    SYMBOL EXPIRY_DT.Month1 OPEN.Month1 HIGH.Month1 LOW.Month1 CLOSE.Month1 SETTLE_PR.Month1 CONTRACTS.Month1 VALUE.Month1 OPEN_INT.Month1 CHG_IN_OI.Month1   TIMESTAMP
## 40      A      30-Apr-2015     1750.00     1788.05    1746.00      1782.30          1782.30             1469      6496.96         1353750            15250 10-APR-2015
## 43      B      30-Apr-2015     1627.50     1656.50    1627.50      1642.95          1642.95             2638     10830.05         1377250           -21000 10-APR-2015
## 46      C      30-Apr-2015      632.95      646.40     629.65       640.85           640.85             4964     15869.41         6264000            73500 10-APR-2015
## 49      D      30-Apr-2015      317.80      324.60     315.85       320.55           320.55             3416     10969.31         8228000          -192000 10-APR-2015
## 
## [[2]]
##    SYMBOL EXPIRY_DT.Month2 OPEN.Month2 HIGH.Month2 LOW.Month2 CLOSE.Month2 SETTLE_PR.Month2 CONTRACTS.Month2 VALUE.Month2 OPEN_INT.Month2 CHG_IN_OI.Month2   TIMESTAMP
## 41      A      28-May-2015      1789.0     1795.00    1760.00      1791.85          1791.85               78       347.91            8500             1250 10-APR-2015
## 44      B      28-May-2015      1653.3     1653.30    1645.45      1646.75          1653.85               14        57.68           17000             1500 10-APR-2015
## 47      C      28-May-2015       644.1      650.50     635.00       644.35           644.35              181       583.38           98000             6000 10-APR-2015
## 50      D      28-May-2015       319.5      326.65     318.40       322.35           322.35               82       264.93          216000            13000 10-APR-2015
## 
## [[3]]
##    SYMBOL EXPIRY_DT.Month3 OPEN.Month3 HIGH.Month3 LOW.Month3 CLOSE.Month3 SETTLE_PR.Month3 CONTRACTS.Month3 VALUE.Month3 OPEN_INT.Month3 CHG_IN_OI.Month3   TIMESTAMP
## 42      A      25-Jun-2015           0           0          0      1695.10          1804.80                0            0               0                0 10-APR-2015
## 45      B      25-Jun-2015           0           0          0      1613.90          1664.35                0            0               0                0 10-APR-2015
## 48      C      25-Jun-2015           0           0          0       614.60           649.10                0            0               0                0 10-APR-2015
## 51      D      25-Jun-2015           0           0          0       310.85           325.35                0            0               0                0 10-APR-2015
## 
res <- Reduce(function(x,y) merge(x,y,by=keys,all=T), mdata );
res;
##   SYMBOL   TIMESTAMP EXPIRY_DT.Month1 OPEN.Month1 HIGH.Month1 LOW.Month1 CLOSE.Month1 SETTLE_PR.Month1 CONTRACTS.Month1 VALUE.Month1 OPEN_INT.Month1 CHG_IN_OI.Month1 EXPIRY_DT.Month2 OPEN.Month2 HIGH.Month2 LOW.Month2 CLOSE.Month2 SETTLE_PR.Month2 CONTRACTS.Month2 VALUE.Month2 OPEN_INT.Month2 CHG_IN_OI.Month2 EXPIRY_DT.Month3 OPEN.Month3 HIGH.Month3 LOW.Month3 CLOSE.Month3 SETTLE_PR.Month3 CONTRACTS.Month3 VALUE.Month3 OPEN_INT.Month3 CHG_IN_OI.Month3
## 1      A 10-APR-2015      30-Apr-2015     1750.00     1788.05    1746.00      1782.30          1782.30             1469      6496.96         1353750            15250      28-May-2015      1789.0     1795.00    1760.00      1791.85          1791.85               78       347.91            8500             1250      25-Jun-2015           0           0          0      1695.10          1804.80                0            0               0                0
## 2      B 10-APR-2015      30-Apr-2015     1627.50     1656.50    1627.50      1642.95          1642.95             2638     10830.05         1377250           -21000      28-May-2015      1653.3     1653.30    1645.45      1646.75          1653.85               14        57.68           17000             1500      25-Jun-2015           0           0          0      1613.90          1664.35                0            0               0                0
## 3      C 10-APR-2015      30-Apr-2015      632.95      646.40     629.65       640.85           640.85             4964     15869.41         6264000            73500      28-May-2015       644.1      650.50     635.00       644.35           644.35              181       583.38           98000             6000      25-Jun-2015           0           0          0       614.60           649.10                0            0               0                0
## 4      D 10-APR-2015      30-Apr-2015      317.80      324.60     315.85       320.55           320.55             3416     10969.31         8228000          -192000      28-May-2015       319.5      326.65     318.40       322.35           322.35               82       264.93          216000            13000      25-Jun-2015           0           0          0       310.85           325.35                0            0               0                0

Explanation

As you can see, the core of my solution involves splitting the input data into separate data.frames by month number, which makes possible adding suffixes to all non-key columns independently for each split, and then repeatedly calling merge() to merge them all together.

The mnum vector stands for "month number". You could consider it to be a kind of "detached" column of the input data object; it represents the month number within the primary key group to which each row in data belongs. I use ave() to call seq_along() once for each group, which generates a sequential integer vector of length equal to the group size (i.e. number of rows in the group), which ave() maps back to the positions of the group rows in the original data object.
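To make the ave() step concrete, here is a small sketch with a hypothetical, deliberately interleaved symbol vector (sym is made up for illustration). It shows the per-group sequence being mapped back to the original row positions:

```r
# Hypothetical grouping vector; groups are interleaved on purpose
sym <- c("A", "B", "A", "A", "B")

# seq_along() runs once per group; ave() puts each result back in place
mnum <- ave(seq_along(sym), sym, FUN = seq_along)
mnum
## [1] 1 1 2 3 2
```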

The mdata object is a list of data.frames, where each component represents one month number. The actual extraction of the rows with a particular month number is done with a simple logical index operation:

data[mnum==x,]

where x is the mnum element, iterated over 1:max(mnum) by lapply(). The suffixing of non-key column names is done using setNames(), deriving the replacement column names as follows:

ifelse(names(data)%in%keys,names(data),paste0(names(data),'.Month',x))

The above leaves the names of key-columns untouched, but appends '.Month{mnum}' to the names of all non-key-columns.
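As a quick illustration of that renaming rule in isolation, here is a sketch on a hypothetical vector of column names (nms and x = 2 are chosen just for the example):

```r
keys <- c("SYMBOL", "TIMESTAMP")
nms  <- c("SYMBOL", "EXPIRY_DT", "OPEN", "TIMESTAMP")  # hypothetical column names
x    <- 2                                              # month number being processed

# Key names pass through unchanged; all other names get the month suffix
new_nms <- ifelse(nms %in% keys, nms, paste0(nms, ".Month", x))
new_nms
## [1] "SYMBOL"           "EXPIRY_DT.Month2" "OPEN.Month2"      "TIMESTAMP"
```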

Finally, all month-number splits must be merged back into one data.frame. I thought I'd be able to use a single call to merge() (possibly with a little help from do.call()) to do this, but was disappointed to discover that it only takes two arguments to merge, x and y (also see Simultaneously merge multiple data.frames in a list). Thus, I needed to call Reduce() to achieve the repeated calls. The all=T argument would be important if your different symbols had different numbers of expiry dates; then "short" symbols would not be represented on the RHS of the final merge(s), and thus would be dropped, if all=T was not passed.
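The effect of all=T can be seen in a small sketch with two hypothetical month splits, where symbol "B" has only one expiry (m1 and m2 are made up for illustration):

```r
# Hypothetical month splits: "B" is missing from the second month
m1 <- data.frame(SYMBOL = c("A", "B"), CLOSE.Month1 = c(1, 2), stringsAsFactors = FALSE)
m2 <- data.frame(SYMBOL = "A", CLOSE.Month2 = 3, stringsAsFactors = FALSE)
splits <- list(m1, m2)

# Default merge is an inner join: "B" is dropped
inner <- Reduce(function(x, y) merge(x, y, by = "SYMBOL"), splits)

# all = TRUE keeps "B" and fills its missing month with NA
full <- Reduce(function(x, y) merge(x, y, by = "SYMBOL", all = TRUE), splits)

nrow(inner)
## [1] 1
full$CLOSE.Month2
## [1]  3 NA
```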

Discrepancies

My output doesn't exactly match your sample output. Here are the discrepancies:

  • Your sample output seems to have changed the format of the TIMESTAMP column from what it was in the input, for example, 10-APR-2015 changed to 10-Apr-15. My code does not touch the format of TIMESTAMP.
  • Your sample output is lacking the EXPIRY_DT columns, which my solution retains under their suffixed EXPIRY_DT.Month1, EXPIRY_DT.Month2, etc. names. You can remove those columns afterward with grep() on names() and negative indexing, if so desired.
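Both discrepancies could be cleaned up along the following lines. This is only a sketch on a hypothetical one-row stand-in for the merged result (res_toy is made up for illustration); the TIMESTAMP reformatting again assumes an English locale for %b:

```r
# Hypothetical stand-in for the merged result, one row only
res_toy <- data.frame(
  SYMBOL = "A", TIMESTAMP = "10-APR-2015",
  EXPIRY_DT.Month1 = "30-Apr-2015", OPEN.Month1 = 1750,
  EXPIRY_DT.Month2 = "28-May-2015", OPEN.Month2 = 1789,
  stringsAsFactors = FALSE
)

# Find the suffixed EXPIRY_DT columns and drop them by negative indexing
drop <- grep("^EXPIRY_DT\\.", names(res_toy))
res_clean <- if (length(drop) > 0) res_toy[-drop] else res_toy

# Re-render TIMESTAMP in the dd-Mon-yy style used by the sample output
res_clean$TIMESTAMP <- format(as.Date(res_clean$TIMESTAMP, "%d-%b-%Y"), "%d-%b-%y")
names(res_clean)
## [1] "SYMBOL"      "TIMESTAMP"   "OPEN.Month1" "OPEN.Month2"
```

The length check guards against the case where grep() finds nothing, since indexing with an empty negative vector would return zero columns.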
bgoldst answered Nov 15 '22 03:11