I have the data.frame given below and am trying to move it from long format to wide format, with the dates as the spreading column. Using the spread function from the tidyr
package presents a two-fold problem:
So how do I go from
30-Apr-2015 632.95
28-May-2015 532.95
25-Jun-2015 232.95
to
30-Apr-2015 28-May-2015 25-Jun-2015
632.95 532.95 232.95
Instead, I end up at
30-Apr-2015 25-Jun-2015 28-May-2015
632.95 NA 232.95
NA 232.95 NA
NA NA 532.95
The actual dates don't matter, but their relative ordering does: the nearest month's data should go in the first column, followed by the other two months' data in successive order. This is necessary because I'm using rbind on the result.
The code I've tried
data = tidyr::spread(data, key = EXPIRY_DT, value = CHG_IN_OI)
colnames(data)[3:5] = c('Month1', 'Month2', 'Month3')
The data.frame is as given below:
data = structure(list(SYMBOL = c("A", "A", "A", "B", "B", "B", "C",
"C", "C", "D", "D", "D"), EXPIRY_DT = c("30-Apr-2015", "28-May-2015",
"25-Jun-2015", "30-Apr-2015", "28-May-2015", "25-Jun-2015", "30-Apr-2015",
"28-May-2015", "25-Jun-2015", "30-Apr-2015", "28-May-2015", "25-Jun-2015"
), OPEN = c(1750, 1789, 0, 1627.5, 1653.3, 0, 632.95, 644.1,
0, 317.8, 319.5, 0), HIGH = c(1788.05, 1795, 0, 1656.5, 1653.3,
0, 646.4, 650.5, 0, 324.6, 326.65, 0), LOW = c(1746, 1760, 0,
1627.5, 1645.45, 0, 629.65, 635, 0, 315.85, 318.4, 0), CLOSE = c(1782.3,
1791.85, 1695.1, 1642.95, 1646.75, 1613.9, 640.85, 644.35, 614.6,
320.55, 322.35, 310.85), SETTLE_PR = c(1782.3, 1791.85, 1804.8,
1642.95, 1653.85, 1664.35, 640.85, 644.35, 649.1, 320.55, 322.35,
325.35), CONTRACTS = c(1469L, 78L, 0L, 2638L, 14L, 0L, 4964L,
181L, 0L, 3416L, 82L, 0L), VALUE = c(6496.96, 347.91, 0, 10830.05,
57.68, 0, 15869.41, 583.38, 0, 10969.31, 264.93, 0), OPEN_INT = c(1353750L,
8500L, 0L, 1377250L, 17000L, 0L, 6264000L, 98000L, 0L, 8228000L,
216000L, 0L), CHG_IN_OI = c(15250L, 1250L, 0L, -21000L, 1500L,
0L, 73500L, 6000L, 0L, -192000L, 13000L, 0L), TIMESTAMP = c("10-APR-2015",
"10-APR-2015", "10-APR-2015", "10-APR-2015", "10-APR-2015", "10-APR-2015",
"10-APR-2015", "10-APR-2015", "10-APR-2015", "10-APR-2015", "10-APR-2015",
"10-APR-2015")), .Names = c("SYMBOL", "EXPIRY_DT", "OPEN", "HIGH",
"LOW", "CLOSE", "SETTLE_PR", "CONTRACTS", "VALUE", "OPEN_INT",
"CHG_IN_OI", "TIMESTAMP"), row.names = 40:51, class = "data.frame")
Thanks for reading.
Edit:
After comments from @akrun, I am adding the expected output. Because the values for each date are different, I need the data for each month placed one after another, with the column names appended with the string 'Month1/2/3' instead of the actual date. Hope that helps.
output = structure(list(SYMBOL = c("A", "B", "C", "D"), TIMESTAMP = c("10-Apr-15",
"10-Apr-15", "10-Apr-15", "10-Apr-15"), OPEN.Month1 = c(1750,
1627.5, 632.95, 317.8), HIGH.Month1 = c(1788.05, 1656.5, 646.4,
324.6), LOW.Month1 = c(1746, 1627.5, 629.65, 315.85), CLOSE.Month1 = c(1782.3,
1642.95, 640.85, 320.55), SETTLE_PR.Month1 = c(1782.3, 1642.95,
640.85, 320.55), CONTRACTS.Month1 = c(1469L, 2638L, 4964L, 3416L
), VALUE.Month1 = c(6496.96, 10830.05, 15869.41, 10969.31), OPEN_INT.Month1 = c(1353750L,
1377250L, 6264000L, 8228000L), CHG_IN_OI.Month1 = c(15250L, -21000L,
73500L, -192000L), OPEN.Month2 = c(1789, 1653.3, 644.1, 319.5
), HIGH.Month2 = c(1795, 1653.3, 650.5, 326.65), LOW.Month2 = c(1760,
1645.45, 635, 318.4), CLOSE.Month2 = c(1791.85, 1646.75, 644.35,
322.35), SETTLE_PR.Month2 = c(1791.85, 1653.85, 644.35, 322.35
), CONTRACTS.Month2 = c(78L, 14L, 181L, 82L), VALUE.Month2 = c(347.91,
57.68, 583.38, 264.93), OPEN_INT.Month2 = c(8500L, 17000L, 98000L,
216000L), CHG_IN_OI.Month2 = c(1250L, 1500L, 6000L, 13000L),
OPEN.Month3 = c(0L, 0L, 0L, 0L), HIGH.Month3 = c(0L, 0L,
0L, 0L), LOW.Month3 = c(0L, 0L, 0L, 0L), CLOSE.Month3 = c(1695.1,
1613.9, 614.6, 310.85), SETTLE_PR.Month3 = c(1804.8, 1664.35,
649.1, 325.35), CONTRACTS.Month3 = c(0L, 0L, 0L, 0L), VALUE.Month3 = c(0L,
0L, 0L, 0L), OPEN_INT.Month3 = c(0L, 0L, 0L, 0L), CHG_IN_OI.Month3 = c(0L,
0L, 0L, 0L)), .Names = c("SYMBOL", "TIMESTAMP", "OPEN.Month1",
"HIGH.Month1", "LOW.Month1", "CLOSE.Month1", "SETTLE_PR.Month1",
"CONTRACTS.Month1", "VALUE.Month1", "OPEN_INT.Month1", "CHG_IN_OI.Month1",
"OPEN.Month2", "HIGH.Month2", "LOW.Month2", "CLOSE.Month2", "SETTLE_PR.Month2",
"CONTRACTS.Month2", "VALUE.Month2", "OPEN_INT.Month2", "CHG_IN_OI.Month2",
"OPEN.Month3", "HIGH.Month3", "LOW.Month3", "CLOSE.Month3", "SETTLE_PR.Month3",
"CONTRACTS.Month3", "VALUE.Month3", "OPEN_INT.Month3", "CHG_IN_OI.Month3"
), class = "data.frame", row.names = c(NA, -4L))
We could use the devel version of data.table, i.e. 'v1.9.5', which can take multiple "value.vars". Instructions to install the devel version are here.
Change the 'data.frame' to 'data.table' (setDT(data)). Create a "Month" column by pasting 'Month' with the row number for each "SYMBOL". Then, we can use dcast, specifying the value.var as the columns '3:11'.
library(data.table)
res <- dcast(setDT(data)[, Month:=paste0('Month', 1:.N), by=SYMBOL],
SYMBOL+TIMESTAMP~Month, value.var=names(data)[3:11])
If we need to change the column names to the particular format in the 'output', use setnames. I rearranged the order of the columns as in the expected result ('output') and changed the data.table to data.frame (setDF).
setnames(res, sub('([^_]+)_(.*)', '\\2.\\1', colnames(res)))
res1 <- setDF(res[,names(output), with=FALSE])
res1
# SYMBOL TIMESTAMP OPEN.Month1 HIGH.Month1 LOW.Month1 CLOSE.Month1
#1 A 10-APR-2015 1750.00 1788.05 1746.00 1782.30
#2 B 10-APR-2015 1627.50 1656.50 1627.50 1642.95
#3 C 10-APR-2015 632.95 646.40 629.65 640.85
#4 D 10-APR-2015 317.80 324.60 315.85 320.55
# SETTLE_PR.Month1 CONTRACTS.Month1 VALUE.Month1 OPEN_INT.Month1
#1 1782.30 1469 6496.96 1353750
#2 1642.95 2638 10830.05 1377250
#3 640.85 4964 15869.41 6264000
#4 320.55 3416 10969.31 8228000
# CHG_IN_OI.Month1 OPEN.Month2 HIGH.Month2 LOW.Month2 CLOSE.Month2
#1 15250 1789.0 1795.00 1760.00 1791.85
#2 -21000 1653.3 1653.30 1645.45 1646.75
#3 73500 644.1 650.50 635.00 644.35
#4 -192000 319.5 326.65 318.40 322.35
# SETTLE_PR.Month2 CONTRACTS.Month2 VALUE.Month2 OPEN_INT.Month2
#1 1791.85 78 347.91 8500
#2 1653.85 14 57.68 17000
#3 644.35 181 583.38 98000
#4 322.35 82 264.93 216000
# CHG_IN_OI.Month2 OPEN.Month3 HIGH.Month3 LOW.Month3 CLOSE.Month3
#1 1250 0 0 0 1695.10
#2 1500 0 0 0 1613.90
#3 6000 0 0 0 614.60
#4 13000 0 0 0 310.85
# SETTLE_PR.Month3 CONTRACTS.Month3 VALUE.Month3 OPEN_INT.Month3
#1 1804.80 0 0 0
#2 1664.35 0 0 0
#3 649.10 0 0 0
#4 325.35 0 0 0
# CHG_IN_OI.Month3
#1 0
#2 0
#3 0
#4 0
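The renaming regex used with setnames above can be checked in isolation. The following is a minimal sketch on hypothetical column names, assuming dcast produced names in the 'Month1_OPEN' layout (month label first) that the pattern expects. If colnames(res) instead show the value column first (for example 'SETTLE_PR_Month1'), this pattern would split at the wrong underscore, so it is worth inspecting colnames(res) before renaming.

```r
# Hypothetical column names in the 'Month1_OPEN' layout the regex assumes.
old <- c("SYMBOL", "TIMESTAMP", "Month1_OPEN", "Month1_SETTLE_PR", "Month3_CHG_IN_OI")

# ([^_]+) captures everything before the first underscore (the month label);
# (.*) captures the rest (the original column name, underscores included).
# Names without an underscore do not match and are returned unchanged.
new <- sub("([^_]+)_(.*)", "\\2.\\1", old)
print(new)
```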
The TIMESTAMP column in 'output' was in a different format. After changing the format in 'res1', it is the same as the expected output.
res1$TIMESTAMP <- format(as.Date(res1$TIMESTAMP, '%d-%b-%Y'), '%d-%b-%y')
all.equal(output, res1)
#[1] TRUE
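The TIMESTAMP conversion above can be tried on its own. This is a small sketch on a hypothetical vector; it assumes English month abbreviations, so the time locale is set to "C" first (that line is my addition, not part of the answer).

```r
# Use the C locale so that %b reads and writes English month abbreviations.
invisible(Sys.setlocale("LC_TIME", "C"))

ts <- c("10-APR-2015", "10-APR-2015")          # hypothetical input values

# Parse day-month-year with a four-digit year, then print with a two-digit year.
parsed <- as.Date(ts, format = "%d-%b-%Y")
short  <- format(parsed, "%d-%b-%y")
print(short)
```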
Or we can use reshape from base R, which does take multiple value columns. Just as we created a sequence earlier, here we can use ave to create the 'MONTH' column and use that as timevar within reshape.
data$MONTH <- with(data, paste0('MONTH', ave(seq_along(SYMBOL),
SYMBOL, FUN=seq_along)))
res2 <- reshape(data[-2], idvar=c('SYMBOL', 'TIMESTAMP'),
timevar='MONTH', direction='wide')
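To see what the reshape call produces without the full data set, here is a minimal sketch on a hypothetical two-symbol, two-month data.frame (the toy values are copied from the question; the object names toy and wide are mine). The final renaming step is an assumption about what is wanted: it only changes 'MONTH' to 'Month' so the names resemble those in the expected output.

```r
# Hypothetical cut-down version of the question's data.
toy <- data.frame(
  SYMBOL    = c("A", "A", "B", "B"),
  EXPIRY_DT = c("30-Apr-2015", "28-May-2015", "30-Apr-2015", "28-May-2015"),
  OPEN      = c(1750, 1789, 1627.5, 1653.3),
  CLOSE     = c(1782.3, 1791.85, 1642.95, 1646.75),
  TIMESTAMP = "10-APR-2015",
  stringsAsFactors = FALSE
)

# Number the rows within each SYMBOL and use that label as the time variable.
toy$MONTH <- with(toy, paste0("MONTH", ave(seq_along(SYMBOL), SYMBOL, FUN = seq_along)))

# Drop EXPIRY_DT (column 2) and spread every remaining value column.
wide <- reshape(toy[-2], idvar = c("SYMBOL", "TIMESTAMP"),
                timevar = "MONTH", direction = "wide")

# Optional: match the 'Month1' casing used in the expected output.
names(wide) <- sub("MONTH", "Month", names(wide), fixed = TRUE)
print(names(wide))
```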
Extremely tough problem. I've devised a solution that comes very close to your sample output; you should be able to clean up the little discrepancies afterward (see the end of my answer for a summary of discrepancies).
First, let me start with my assumptions:
- data is already properly ordered with respect to the EXPIRY_DT (independently for each SYMBOL). Your sample input satisfies this assumption. Now, as a general recommendation, you should try to always use ISO 8601 for date formats, which naturally sort lexicographically and can be coerced directly to Date format in R. Given your input date formats, if you wanted to guarantee proper order, you would have to call as.Date() and pass the input format, and then make a call to order(). Instead of including this in my code, I've just made the assumption that the data is already ordered.
- Since there is a single TIMESTAMP for each SYMBOL, I've made the assumption that those two columns comprise a multicolumn primary key to the data. If this is incorrect, you can simply change the keys variable I define in my code to not include TIMESTAMP. But if that is the case, then you will get additional TIMESTAMP.Month{mnum} columns in the output (which you could remove afterward, if desired).
keys <- c('SYMBOL','TIMESTAMP');
mnum <- ave(1:nrow(data), data[,keys], FUN=seq_along );
mnum;
## [1] 1 2 3 1 2 3 1 2 3 1 2 3
mdata <- lapply(1:max(mnum), function(x) setNames(data[mnum==x,],ifelse(names(data)%in%keys,names(data),paste0(names(data),'.Month',x))) );
mdata;
## [[1]]
## SYMBOL EXPIRY_DT.Month1 OPEN.Month1 HIGH.Month1 LOW.Month1 CLOSE.Month1 SETTLE_PR.Month1 CONTRACTS.Month1 VALUE.Month1 OPEN_INT.Month1 CHG_IN_OI.Month1 TIMESTAMP
## 40 A 30-Apr-2015 1750.00 1788.05 1746.00 1782.30 1782.30 1469 6496.96 1353750 15250 10-APR-2015
## 43 B 30-Apr-2015 1627.50 1656.50 1627.50 1642.95 1642.95 2638 10830.05 1377250 -21000 10-APR-2015
## 46 C 30-Apr-2015 632.95 646.40 629.65 640.85 640.85 4964 15869.41 6264000 73500 10-APR-2015
## 49 D 30-Apr-2015 317.80 324.60 315.85 320.55 320.55 3416 10969.31 8228000 -192000 10-APR-2015
##
## [[2]]
## SYMBOL EXPIRY_DT.Month2 OPEN.Month2 HIGH.Month2 LOW.Month2 CLOSE.Month2 SETTLE_PR.Month2 CONTRACTS.Month2 VALUE.Month2 OPEN_INT.Month2 CHG_IN_OI.Month2 TIMESTAMP
## 41 A 28-May-2015 1789.0 1795.00 1760.00 1791.85 1791.85 78 347.91 8500 1250 10-APR-2015
## 44 B 28-May-2015 1653.3 1653.30 1645.45 1646.75 1653.85 14 57.68 17000 1500 10-APR-2015
## 47 C 28-May-2015 644.1 650.50 635.00 644.35 644.35 181 583.38 98000 6000 10-APR-2015
## 50 D 28-May-2015 319.5 326.65 318.40 322.35 322.35 82 264.93 216000 13000 10-APR-2015
##
## [[3]]
## SYMBOL EXPIRY_DT.Month3 OPEN.Month3 HIGH.Month3 LOW.Month3 CLOSE.Month3 SETTLE_PR.Month3 CONTRACTS.Month3 VALUE.Month3 OPEN_INT.Month3 CHG_IN_OI.Month3 TIMESTAMP
## 42 A 25-Jun-2015 0 0 0 1695.10 1804.80 0 0 0 0 10-APR-2015
## 45 B 25-Jun-2015 0 0 0 1613.90 1664.35 0 0 0 0 10-APR-2015
## 48 C 25-Jun-2015 0 0 0 614.60 649.10 0 0 0 0 10-APR-2015
## 51 D 25-Jun-2015 0 0 0 310.85 325.35 0 0 0 0 10-APR-2015
##
res <- Reduce(function(x,y) merge(x,y,by=keys,all=T), mdata );
res;
## SYMBOL TIMESTAMP EXPIRY_DT.Month1 OPEN.Month1 HIGH.Month1 LOW.Month1 CLOSE.Month1 SETTLE_PR.Month1 CONTRACTS.Month1 VALUE.Month1 OPEN_INT.Month1 CHG_IN_OI.Month1 EXPIRY_DT.Month2 OPEN.Month2 HIGH.Month2 LOW.Month2 CLOSE.Month2 SETTLE_PR.Month2 CONTRACTS.Month2 VALUE.Month2 OPEN_INT.Month2 CHG_IN_OI.Month2 EXPIRY_DT.Month3 OPEN.Month3 HIGH.Month3 LOW.Month3 CLOSE.Month3 SETTLE_PR.Month3 CONTRACTS.Month3 VALUE.Month3 OPEN_INT.Month3 CHG_IN_OI.Month3
## 1 A 10-APR-2015 30-Apr-2015 1750.00 1788.05 1746.00 1782.30 1782.30 1469 6496.96 1353750 15250 28-May-2015 1789.0 1795.00 1760.00 1791.85 1791.85 78 347.91 8500 1250 25-Jun-2015 0 0 0 1695.10 1804.80 0 0 0 0
## 2 B 10-APR-2015 30-Apr-2015 1627.50 1656.50 1627.50 1642.95 1642.95 2638 10830.05 1377250 -21000 28-May-2015 1653.3 1653.30 1645.45 1646.75 1653.85 14 57.68 17000 1500 25-Jun-2015 0 0 0 1613.90 1664.35 0 0 0 0
## 3 C 10-APR-2015 30-Apr-2015 632.95 646.40 629.65 640.85 640.85 4964 15869.41 6264000 73500 28-May-2015 644.1 650.50 635.00 644.35 644.35 181 583.38 98000 6000 25-Jun-2015 0 0 0 614.60 649.10 0 0 0 0
## 4 D 10-APR-2015 30-Apr-2015 317.80 324.60 315.85 320.55 320.55 3416 10969.31 8228000 -192000 28-May-2015 319.5 326.65 318.40 322.35 322.35 82 264.93 216000 13000 25-Jun-2015 0 0 0 310.85 325.35 0 0 0 0
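The code above relies on the first assumption, that rows are already ordered by EXPIRY_DT within each SYMBOL. If that cannot be guaranteed, the as.Date() plus order() step mentioned in the assumptions could look roughly like this. It is a sketch on hypothetical, deliberately shuffled rows; the locale line and the object names toy, expiry and sorted are my own.

```r
# Hypothetical rows, deliberately out of date order within each SYMBOL.
toy <- data.frame(
  SYMBOL    = c("A", "A", "A", "B", "B", "B"),
  EXPIRY_DT = c("25-Jun-2015", "30-Apr-2015", "28-May-2015",
                "28-May-2015", "25-Jun-2015", "30-Apr-2015"),
  CLOSE     = c(1695.1, 1782.3, 1791.85, 1646.75, 1613.9, 1642.95),
  stringsAsFactors = FALSE
)

# Assumes English month abbreviations; the C locale provides them.
invisible(Sys.setlocale("LC_TIME", "C"))

# Parse the dates, then sort by symbol first and by expiry date second.
expiry <- as.Date(toy$EXPIRY_DT, format = "%d-%b-%Y")
sorted <- toy[order(toy$SYMBOL, expiry), ]
rownames(sorted) <- NULL
print(sorted)
```

Sorting on the parsed Date rather than on the character column matters here, because the strings would otherwise sort alphabetically ("25-Jun" before "28-May" before "30-Apr").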
As you can see, the core of my solution involves splitting the input data into separate data.frames by month number, which makes it possible to add suffixes to all non-key columns independently for each split, and then repeatedly calling merge() to merge them all together.
The mnum vector stands for "month number". You could consider it to be a kind of "detached" column of the input data object; it represents the month number within the primary key group to which each row in data belongs. I use ave() to call seq_along() once for each group, which generates a sequential integer vector of length equal to the group size (i.e. number of rows in the group), which ave() maps back to the positions of the group rows in the original data object.
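A tiny illustration of the ave() and seq_along() combination described above, using a hypothetical grouping vector whose groups are interleaved so the "mapping back" to original positions is visible:

```r
# Hypothetical group labels; the groups are interleaved on purpose.
grp <- c("A", "B", "A", "B", "A", "C")

# seq_along() runs once per group; ave() writes each group's 1..n sequence
# back into the positions that group occupies in the original vector.
mnum <- ave(seq_along(grp), grp, FUN = seq_along)
print(mnum)
```

Here the three "A" rows receive 1, 2, 3 in order of appearance, the two "B" rows receive 1, 2, and the single "C" row receives 1.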
The mdata object is a list of data.frames, where each component represents one month number. The actual extraction of the rows with a particular month number is done with a simple logical index operation:
data[mnum==x,]
where x is the mnum element, iterated over 1:max(mnum) by lapply(). The suffixing of non-key column names is done using setNames(), deriving the replacement column names as follows:
ifelse(names(data)%in%keys,names(data),paste0(names(data),'.Month',x))
The above leaves the names of key columns untouched, but appends '.Month{mnum}' to the names of all non-key columns.
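The name derivation can be tried on a plain character vector. This sketch uses a hypothetical subset of the column names and a fixed month number x = 2:

```r
keys <- c("SYMBOL", "TIMESTAMP")
cols <- c("SYMBOL", "EXPIRY_DT", "OPEN", "TIMESTAMP")  # hypothetical subset of names(data)
x <- 2                                                  # month number being processed

# Key columns keep their names; every other column gets the '.Month2' suffix.
new_names <- ifelse(cols %in% keys, cols, paste0(cols, ".Month", x))
print(new_names)
```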
Finally, all month-number splits must be merged back into one data.frame. I thought I'd be able to use a single call to merge() (possibly with a little help from do.call()) to do this, but was disappointed to discover that it only takes two arguments to merge, x and y (also see Simultaneously merge multiple data.frames in a list). Thus, I needed to call Reduce() to achieve the repeated calls. The all=T argument would be important if your different symbols had different numbers of expiry dates; then "short" symbols would not be represented on the RHS of the final merge(s), and thus would be dropped, if all=T was not passed.
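The effect of all=T can be seen on a hypothetical case where symbol "B" has only one expiry date while "A" has three. The three small data.frames below stand in for the per-month splits; their names and values are made up for illustration.

```r
keys <- "SYMBOL"

# Hypothetical per-month splits: "B" only appears in the first month.
m1 <- data.frame(SYMBOL = c("A", "B"), CLOSE.Month1 = c(10, 20), stringsAsFactors = FALSE)
m2 <- data.frame(SYMBOL = "A", CLOSE.Month2 = 11, stringsAsFactors = FALSE)
m3 <- data.frame(SYMBOL = "A", CLOSE.Month3 = 12, stringsAsFactors = FALSE)
splits <- list(m1, m2, m3)

# Full outer join at every step: "B" is kept, with NA for the missing months.
with_all <- Reduce(function(x, y) merge(x, y, by = keys, all = TRUE), splits)

# Default inner join: "B" is dropped at the first merge.
without_all <- Reduce(function(x, y) merge(x, y, by = keys), splits)

print(with_all)
print(without_all)
```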
My output doesn't exactly match your sample output. Here are the discrepancies:
- The format of the TIMESTAMP column in your sample output differs from what it was in the input; for example, 10-APR-2015 changed to 10-Apr-15. My code does not touch the format of TIMESTAMP.
- Your sample output omits the EXPIRY_DT columns, which my solution retains under their suffixed EXPIRY_DT.Month1, EXPIRY_DT.Month2, etc. names. You can remove those columns afterward with grep() on names() and negative indexing, if so desired.
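Removing the suffixed EXPIRY_DT columns with grep() on names() and negative indexing might look like the following. It is a sketch on a hypothetical stand-in for res; the length check is my addition, because negative indexing with an empty match would otherwise drop every column.

```r
# Hypothetical stand-in for the merged result.
res <- data.frame(
  SYMBOL           = c("A", "B"),
  EXPIRY_DT.Month1 = c("30-Apr-2015", "30-Apr-2015"),
  OPEN.Month1      = c(1750, 1627.5),
  EXPIRY_DT.Month2 = c("28-May-2015", "28-May-2015"),
  OPEN.Month2      = c(1789, 1653.3),
  stringsAsFactors = FALSE
)

# Positions of all columns whose name starts with 'EXPIRY_DT.Month'.
drop <- grep("^EXPIRY_DT\\.Month", names(res))

# Guard against an empty match: res[-integer(0)] would select no columns.
trimmed <- if (length(drop) > 0) res[-drop] else res
print(names(trimmed))
```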