I have a csv/log file of 35=S (Quote messages; "Tag=Value") and I need to extract the rates into a proper CSV file for data mining. This is not strictly FIX related, it's more of a R related question on how to clean a dataset.
The raw messages look something like this:
190=1.1204 ,191=-0.000029,193=20141008,537=0 ,631=1.12029575,642=0.000145,10=56
190=7.20425,191=0.000141 ,537=0 ,631=7.2034485,10=140 , ,
190=1.26237,191=0 ,537=1 ,10=068 , , ,
I need first to get to an intermediate data set that looks like this, where the same tags are aligned.
190=1.1204 ,191=-0.000029,193=20141008,537=0,631=1.12029575,642=0.000145,10=56
190=7.20425,191=0.000141 , ,537=0,631=7.2034485 , ,10=140
190=1.26237,191=0 , ,537=1, , ,10=068
which in turn will need to be converted to this:
190 ,191 ,193 ,537,631 ,642 ,10
1.1204 ,-0.000029,20141008,0 ,1.12029575,0.000145,56
7.20425,0.000141 , ,0 ,7.2034485 , ,140
1.26237,0 , ,1 , , ,068
I'm in the midst of developing a bash script with awk but I wonder if I can do that in R. At present, my greatest challenge is arriving to the intermediate table. From the intermediate to the final table I thought of using the R with the tidyr package, specifically function 'separate'. If anybody can suggest a better logic, I'll greatly appreciate!
Another possibility. Start with same scan
as @Andrie, but also use arguments strip.white
and na.strings
:
x <- scan(text = "190=1.1204 ,191=-0.000029,193=20141008,537=0 ,631=1.12029575,642=0.000145,10=56
190=7.20425,191=0.000141 ,537=0 ,631=7.2034485,10=140 , ,
190=1.26237,191=0 ,537=1 ,10=068 , , ,",
sep = ",",
what = "character",
strip.white = TRUE,
na.strings = "")
# remove NA
x <- x[!is.na(x)]
Then use colsplit
and dcast
from reshape2
package:
library(reshape2)
# split 'x' into two columns
d1 <- colsplit(string = x, pattern = "=", names = c("x", "y"))
# create an id variable, needed in dcast
d1$id <- ave(d1$x, d1$x, FUN = seq_along)
# reshape from long to wide
d2 <- dcast(data = d1, id ~ x, value.var = "y")
# id 10 190 191 193 537 631 642
# 1 1 56 1.12040 -0.000029 20141008 0 1.120296 0.000145
# 2 2 140 7.20425 0.000141 NA 0 7.203449 NA
# 3 3 68 1.26237 0.000000 NA 1 NA NA
Because you mentioned tidyr
:
library(tidyr)
d1 <- separate(data = data.frame(x), col = x, into = c("x", "y"), sep = "=")
d1$id <- ave(d1$x, d1$x, FUN = seq_along)
spread(data = d1, key = x, value = y)
# id 10 190 191 193 537 631 642
# 1 1 56 1.1204 -0.000029 20141008 0 1.12029575 0.000145
# 2 2 140 7.20425 0.000141 <NA> 0 7.2034485 <NA>
# 3 3 068 1.26237 0 <NA> 1 <NA> <NA>
This retains the values as character
. If you want numeric
, you can set convert = TRUE
in spread
.
EDITED. Full solution using base R functions only:
dat <- scan(sep=",", what="character", text="190=1.1204 ,191=-0.000029,193=20141008,537=0 ,631=1.12029575,642=0.000145,10=56
190=7.20425,191=0.000141 ,537=0 ,631=7.2034485,10=140 , ,
190=1.26237,191=0 ,537=1 ,10=068 , , ,")
dat <- gsub(" ", "", dat)
dat <- dat[dat != ""]
x <- as.data.frame(
matrix(
unlist(
sapply(dat, strsplit, split = "=", USE.NAMES=FALSE)
),
ncol=2, byrow=TRUE
)
)
z <- unstack(x, V2 ~ V1)
The resulting object is a named list that is close to what you wanted. You will have to do some additional work to convert this into matrix, if required.
$`10`
[1] "56" "140" "068"
$`190`
[1] "1.1204" "7.20425" "1.26237"
$`191`
[1] "-0.000029" "0.000141" "0"
....
etc.
From here, you simply have to pad the list with the appropriate number of NA values:
maxLength <- max(sapply(z, length))
sapply(z, function(x)c(as.numeric(x), rep(NA, maxLength - length(x))))
Gives:
10 190 191 193 537 631 642
[1,] 56 1.12040 -0.000029 20141008 0 1.120296 0.000145
[2,] 140 7.20425 0.000141 NA 0 7.203449 NA
[3,] 68 1.26237 0.000000 NA 1 NA NA
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With