First time poster here, so I'll try and make myself as clear as possible on the help I need. I'm fairly new to R, and this is my first real independent programming experience.
I have stock tick data for about 2.5 years, each day has its own file. The files are .txt and consist of approximately 20-30 million rows, and averaging I guess 360mb each. I am working one file at a time for now. I don't need all the data these files contain, and I was hoping that I could use the programming to minimize my files a bit.
Now my problem is that I am having some difficulties with writing the proper code so R understands what I need it to do.
Let me first show you some of the data so you can get an idea of the formatting.
M977
R 64266NRE1VEW107 FI0009653869 2EURXHEL 630 1
R 64516SSA0B 80SHB SE0002798108 8SEKXSTO 40 1
R 645730BBREEW750 FR0010734145 8EURXHEL 640 1
R 64655OXS1C 900SWE SE0002800136 8SEKXSTO 40 1
R 64663OXS1P 450SWE SE0002800219 8SEKXSTO 40 1
R 64801SSIEGV LU0362355355 11EURXCSE 160 1
M978
Another snip of data:
M732
D 3547742
A 3551497B 200000 67110 02800
D 3550806
D 3547743
A 3551498S 250000 69228 09900
So as you can see each line begins with a letter. Each letter denotes what the line means. For instance R
means order book directory message, M
means milliseconds after last second, H
means stock trading action message. There are 14 different letters used in total.
I have used the readLines
function to import the data into R. This however seems to take a very long time for R to process when I want to work with the data.
Now I would like to write some sort of If function that says if the first letter is R
then from offset 1 to 4 the code means Market Segment Identifier etc., and have R add columns to these so I can work with the data in a more structured fashion.
What is the best way of importing such data, and also creating some form of structure - i.e. use unique ID information in the line of data to analyze 1 stock at a time for instance.
You can try something like this :
options(stringsAsFactors = FALSE)
f_A <- function(line,tab_A){
values <- unlist(strsplit(line," "))[2:5]
rbind(tab_A,list(name_1=as.character(values[1]),name_2=as.numeric(values[2]),name_3=as.numeric(values[3]),name_4=as.numeric(values[4])))
}
tab_A <- data.frame(name_1=character(),name_2=numeric(),name_3=numeric(),name_4=numeric(),stringsAsFactors=F)
for(i in readLines(con="/home/data.txt")){
switch(strsplit(x=i,split="")[[1]][1],M=cat("1\n"),R=cat("2\n"),D=cat("3\n"),A=(tab_A <- f_A(i,tab_A)))
}
And replace cat()
by different functions that add values to each type of data.frame. Use the pattern of the function f_A()
to construct others functions and same things for the table structure.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With