Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Import Large Unusual File To R

First time poster here, so I'll try and make myself as clear as possible on the help I need. I'm fairly new to R, and this is my first real independent programming experience.

I have stock tick data for about 2.5 years, each day has its own file. The files are .txt and consist of approximately 20-30 million rows, and averaging I guess 360mb each. I am working one file at a time for now. I don't need all the data these files contain, and I was hoping that I could use the programming to minimize my files a bit.

Now my problem is that I am having some difficulties with writing the proper code so R understands what I need it to do.

Let me first show you some of the data so you can get an idea of the formatting.

M977
R 64266NRE1VEW107 FI0009653869 2EURXHEL 630 1
R 64516SSA0B 80SHB SE0002798108 8SEKXSTO 40 1
R 645730BBREEW750 FR0010734145 8EURXHEL 640 1
R 64655OXS1C 900SWE SE0002800136 8SEKXSTO 40 1
R 64663OXS1P 450SWE SE0002800219 8SEKXSTO 40 1
R 64801SSIEGV LU0362355355 11EURXCSE 160 1
M978

Another snip of data:

M732
D 3547742
A 3551497B 200000 67110 02800
D 3550806
D 3547743
A 3551498S 250000 69228 09900

So as you can see each line begins with a letter. Each letter denotes what the line means. For instance R means order book directory message, M means milliseconds after last second, H means stock trading action message. There are 14 different letters used in total.

I have used the readLines function to import the data into R. This however seems to take a very long time for R to process when I want to work with the data.

Now I would like to write some sort of If function that says if the first letter is R then from offset 1 to 4 the code means Market Segment Identifier etc., and have R add columns to these so I can work with the data in a more structured fashion.

What is the best way of importing such data, and also creating some form of structure - i.e. use unique ID information in the line of data to analyze 1 stock at a time for instance.

like image 218
Morten Avatar asked Jul 26 '12 12:07

Morten


1 Answers

You can try something like this :

options(stringsAsFactors = FALSE)

f_A <- function(line,tab_A){
  values <- unlist(strsplit(line," "))[2:5]
  rbind(tab_A,list(name_1=as.character(values[1]),name_2=as.numeric(values[2]),name_3=as.numeric(values[3]),name_4=as.numeric(values[4])))
}

tab_A <- data.frame(name_1=character(),name_2=numeric(),name_3=numeric(),name_4=numeric(),stringsAsFactors=F)

for(i in readLines(con="/home/data.txt")){
    switch(strsplit(x=i,split="")[[1]][1],M=cat("1\n"),R=cat("2\n"),D=cat("3\n"),A=(tab_A <- f_A(i,tab_A)))
}

And replace cat() by different functions that add values to each type of data.frame. Use the pattern of the function f_A() to construct others functions and same things for the table structure.

like image 142
Alan Avatar answered Oct 24 '22 05:10

Alan