Thank you for taking the time to read this.
I have a 30 GB csv file of 6 million records and 3000 columns (mostly categorical data). I want to bootstrap subsamples for multinomial regression, but it's proving difficult: even with 64 GB of RAM in my machine and twice that in swap, the process becomes extremely slow and halts.
I'm thinking about generating subsample indices in R and feeding them into a system command using sed or awk, but I'm not sure how to wire this up. If someone knows of a clean way to do it using just R commands, I would be really grateful.
One complication is that I need to pick complete observations: all the rows belonging to a particular multinomial observation must stay together, and observations are not all the same length. I plan to use glmnet and then some transforms to get an approximation to the multinomial case. Another open question is how to choose a sample size that fits within my memory limits.
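For the awk route, the rough shape I have in mind is something like this (untested; it assumes the first column of the csv is an observation id, and "data.csv"/"ids.txt" are made-up names):
# read the distinct observation ids once, outside R, via cut/sort
ids <- system("cut -d, -f1 data.csv | tail -n +2 | sort -u", intern = TRUE)
# bootstrap a subsample of whole observations (with replacement)
picked <- sample(ids, 50000, replace = TRUE)
# let awk pull every row of the chosen observations, keeping each observation intact
writeLines(unique(picked), "ids.txt")
system("awk -F, 'NR==FNR {keep[$1]; next} FNR==1 || ($1 in keep)' ids.txt data.csv > subsample.csv")
subsample <- read.csv("subsample.csv")
One wrinkle: the awk step deduplicates the ids, so resampling with replacement would need observation weights (e.g. glmnet's weights argument) to reflect multiplicities.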
Appreciate your thoughts greatly.
R.version
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 2
minor 15.1
year 2012
month 06
day 22
svn rev 59600
language R
version.string R version 2.15.1 (2012-06-22)
nickname Roasted Marshmallows
Yoda
As themel has pointed out, R is very slow at reading csv files.
If you have SQLite available, that really is the best approach, since it sounds like the data mining is not a one-off job but will happen over multiple sessions, in multiple ways.
Let's look at the options we have:
Doing this in R is about 20 times slower than a tool written in C (on my machine). This is very slow:
read.csv(file = 'filename.csv', header = TRUE, sep = ",")
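If you do stick with read.csv, telling it the column classes up front avoids the type-guessing pass and cuts the memory churn; a rough sketch (the all-factor colClasses is a guess based on your mostly-categorical description):
cls <- rep("factor", 3000)   # guess: adjust to your actual column types
df <- read.csv(file = 'filename.csv', header = TRUE, sep = ",",
               colClasses = cls, nrows = 6e6, comment.char = "")
Specifying nrows (even a mild over-estimate) also helps R allocate memory in one go.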
Not that great, but it should work (I have never tried it on a 30 GB file, so I cannot say for sure): convert the csv to Stata's .dta format, using the format reference at http://www.stata.com/help.cgi?dta, the read/write code from https://svn.r-project.org/R-packages/trunk/foreign/src/stataread.c, and http://sourceforge.net/projects/libcsv/ to parse the csv. (This has been done in the past; however, I have not used it myself, so I do not know how well it performs.)
Then, using the foreign package (http://cran.r-project.org/web/packages/foreign/index.html), a simple
library(foreign)
whatever <- read.dta("file.dta")
would load your data
From a MySQL console:
LOAD DATA LOCAL INFILE 'file.csv' INTO TABLE my_table
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- only if the csv file contains a header row
Or, from the shell:
mysql -e "LOAD DATA INFILE 'ls.dat' INTO TABLE mytable1" mydatabase
Then play from the R console, using RMySQL, the R interface to the MySQL database (http://cran.r-project.org/web/packages/RMySQL/index.html):
install.packages('RMySQL')
Then play around, for example:
mydb <- dbConnect(MySQL(), user = username, password = userpass, dbname = databasename, host = host)
dbListTables(mydb)
record <- dbSendQuery(mydb, "select * from whatever")
mydata <- fetch(record, n = -1)   # n = -1 fetches all remaining rows
dbClearResult(record)
dbDisconnect(mydb)
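The nice part is that you never have to pull the whole table into R. For your bootstrap subsamples, something like this should work (the table name "whatever" and id column "obsid" are assumptions, with obsid taken to be numeric):
# sample whole observations in R, then fetch only their rows from MySQL
ids <- dbGetQuery(mydb, "select distinct obsid from whatever")$obsid
picked <- sample(ids, 50000, replace = TRUE)
sub <- dbGetQuery(mydb, sprintf("select * from whatever where obsid in (%s)",
                                paste(unique(picked), collapse = ",")))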
Download it from https://code.google.com/p/sqldf/ if you do not have the package,
or check it out with svn: svn checkout http://sqldf.googlecode.com/svn/trunk/ sqldf-read-only
From the R console,
install.packages("sqldf")
# shows built in data frames
data()
# load sqldf into workspace
library(sqldf)
MyCsvFile <- file("file.csv")
Mydataframe <- sqldf("select * from MyCsvFile", dbname = "MyDatabase", file.format = list(header = TRUE, row.names = FALSE))
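You can also query just a random chunk instead of the whole file, for example (this samples raw rows, so to keep your multinomial observations intact you would still want to filter on an observation id instead):
Mysub <- sqldf("select * from MyCsvFile order by random() limit 100000",
               dbname = "MyDatabase",
               file.format = list(header = TRUE, row.names = FALSE))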
And off you go!
Personally, I would recommend the library(sqldf) option :-)