
mclapply long vectors not supported yet

Tags:

r

I'm trying to run some R code and it is crashing because of memory. The error that I get is:

Error in sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE)) : 
  long vectors not supported yet: memory.c:3100

The function causing the trouble is the following:

StationUserX <- function(userNDX){
  lat1 = deg2rad(geolocation$latitude[userNDX])
  long1 = deg2rad(geolocation$longitude[userNDX])
  session_user_id = as.character(geolocation$session_user_id[userNDX])
  #Find closest station
  Distance2Stations <- unlist(lapply(stationNDXs, Distance2StationX, lat1, long1))
  # Return index for closest station and distance to closest station
  stations_userX = data.frame(session_user_id = session_user_id, 
                              station = ghcndstations$ID[stationNDXs], 
                              Distance2Station = Distance2Stations)    
  stations_userX = stations_userX[with(stations_userX, order(Distance2Station)), ]
  stations_userX = stations_userX[1:100,] #only the 100 closest stations...
  row.names(stations_userX)<-NULL
  return(stations_userX)
}

I run this function 50k times using mclapply. Each call to StationUserX in turn calls Distance2StationX 90k times.

Is there an obvious way to optimize the function StationUserX ?

Ignacio asked Apr 22 '14 22:04


1 Answer

mclapply is having trouble sending all the data back from the worker processes to the main process. (mclapply forks child processes; it does not use threads.) That's because of prescheduling, where each worker runs a large number of iterations and only then sends all of its results back at once. That's nice and fast, but here it results in more than 2 GB of data being sent back by a single worker, which it can't do.

Run mclapply with mc.preschedule = FALSE to turn off prescheduling. Now each iteration is forked into its own process and returns its own data. It won't go quite as fast, but it gets around the problem.
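A minimal sketch of what that call might look like. The function `toy_station_user` and the input `1:20` are hypothetical stand-ins for StationUserX and the real user indices, and `mc.cores = 2` is an arbitrary choice; only `mclapply`, `mc.cores` and `mc.preschedule` come from the parallel package itself.

```r
library(parallel)

# Hypothetical stand-in for StationUserX: returns a small data frame per index.
toy_station_user <- function(userNDX) {
  data.frame(session_user_id = as.character(userNDX),
             Distance2Station = sqrt(userNDX),
             stringsAsFactors = FALSE)
}

# Forking is unavailable on Windows, where mclapply only accepts mc.cores = 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# With mc.preschedule = FALSE each element is handled by its own forked
# process, so each result is sent back to the parent separately.
results <- mclapply(1:20, toy_station_user,
                    mc.cores = n_cores,
                    mc.preschedule = FALSE)

# Combine the per-user data frames in the parent process.
all_users <- do.call(rbind, results)
```

Each returned element here is tiny. In the original problem the point is the same: every worker sends back the result of a single StationUserX call, not the accumulated results of thousands of calls.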

Stan answered Sep 20 '22 18:09