 

Breaking a list of character strings into partitions

Here is my problem. I have a dataset with 200k rows.

  • Each row corresponds to a test conducted on a subject.
  • Subjects have unequal number of tests.
  • Each test is dated.

I want to assign an index to each test, e.g. the first test of subject 1 would be 1, the second test of subject 1 would be 2, the first test of subject 2 would be 1 again, and so on.

My strategy is to get a list of unique Subject IDs, then use lapply to subset the dataset into a list of dataframes keyed by those IDs, so that each subject has his/her own dataframe of tests. Ideally I would then sort each subject's dataframe by date and assign an index to each test.
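In code, that strategy would look roughly like the sketch below (using the SubID and Dates columns from the dummy data further down); this is the loop that runs out of memory:

per.subject <- lapply(unique(Dataset$SubID), function(id) {
  d <- Dataset[Dataset$SubID == id, ]  # one subject's tests
  d <- d[order(as.Date(as.character(d$Dates), format = '%d.%m.%Y')), ]  # sort by date
  d$SubIndex <- seq_len(nrow(d))  # index the tests 1, 2, ...
  d
})
indexed <- do.call(rbind, per.subject)  # reassemble into one dataframe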

However, doing this on a 200k x 32 dataframe made my laptop (i5 Sandy Bridge, 4GB RAM) run out of memory quite quickly.

I have 2 questions:

  1. Is there a better way to do this?
  2. If there is not, my only option to overcome the memory limit is to break my unique SubjectID list into smaller sets, say 1000 SubjectIDs per list, lapply each one through the dataset, and join the results at the end. In that case, how do I write a function that breaks my SubjectID list by supplying an integer that denotes the number of partitions? E.g. BreakPartition(Dataset, 5) would break the dataset into 5 equal partitions (a sketch of what I mean follows this list).
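For question 2, here is a minimal sketch of the kind of helper I mean; the name BreakPartition and its n.parts argument are just placeholders:

BreakPartition <- function(ids, n.parts) {
  # split a vector of IDs into n.parts roughly equal-sized groups
  split(ids, cut(seq_along(ids), breaks = n.parts, labels = FALSE))
}

id.parts <- BreakPartition(UniqueSubjectID, 5)  # a list of 5 ID vectors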

Here is code to generate some dummy data:

UniqueSubjectID <- sapply(1:500, function(i) paste(letters[sample(1:26, 5, replace = TRUE)], collapse = ""))  # 500 random 5-letter IDs
UniqueSubjectID <- subset(UniqueSubjectID, !duplicated(UniqueSubjectID))  # drop any duplicates
Dataset <- data.frame(SubID = sample(sapply(1:500, function(i) paste(letters[sample(1:26, 5, replace = TRUE)], collapse = "")), 5000, replace = TRUE))
Dates <- sample(format(seq(ISOdate(2010, 1, 1), by = 'day', length = 365), format = '%d.%m.%Y'), 5000, replace = TRUE)  # random dates in 2010
Dataset <- cbind(Dataset, Dates)
asked May 16 '12 by R J


1 Answer

I would guess that the splitting/lapply is what is using up the memory. You should consider a more vectorized approach. Starting with a slightly modified version of your example code:

n <- 200000
UniqueSubjectID <- replicate(500, paste(letters[sample(26, 5, replace = TRUE)], collapse = ""))  # 500 random 5-letter IDs
UniqueSubjectID <- unique(UniqueSubjectID)
Dataset <- data.frame(SubID = sample(UniqueSubjectID, n, replace = TRUE))
Dataset$Dates <- sample(format(seq(ISOdate(2010, 1, 1), by = 'day', length = 365), format = '%d.%m.%Y'), n, replace = TRUE)

And assuming that what you want is an index counting each subject's tests in date order, you could do the following.

Dataset <- Dataset[order(Dataset$SubID, as.Date(Dataset$Dates, format = '%d.%m.%Y')), ]  # parse the dates so they sort chronologically, not as text
ids.rle <- rle(as.character(Dataset$SubID))    # run lengths of the now-contiguous IDs
Dataset$SubIndex <- sequence(ids.rle$lengths)  # 1..n within each run

Now the 'SubIndex' column in 'Dataset' contains a per-subject index of the tests, numbered in date order. This takes very little memory and runs in a few seconds on my 4GB Core 2 Duo laptop.
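An equivalent one-liner uses base R's ave(), which numbers the rows within each SubID group in their current (sorted) order:

Dataset$SubIndex <- ave(seq_len(nrow(Dataset)), Dataset$SubID, FUN = seq_along)  # 1..n per subject

Either way, the key point is that no per-subject list of dataframes is ever built, which is where the lapply approach loses its memory.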

answered Nov 11 '22 by leif