How do you extract a few random rows from a data.table on the fly

I have a large data.table (about 24,000 rows and growing). I want to subset that data.table on a couple of criteria and, from that subset (which ends up being about 3,000 rows), randomly sample just 4 rows. I do not want to create a named data.table of 3,000 or so rows, count its rows, and then sample by row number. How can I do it on the fly? Or should I just suck it up: create the table, sample from it, and then use rm() to get rid of it?

Let's simulate my issue:

require(data.table)
random.length  <-  sample(x = 15:30, size = 1)
data.table(city=sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"), size=random.length, replace = TRUE), score = sample(x=1:10, size = random.length, replace=TRUE))

That makes a table of random length, which simulates the fact that, depending on my criteria and my starting table, I do not know how many rows the subsetted table will have.

Now, if I just wanted the first three rows, I could do this:

data.table(city=sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"), size=random.length, replace = TRUE), score = sample(x=1:10, size = random.length, replace=TRUE))[1:3]

But let us say I did not want the first three rows but rather 3 random rows. Then I would want to do something like this:

data.table(city=sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"), size=random.length, replace = TRUE), score = sample(x=1:10, size = random.length, replace=TRUE))[sample(x = 1:<number of rows of that previous data.table>, size = 3)]

That will not work. How do I compute, on the fly, the number of rows of the initial data.table?

asked Jul 10 '14 20:07

Farrel

3 Answers

I have just made .N work in i. New README item:

.N is now available in i, FR#724. Thanks to newbie indirectly here and Farrel directly here.

This now works:

DT[...][...][sample(.N,3)]

e.g.

> random.length  <-  sample(x = 15:30, size = 1)
> data.table(city = sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"),size=random.length, replace = TRUE), score = sample(x=1:10, size = random.length, replace=TRUE))[sample(.N, 3)] 
         city score
1:   New York     4
2: Pittsburgh     3
3:  Cape Town     9
>
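
To tie this back to the original question (subset first, then sample), here is a minimal sketch. It assumes a data.table version that includes the change above; the table DT, its size, and the filter condition are hypothetical stand-ins for the real data and criteria.

```r
library(data.table)

set.seed(42)  # only so the sketch is reproducible

# Hypothetical stand-in for the large table described in the question
DT <- data.table(
  city  = sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"),
                 size = 24000, replace = TRUE),
  score = sample(1:10, size = 24000, replace = TRUE)
)

# The first [...] subsets on the criteria; the second samples 4 rows of that subset.
# Inside i, .N is the number of rows of the table being subset,
# so no named intermediate table is needed.
picked <- DT[city == "Tel Aviv" & score > 5][sample(.N, 4)]
print(picked)
```

If the subset might have fewer rows than requested, sample() will raise an error; something like sample(.N, min(.N, 4)) is one way to guard against that.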

answered Oct 22 '22 11:10

Matt Dowle

There is a two-step approach:

1. Compute the index i using .I
2. Sample on index i

Example code:

require(data.table)
random.length  <-  sample(x = 15:30, size = 1)
data.table(city = sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"),size=random.length, replace = TRUE), score = sample(x=1:10, size = random.length, replace=TRUE))[,i := .I][sample(i, 3)]
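
One thing to be aware of with the two-step approach: := adds the helper column i by reference, so it stays in the result. The sketch below shows this and a possible alternative that samples inside j with .SD, which needs no helper column. The names DT, res and res2 are only illustrative.

```r
library(data.table)

set.seed(1)  # only so the sketch is reproducible
random.length <- sample(x = 15:30, size = 1)
DT <- data.table(
  city  = sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"),
                 size = random.length, replace = TRUE),
  score = sample(x = 1:10, size = random.length, replace = TRUE)
)

# Two-step approach: store the row numbers in a column, then sample from it.
# copy() keeps := from modifying DT itself.
res <- copy(DT)[, i := .I][sample(i, 3)]
had_helper <- "i" %in% names(res)   # the helper column is part of the result
res[, i := NULL]                    # drop it again by reference

# Alternative without a helper column: .N is evaluated inside j
res2 <- DT[, .SD[sample(.N, 3)]]
print(res2)
```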

answered Oct 22 '22 10:10

djhurio

Another alternative would be to use an sapply approach.
For example:

  as.data.table(sapply(DT[], sample, 10))
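
Note that sapply calls sample once per column, so each column is shuffled independently and the values in a resulting row generally no longer come from the same original row. Columns of mixed types are also coerced to a common type. A small sketch to check this, using a hypothetical table in which label always matches id:

```r
library(data.table)

set.seed(7)  # only so the sketch is reproducible

# Hypothetical table: label is always "row" followed by id
DT <- data.table(id = 1:20, label = paste0("row", 1:20))

# Column-wise: sample() runs separately on id and on label
colwise <- as.data.table(sapply(DT, sample, 10))

# Row-wise: whole rows are picked, so id and label stay paired
rowwise <- DT[sample(.N, 10)]

str(colwise)   # id has been coerced to character
print(rowwise)
```

If the goal is to keep rows intact, sampling row numbers in i (as in rowwise above) is the safer choice.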

answered Oct 22 '22 10:10

Daniel

Tags:

r

data.table

sample