I am trying to take a subset of a data frame, based on the occurrence of a value. This is best explained with the example given below. The question is closely related to: Selecting top finite number of rows for each unique value of a column in a data fame in R. However, I want to vary the number of items selected by the head() command.
#Sample data
input <- matrix( c( 1000001, 1000001, 1000001, 1000001, 1000001, 1000001,
                    1000002, 1000002, 1000002, 1000003, 1000003, 1000003,
                    100001, 100002, 100003, 100004, 100005, 100006,
                    100002, 100003, 100007, 100002, 100003, 100008,
                    "2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04",
                    "2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04",
                    "2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04" ),
                 ncol = 3 )
colnames(input) <- c( "Product" , "Something" ,"Date")
input <- as.data.frame(input)
input$Date <- as.Date(input[,"Date"], "%Y-%m-%d")
#Sort by date; I want to leave out the entries with the oldest dates.
input <- input[ with( input, order(Date)), ]
#Create number of items I want to select
table_input <- as.data.frame(table(input$Product))
table_input$twentyfive <- ceiling( table_input$Freq*0.25 )
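For the sample data above this works out as follows (Product 1000001 occurs six times, the other two products three times each):
table_input
#      Var1 Freq twentyfive
# 1 1000001    6          2
# 2 1000002    3          1
# 3 1000003    3          1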
#This next part is a very time-consuming method (I have 2 million rows and 90k different products)
first <- TRUE
for( i in table_input$Var1 ) {
  data_selected <- input[ input$Product == i , ]
  number <- table_input[ table_input$Var1 == i , ]$twentyfive
  head <- head( data_selected , number )
  if( first == FALSE ) {
    # rbind() in a loop re-copies the whole output on every iteration
    output <- rbind( output , head )
  } else {
    output <- head
  }
  first <- FALSE
}
Hoping that someone knows a better, more efficient way. I tried to use the split function from the answer here: Selecting top finite number of rows for each unique value of a column in a data fame in R to split on the products and then iterate over them and select the head(). However, the split function always runs out of memory (cannot allocate ..)
input_split <- split(input, input$Product) #Works here, but not in my problem.
So in the end my problem is that I want to select a different number of rows for each unique Product: here 2 items from 1000001 and 1 item each from 1000002 and 1000003.
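For reference, a loop-free base-R sketch of the same head()-per-Product selection (a minimal sketch, assuming input is still sorted by Date as above; ave() gives within-group row numbers without calling split() on the whole data frame):
# Row number and group size within each Product, computed in place
row_in_grp <- ave( seq_len( nrow(input) ) , input$Product , FUN = seq_along )
grp_size   <- ave( seq_len( nrow(input) ) , input$Product , FUN = length )
# Keep the first ceiling( 25% ) rows of each Product, as the loop above does
output <- input[ row_in_grp <= ceiling( grp_size * 0.25 ) , ]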
Two solutions spring to mind: plyr::ddply is designed for your needs, but data.table will be far faster. You want to take a data.frame, split it into chunks by Product, remove the bottom 25% of rows of each chunk (which is sorted by date), and recombine into a data.frame. This can be accomplished in one simple line...
require( plyr )
ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
# Product Something Date
#1 1000001 100005 2011-01-01
#2 1000001 100002 2011-01-02
#3 1000001 100006 2011-01-02
#4 1000001 100004 2011-01-04
#5 1000002 100007 2011-01-01
#6 1000002 100003 2011-01-04
#7 1000003 100002 2011-01-02
#8 1000003 100008 2011-01-04
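The anonymous function just drops the first ceiling( nrow(x) * 0.25 ) rows of each Product chunk; those are the oldest entries, since input was sorted by Date beforehand.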
data.table solution
For data.table you will need the latest development version from r-forge (due to the use of a negative subscript not being implemented in the CRAN version of data.table yet). Make sure you use the install.packages call below to get the latest version...
install.packages( "data.table" , repos="http://r-forge.r-project.org" )
require( data.table )
DT <- data.table( input )
# Sort by Product then Date very quickly
setkeyv( DT , c( "Product" , "Date" ) )
# Return the bottom 75% of rows (i.e. not the earliest)
DT[ , tail( .SD , -ceiling( nrow(.SD) * .25 ) ) , by = Product ]
# Product Something Date
#1: 1000001 100005 2011-01-01
#2: 1000001 100002 2011-01-02
#3: 1000001 100006 2011-01-02
#4: 1000001 100004 2011-01-04
#5: 1000002 100007 2011-01-01
#6: 1000002 100003 2011-01-04
#7: 1000003 100002 2011-01-02
#8: 1000003 100008 2011-01-04
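A negative n in tail() means "all but the first n rows", so this drops the earliest 25% of each group (the table is keyed by Product and then Date).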
You could more easily do this with data.table (so you don't require the development version)...
DT[ , .SD[ -c( 1:ceiling( .25 * .N ) ) ] , by = Product ]
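This returns the same rows: because DT is keyed by Product and Date, .SD arrives date-sorted within each group, and the negative index drops the earliest 25%.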
And you can also use lapply in the j argument (I was worried about my use of .SD), and this runs in ~14 seconds on a data.table of 2e6 rows with 90,000 products (groups)...
set.seed(1)
Product <- sample( 1:9e4 , 2e6 , replace = TRUE )  # ~90,000 distinct products
dates <- sample( 1:20 , 2e6 , replace = TRUE )
Date <- as.Date( Sys.Date() + dates )
DT <- data.table( Product = Product , Date = Date )
system.time( {
  setkeyv( DT , c( "Product" , "Date" ) )
  DT[ , lapply( .SD , `[` , -c( 1:ceiling( .25 * .N ) ) ) , by = Product ]
} )
# user system elapsed
# 14.65 0.03 14.74
So thanks to @Arun (who is now an author of the data.table package) we now have the best way to use data.table: use .I, which is an integer vector of all the row indices, subset it in [ by removing the first 25% of records with -(1:ceiling(.N*.25)), and then perform a subset using these row indices to get the final table. This is ~4-5 times faster than using my .SD method above. Amazing stuff!
system.time( DT[ DT[, .I[-(1:ceiling(.N*.25))] , by = Product]$V1] )
#  user  system elapsed
#  3.02    0.00    3.03