
speeding up data frame matching

Tags: performance, r

I have two dataframes, much like these:

data <- data.frame(v = 1:12, h = rep(c(1, 2), 6), c = rep(c(1, 2, 3), 4))

lookup <- data.frame(h = rep(c(1, 2), each = 3), c = rep(c(1, 2, 3), 2), t = 21:26)

I want to subtract lookup$t from data$v where the h and c columns match.

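I thought something like this would work:

data$v - lookup$t[lookup$h == data$h & lookup$c == data$c]

but R doesn't magically know that I want to iterate implicitly over the rows of data; the comparisons just recycle the 6-row lookup columns against the 12-row data columns, element by element.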

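I ended up doing this:

myt <- numeric(nrow(data))  # preallocate instead of growing the vector
for (i in seq_len(nrow(data))) {
  # pick the t value whose (h, c) pair matches row i of data
  myt[i] <- lookup$t[lookup$h == data$h[i] & lookup$c == data$c[i]]
}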

which works fine, but I'm hoping someone can suggest a more sensible way that doesn't involve a loop.

asked Jan 15 '11 04:01 by ansate



2 Answers

Sounds like you could merge and then do the math:

dataLookedUp <- merge(data, lookup)
dataLookedUp$newValue <- with(dataLookedUp, v - t )

For your real data, is the merge and calc faster?
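
If the original row order of data matters (merge sorts its result by the key columns by default), a base R lookup with match() may be worth timing as well. This is only a sketch: it assumes each (h, c) pair appears exactly once in lookup, and the names key_data, key_lookup and myt are illustrative.

```r
# Example data, same shape as in the question
data <- data.frame(v = 1:12, h = rep(c(1, 2), 6), c = rep(c(1, 2, 3), 4))
lookup <- data.frame(h = rep(c(1, 2), each = 3), c = rep(c(1, 2, 3), 2), t = 21:26)

# Build one composite key per row from the two matching columns
key_data <- paste(data$h, data$c)
key_lookup <- paste(lookup$h, lookup$c)

# match() gives, for each row of data, the position of its key in lookup
# (NA where there is no matching pair)
myt <- lookup$t[match(key_data, key_lookup)]

# Subtract the looked-up value; rows stay in the original order of data
data$newValue <- data$v - myt
```

Unlike merge, this keeps data as it is and only adds one column, so nothing needs to be re-sorted afterwards.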

If data and/or lookup is really big, you might use the data.table package to set a key (an index) on h and c before the merge to speed it up.
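
A minimal sketch of that idea, assuming the data.table package is installed. The table names dt and lk are made up for the example, and whether it is actually faster depends on the size of the real data.

```r
library(data.table)

# Same example data as in the question, built as data.tables
dt <- data.table(v = 1:12, h = rep(c(1, 2), 6), c = rep(c(1, 2, 3), 4))
lk <- data.table(h = rep(c(1, 2), each = 3), c = rep(c(1, 2, 3), 2), t = 21:26)

# Key both tables on the matching columns; setkey() sorts them by reference
setkey(dt, h, c)
setkey(lk, h, c)

# Join on the key columns, then do the math on the merged table
dataLookedUp <- merge(dt, lk, by = c("h", "c"))
dataLookedUp[, newValue := v - t]
```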

answered Sep 27 '22 22:09 by JD Long


An alternative that is (1) more familiar to those accustomed to SQL queries and (2) often faster than the standard merge is the sqldf package. (Note that on Mac OS X you'll probably want to install Tcl/Tk, on which sqldf depends.) As an added bonus, sqldf converts strings to factors automagically by default.

install.packages("sqldf")
library(sqldf)
data <- data.frame(v = 1:12, h = rep(c("one", "two"), 6), c = rep(c("one", "two", "three"), 4))
lookup <- data.frame(h = c(rep("one", 3), rep("two", 3)), c = rep(c("one", "two", "three"), 2), t =  21:26)
soln <- sqldf("select * from data inner join lookup using (h, c)")
soln <- transform(soln, v.minus.t = v - t)
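
The subtraction can also be done inside the query itself, which saves the separate transform step. This is a sketch assuming sqldf with its default SQLite backend; the names dat, lk, soln2 and v_minus_t are made up for the example.

```r
library(sqldf)

# Same example data as above, under illustrative names
dat <- data.frame(v = 1:12, h = rep(c("one", "two"), 6), c = rep(c("one", "two", "three"), 4))
lk <- data.frame(h = rep(c("one", "two"), each = 3), c = rep(c("one", "two", "three"), 2), t = 21:26)

# Join on (h, c) and compute the difference in the same SQL statement
soln2 <- sqldf("select v, h, c, t, v - t as v_minus_t
                from dat inner join lk using (h, c)")
```

The row order of the result is not guaranteed; add an order by clause if it matters.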

answered Sep 27 '22 22:09 by Michael P. Manti