
Merging dataframes in R on a pre-sorted column?

Tags: merge, dataframe, r

I usually work with big dataframes that are pretty well sorted (or can be easily sorted).

Given two dataframes, both sorted by 'user':

some.data <user> <data_1> <data_2> 
user <user> <user_attr_1> <user_attr_2>

When I run m = merge(some.data,user), I get the result:

m = <user> <data_1> <data_2> <user_attr_1> <user_attr_2>

This is fine so far.
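
For concreteness, here is a minimal sketch of that setup. The column names follow the question; the values are made up purely for illustration.

```r
# Hypothetical toy data: column names follow the question, values are invented.
some.data <- data.frame(
  user   = c(1, 1, 2, 3),
  data_1 = c(10, 11, 20, 30),
  data_2 = c("a", "b", "c", "d")
)
user <- data.frame(
  user        = c(1, 2, 3),
  user_attr_1 = c("x", "y", "z"),
  user_attr_2 = c(TRUE, FALSE, TRUE)
)

# With no 'by' argument, merge() joins on the columns both data frames share
# (here only 'user').
m <- merge(some.data, user)
names(m)
```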

However, merge doesn't take advantage of the dataframes already being sorted on the common column, which makes the merge quite CPU- and memory-heavy. With sorted inputs, the merge could be done in O(n).
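
To illustrate the kind of linear-time merge meant here, below is a rough sketch of a two-pointer merge join. It assumes both data frames are sorted ascending on the key and that the second one has at most one row per key. The function name `sorted_merge` and its arguments are hypothetical, not part of base R. An explicit R-level loop like this may well be slower in practice than merge's compiled code, so it only shows the algorithm.

```r
# Sketch of an inner merge join in O(n + m) for inputs sorted on 'key'.
# Assumes y has unique key values; sorted_merge is a hypothetical helper.
sorted_merge <- function(x, y, key) {
  kx <- x[[key]]
  ky <- y[[key]]
  idx <- integer(length(kx))  # matching row of y for each row of x (0 = none)
  j <- 1L
  for (i in seq_along(kx)) {
    # Advance the pointer into y until its key is no longer smaller.
    while (j <= length(ky) && ky[j] < kx[i]) j <- j + 1L
    if (j <= length(ky) && ky[j] == kx[i]) idx[i] <- j
  }
  keep <- idx > 0L
  out <- cbind(
    x[keep, , drop = FALSE],
    y[idx[keep], setdiff(names(y), key), drop = FALSE]
  )
  rownames(out) <- NULL
  out
}

# Toy usage with invented values.
a <- data.frame(user = c(1, 1, 2, 4), data_1 = c(10, 11, 20, 40))
b <- data.frame(user = c(1, 2, 3), user_attr_1 = c("x", "y", "z"))
res <- sorted_merge(a, b, "user")
```

Because the pointer into the second data frame never moves backwards, each row of either input is visited a bounded number of times.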

Is there a way in R to perform an efficient merge on sorted datasets?

zoltanctoth asked Oct 28 '11 12:10



1 Answer

I don't have any experience with it, but as far as I know, this is one of the issues that the data.table package was designed to address.

For most practical purposes, data.table = data.frame + index. As a consequence, when used correctly, it improves the performance of quite a few operations on large data.

There is a risk that turning your data.frame into a data.table (i.e. adding the index) could take some time (although I expect this to be well optimized), but once it is set up, functions like merge can use the index for better performance.
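
A minimal sketch of what this could look like, assuming the data.table package is installed and that the "index" above corresponds to a key set with `setkeyv()`. The toy values are invented. No timing claims are made here.

```r
library(data.table)

# Hypothetical toy data, with column names taken from the question.
some.data <- data.table(user = c(1, 1, 2, 3), data_1 = c(10, 11, 20, 30))
user      <- data.table(user = c(1, 2, 3), user_attr_1 = c("x", "y", "z"))

# setkeyv() sorts each table by 'user' in place and marks it as keyed.
setkeyv(some.data, "user")
setkeyv(user, "user")

# merge() dispatches to data.table's own method; 'by' is given explicitly.
m <- merge(some.data, user, by = "user")
```

If the data is already sorted, setting the key should mostly amount to verifying the order, but it is worth measuring on your own data.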

Nick Sabbe answered Nov 01 '22 21:11