In R, merge two data frames and fill down the blanks

Tags: merge, r

Say I have these two data frames:

big.table <- data.frame("idx" = 1:100)

small.table <- data.frame("idx" = sample(1:100, 10), "color" = sample(colors(),10))

I want to merge them together like this:

merge(small.table, big.table, by = "idx", all.y=TRUE)

    idx           color
1     1            <NA>
2     2            <NA>
3     3         salmon2
4     4            <NA>
5     5            <NA>
6     6            <NA>
...
20   20            <NA>
21   21            <NA>
22   22           blue4
23   23          grey99
24   24            <NA>
25   25            <NA>
26   26            <NA>
...

Now I need to fill the values in the 'color' column down the table, so that each NA is replaced by the most recent non-NA value above it.

NOTES: The problem involves a log file generated by a computer program, not in any standard log format. Blocks of lines in this log file belong to a 'process' that is identified in the first line of the block. I've pulled the information out of the relevant lines of the log file, most of which belong to a process, and created a data table containing that information (the line number, time stamp, etc.). Now I need to add to this table the 'process' name for each line, taken from a small.table that holds the line number where each process starts.

There might not be a 'process' (color in the example above) for the lines at the top of the big.table. Those lines should remain NA.

Once the first 'process' starts, every line between that process start line and the next belongs to the first process. When the second process starts, every line between that process start line and the next process start line belongs to the second process. And so on. The process lines are never the same line number as the other lines that I've collected into my log file data frame.
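To make the rule concrete, here is a minimal sketch in base R. The line numbers and process names ("alpha", "beta") are made up for illustration, and it assumes the start lines are sorted in increasing order; findInterval() is just one way to express the lookup.

```r
# Hypothetical data: the line where each process starts, sorted by idx
starts <- data.frame(idx = c(3, 7), process = c("alpha", "beta"),
                     stringsAsFactors = FALSE)
# Hypothetical log lines that need a process name
log.lines <- data.frame(idx = c(1, 2, 4, 5, 8, 9))

# For each log line, count how many start lines are <= it.
# 0 means the line comes before the first process starts.
pos <- findInterval(log.lines$idx, starts$idx)
pos[pos == 0] <- NA  # indexing with NA yields NA, so these lines stay unlabeled

log.lines$process <- starts$process[pos]
log.lines
```

Lines 1 and 2 stay NA, lines 4 and 5 get "alpha", and lines 8 and 9 get "beta".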

My plan is to build big.table as a sequence of all log line numbers and merge the small table into it. Then I can "fill down" the process name and merge big.table with the log file data frame, keeping only the rows that appear in the log file data frame.

I'm open to other approaches.

brandco asked Feb 12 '13 23:02



2 Answers

It sounds like you need na.locf ("last observation carried forward") from the zoo package:

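library(zoo)
# all.y = TRUE keeps every row of big.table; unmatched rows get NA in color
tbl <- merge(small.table, big.table, by = "idx", all.y = TRUE)
# Carry the last non-NA color forward; na.rm = FALSE keeps the leading NAs
tbl$color2 <- na.locf(tbl$color, na.rm = FALSE)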
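
For readers who want to see what "carrying forward" does without installing zoo, the following is a rough base-R sketch of the same idea on a small made-up vector. It is an illustration only, not how zoo implements it; the result should match na.locf(x, na.rm = FALSE) for a plain character vector like this one.

```r
# Made-up example vector with leading and interior NAs
x <- c(NA, NA, "salmon2", NA, "blue4", "grey99", NA)

# Count of non-NA values seen so far; 0 until the first non-NA value
seen <- cumsum(!is.na(x))

# Prepend NA so that position 1 stands for "nothing seen yet",
# then look up the most recent non-NA value for every element
filled <- c(NA, x[!is.na(x)])[seen + 1]
filled
```

The two leading NAs are left alone, and every later NA takes the value directly above it.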
joran answered Sep 29 '22 15:09



A data.table solution:

library(data.table)
b <- data.table(big.table, key = "idx")
s <- data.table(small.table, key = "idx")
# Rolling join: each idx in b takes the color of the last idx in s at or before it
s[b, roll = TRUE]

#      idx          color
#   1:   1             NA
#   2:   2             NA
#   3:   3             NA
#   4:   4          blue3
#   5:   5          blue3
#   6:   6          blue3
#   7:   7          blue3
#   8:   8          blue3
#   9:   9          blue3
#  10:  10          blue3
#  11:  11   navajowhite1
#  12:  12   navajowhite1
#  . . . .
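
Because the tables above are built with sample(), the output differs from run to run. As a reproducible sketch of the same rolling join, here is a version with small fixed tables; the idx values and colors are picked arbitrarily for illustration.

```r
library(data.table)

# Hypothetical fixed inputs instead of sample()
s <- data.table(idx = c(3L, 7L), color = c("salmon2", "blue4"), key = "idx")
b <- data.table(idx = 1:9, key = "idx")

# For each idx in b, take the row of s with the largest idx <= it.
# Rows before the first idx in s have nothing to roll from and stay NA.
res <- s[b, roll = TRUE]
res
```

Rows 1 and 2 keep NA, rows 3 to 6 get "salmon2", and rows 7 to 9 get "blue4".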
Arun answered Sep 29 '22 15:09
