Combine data frame rows in R based on multiple columns

I have a data frame in R which has one individual per line. Sometimes, individuals appear on two lines, and I would like to combine these lines based on the duplicated ID.

The problem is, each individual has multiple IDs, and when an ID appears twice, it does not necessarily appear in the same column.

Here is an example data frame:

dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog'),
                  b = c('feline', 'puppy', 'meower', 'wolf'),
                  c = c('kitten', 'barker', 'kitty', 'canine'),
                  d = c('shorthair', 'collie', '', ''),
                  e = c(1, 5, 3, 8))

> dat
       a      b      c         d e
1    cat feline kitten shorthair 1
2 canine  puppy barker    collie 5
3 feline meower  kitty           3
4    dog   wolf canine           8

So rows 1 and 3 should be combined, because ID b of row 1 equals ID a of row 3. Similarly, ID a of row 2 equals ID c of row 4, so those rows should be combined as well.

Ideally, the output should look like this.

     a.1    b.1    c.1       d.1 e.1    a.2    b.2    c.2 d.2 e.2
1    cat feline kitten shorthair   1 feline meower  kitty       3
2 canine  puppy barker    collie   5    dog   wolf canine       8

(Note that the rows were not combined based on sharing IDs that are empty strings.)
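
To make the matching rule concrete, here is a minimal base R sketch that lists which non-empty IDs occur in more than one row. It assumes that columns a to d are the ID columns and that e is ordinary data; the names `id_cols`, `long`, `rows_by_id` and `shared` are only illustrative.

```r
# Example data from the question; stringsAsFactors = FALSE keeps IDs as character
dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog'),
                  b = c('feline', 'puppy', 'meower', 'wolf'),
                  c = c('kitten', 'barker', 'kitty', 'canine'),
                  d = c('shorthair', 'collie', '', ''),
                  e = c(1, 5, 3, 8),
                  stringsAsFactors = FALSE)

id_cols <- c('a', 'b', 'c', 'd')  # assumption: these four columns hold the IDs

# One (row, id) pair per cell; unlist() walks the columns in order,
# so the row index simply repeats once per column
long <- data.frame(row = rep(seq_len(nrow(dat)), times = length(id_cols)),
                   id  = unlist(dat[id_cols], use.names = FALSE),
                   stringsAsFactors = FALSE)

# Drop empty or whitespace-only IDs so they never link two rows
long <- long[nzchar(trimws(long$id)), ]

# Keep only IDs that occur in more than one distinct row
rows_by_id <- lapply(split(long$row, long$id), unique)
shared <- rows_by_id[lengths(rows_by_id) > 1]
shared
```

For the example data this should report 'canine' (rows 2 and 4) and 'feline' (rows 1 and 3), which are the two pairs described above.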

My thoughts on how this could be done are below, but I'm pretty sure that I've been headed down the wrong path, so they're probably not helpful in solving the problem.

I thought that I could assign a row ID to each row, then melt the data. After that, I could go through it row by row. When I found a row where one of the IDs matched an earlier row (e.g. when one of the row 3 IDs matches one of the row 1 IDs), I would change every instance of the current row's row ID to match the earlier row ID (e.g. all row IDs of 3 would be changed to 1).

Here's the code I've been using:

dat$row.id <- 1:nrow(dat)
library(reshape2)
dat.melt <- melt(dat, id.vars = c('e', 'row.id'))
for (i in 2:nrow(dat.melt)) {
  # This next step is just to ignore the empty values
  if (grepl('^[[:space:]]*$', dat.melt$value[i])) {
    next
  }
  earlier.instance <- dat.melt$row.id[which(dat.melt$value[1:(i-1)] == dat.melt$value[i])]
  if (length(earlier.instance) > 0) {
    earlier.row.id <- earlier.instance[1]
    dat.melt$row.id[dat.melt$row.id == dat.melt$row.id[i]] <- earlier.row.id
  }
}

There are two problems with this approach.

1. It could be that an ID in row 3 matches row 1, and a different ID in row 5 matches row 3. In this case, the row IDs for both row 3 and row 5 should be changed to 1. This means that it's important to go through the rows sequentially, which has led me to use a for loop rather than an apply function. I know that this is not very R-like, and it is very slow on the large data frame I am working with.
2. This code produces the output below. There are now multiple rows with identical row.id and variable, so I don't know how to cast it in order to get the kind of output I showed above. With these duplicates, dcast would be forced to fall back on an aggregation function.

Output:

   e row.id variable     value
1  1      3        a       cat
2  5      2        a    canine
3  3      3        a    feline
4  8      2        a       dog
5  1      3        b    feline
6  5      2        b     puppy
7  3      3        b    meower
8  8      2        b      wolf
9  1      3        c    kitten
10 5      2        c    barker
11 3      3        c     kitty
12 8      2        c    canine
13 1      3        d shorthair
14 5      2        d    collie
15 3      3        d          
16 8      2        d
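
The chained-match problem (row 5 matches row 3, which matches row 1) can also be treated as finding connected groups of rows. Below is a rough base R sketch using a small union-find; it is not the loop above, just one possible alternative. The helper `group_rows()` is hypothetical, and the fifth row ('tabby', 'kitty', 'mouser') is invented here purely to create a chain through row 3.

```r
# Question's data plus one made-up row (row 5) that shares 'kitty' with row 3
dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog', 'tabby'),
                  b = c('feline', 'puppy', 'meower', 'wolf', 'kitty'),
                  c = c('kitten', 'barker', 'kitty', 'canine', 'mouser'),
                  d = c('shorthair', 'collie', '', '', ''),
                  e = c(1, 5, 3, 8, 9),
                  stringsAsFactors = FALSE)

# Hypothetical helper: returns, for every row, the smallest row number
# in its group, where rows are grouped if they share any non-empty ID
group_rows <- function(dat, id_cols) {
  n <- nrow(dat)
  parent <- seq_len(n)  # every row starts as its own group

  # Follow parent links until reaching a row that is its own parent
  find <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  long <- data.frame(row = rep(seq_len(n), times = length(id_cols)),
                     id  = unlist(lapply(dat[id_cols], as.character),
                                  use.names = FALSE),
                     stringsAsFactors = FALSE)
  long <- long[!is.na(long$id) & nzchar(trimws(long$id)), ]

  # For each ID, merge the groups of all rows that carry it
  for (rows in split(long$row, long$id)) {
    roots <- vapply(rows, find, integer(1))
    parent[roots] <- min(roots)
  }

  vapply(seq_len(n), find, integer(1))
}

dat$group <- group_rows(dat, c('a', 'b', 'c', 'd'))
dat$group
```

Rows 1, 3 and 5 should end up in group 1 and rows 2 and 4 in group 2, regardless of the order in which the matches are discovered. The work is driven by the distinct IDs rather than by comparing each row to all earlier ones, which may help with larger data, although I have not benchmarked it.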


asked Sep 13 '16 14:09

njc

1 Answer

Here is an amateur attempt. I think it does some of what you want. I have expanded the data.frame (now a data.table) by two rows to give a better example.

This loop creates a new column, dat$FirstMatchingID, that contains the value of dat$e from the earliest matching row. I've only matched on the first column, dat$a, but I think it could be expanded to b and c easily enough.

library(data.table)

dat <- data.table(a = c('cat', 'canine', 'feline', 'dog', 'feline','puppy'),
                  b = c('feline', 'puppy', 'meower', 'wolf', 'kitten', 'dog'),
                  c = c('kitten', 'barker', 'kitty', 'canine', 'cat','wolf'),
                  d = c('shorthair', 'collie', '', '','',''),
                  e = c(1, 5, 3, 8, 4, 6))

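# Concatenate the three ID columns so each row's IDs can be searched at once
dat[, All := paste(a, b, c)]

# Pre-create the result column so rows without a match stay NA
dat[, FirstMatchingID := NA_real_]

for (i in 2:nrow(dat)) {
  # Which earlier rows (1 to i-1) contain this row's `a` value among their IDs?
  # Note: grepl() matches substrings, so 'cat' would also match 'catfish'
  hits <- which(grepl(dat$a[i], dat$All[seq_len(i - 1)], fixed = TRUE))
  if (length(hits) > 0) {
    # Record column e of the earliest matching row
    dat[i, FirstMatchingID := dat$e[hits[1]]]
  }
}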

The result:

        a      b      c         d e                 All FirstMatchingID
1:    cat feline kitten shorthair 1   cat feline kitten              NA
2: canine  puppy barker    collie 5 canine puppy barker              NA
3: feline meower  kitty           3 feline meower kitty               1
4:    dog   wolf canine           8     dog wolf canine              NA
5: feline kitten    cat           4   feline kitten cat               1
6:  puppy    dog   wolf           6      puppy dog wolf               5
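
Extending the match to b and c could look roughly like the sketch below. It is written in base R rather than data.table to keep it self-contained, and it compares whole ID values with %in% instead of searching a pasted string. The names `id_cols`, `ids` and `first_match` are just for illustration.

```r
# Same six rows as above, as a plain data.frame
dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog', 'feline', 'puppy'),
                  b = c('feline', 'puppy', 'meower', 'wolf', 'kitten', 'dog'),
                  c = c('kitten', 'barker', 'kitty', 'canine', 'cat', 'wolf'),
                  d = c('shorthair', 'collie', '', '', '', ''),
                  e = c(1, 5, 3, 8, 4, 6),
                  stringsAsFactors = FALSE)

id_cols <- c('a', 'b', 'c')  # columns to match on

# Non-empty IDs of each row, stored as a list of character vectors
ids <- lapply(seq_len(nrow(dat)), function(i) {
  v <- unlist(dat[i, id_cols], use.names = FALSE)
  v[nzchar(v)]
})

first_match <- rep(NA_real_, nrow(dat))
for (i in seq_len(nrow(dat))[-1]) {
  # TRUE for every earlier row that shares at least one ID with row i
  shares <- vapply(ids[seq_len(i - 1)],
                   function(prev) any(ids[[i]] %in% prev),
                   logical(1))
  hit <- which(shares)
  # Keep column e of the earliest such row, if there is one
  if (length(hit) > 0) first_match[i] <- dat$e[hit[1]]
}
dat$FirstMatchingID <- first_match
dat
```

With all three columns considered, row 4 (dog, wolf, canine) should now pick up a match with row 2 through 'canine', which the a-only loop misses. Like the loop above, this only looks at direct matches with earlier rows, so chains of matches are not followed.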

You then have to find out how you want to combine the rows to get your desired result, but hopefully this helps!
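
For the combining step itself, one option might be base R's reshape(). The sketch below uses the question's original four rows and assumes the group labels are already known (rows 1 and 3 together, rows 2 and 4 together); the `group` and `pos` columns are helper names made up for this example.

```r
dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog'),
                  b = c('feline', 'puppy', 'meower', 'wolf'),
                  c = c('kitten', 'barker', 'kitty', 'canine'),
                  d = c('shorthair', 'collie', '', ''),
                  e = c(1, 5, 3, 8),
                  stringsAsFactors = FALSE)

# Assumption: the grouping has already been worked out somehow
dat$group <- c(1, 2, 1, 2)

# Position of each row within its group (1 = first row seen, 2 = second, ...)
dat$pos <- ave(seq_len(nrow(dat)), dat$group, FUN = seq_along)

# One output row per group; every other column gets a .1, .2, ... suffix
wide <- reshape(dat, idvar = 'group', timevar = 'pos', direction = 'wide')
wide
```

Because each row gets its own position within the group, no aggregation function is needed. If a group has more rows than another, the smaller group should simply get NA in the extra columns.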

answered Oct 16 '22 19:10

moman822

Tags: dataframe, r, reshape2