I have a data frame in R which has one individual per line. Sometimes, individuals appear on two lines, and I would like to combine these lines based on the duplicated ID.
The problem is, each individual has multiple IDs, and when an ID appears twice, it does not necessarily appear in the same column.
Here is an example data frame:
dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog'),
                  b = c('feline', 'puppy', 'meower', 'wolf'),
                  c = c('kitten', 'barker', 'kitty', 'canine'),
                  d = c('shorthair', 'collie', '', ''),
                  e = c(1, 5, 3, 8))
> dat
       a      b      c         d e
1    cat feline kitten shorthair 1
2 canine  puppy barker    collie 5
3 feline meower  kitty           3
4    dog   wolf canine           8
So rows 1 and 3 should be combined, because ID b of row 1 equals ID a of row 3. Similarly, ID a of row 2 equals ID c of row 4, so those rows should be combined as well.
Ideally, the output should look like this.
     a.1    b.1    c.1       d.1 e.1    a.2    b.2    c.2 d.2 e.2
1    cat feline kitten shorthair   1 feline meower  kitty       3
2 canine  puppy barker    collie   5    dog   wolf canine       8
(Note that the rows were not combined based on sharing IDs that are empty strings.)
My thoughts on how this could be done are below, but I'm pretty sure that I've been headed down the wrong path, so they're probably not helpful in solving the problem.
I thought that I could assign a row ID to each row, then melt the data. After that, I could go through it row by row. When I found a row where one of the IDs matched an earlier row (e.g. when one of the row 3 IDs matches one of the row 1 IDs), I would change every instance of the current row's row ID to match the earlier row ID (e.g. all row IDs of 3 would be changed to 1).
Here's the code I've been using:
dat$row.id <- 1:nrow(dat)
library(reshape2)
dat.melt <- melt(dat, id.vars = c('e', 'row.id'))
for (i in 2:nrow(dat.melt)) {
  # This next step is just to ignore the empty values
  if (grepl('^[[:space:]]*$', dat.melt$value[i])) {
    next
  }
  earlier.instance <- dat.melt$row.id[which(dat.melt$value[1:(i-1)] == dat.melt$value[i])]
  if (length(earlier.instance) > 0) {
    earlier.row.id <- earlier.instance[1]
    dat.melt$row.id[dat.melt$row.id == dat.melt$row.id[i]] <- earlier.row.id
  }
}
There are two problems with this approach.
After the loop, some combinations of row.id and variable have more than one value, so I don't know how to cast the data in order to get the kind of output I showed above. dcast would be forced to use an aggregation function here.
Output:
   e row.id variable     value
1  1      3        a       cat
2  5      2        a    canine
3  3      3        a    feline
4  8      2        a       dog
5  1      3        b    feline
6  5      2        b     puppy
7  3      3        b    meower
8  8      2        b      wolf
9  1      3        c    kitten
10 5      2        c    barker
11 3      3        c     kitty
12 8      2        c    canine
13 1      3        d shorthair
14 5      2        d    collie
15 3      3        d
16 8      2        d
Here is an amateur attempt. I think it does some of what you want. I have expanded the data.frame (now a data.table) by two rows to give a better example.
This loop creates a new column, dat$FirstMatchingID, that contains the ID from dat$e for the earliest match. I've only done it to match the first column, dat$a, but I think it could be expanded to b and c easily enough.
library(data.table)
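dat <- data.table(a = c('cat', 'canine', 'feline', 'dog', 'feline', 'puppy'),
                  b = c('feline', 'puppy', 'meower', 'wolf', 'kitten', 'dog'),
                  c = c('kitten', 'barker', 'kitty', 'canine', 'cat', 'wolf'),
                  d = c('shorthair', 'collie', '', '', '', ''),
                  e = c(1, 5, 3, 8, 4, 6))
# Paste the ID columns together so each row's IDs can be searched in one string
dat[, All := paste(a, b, c)]
for (i in 2:nrow(dat)) {
  # Earlier rows (1 to i-1) whose pasted IDs contain this row's `a` value
  hits <- which(grepl(dat$a[i], dat$All[seq_len(i - 1)], fixed = TRUE))
  # Store `e` from the earliest such row; rows without a match stay NA
  if (length(hits) > 0) dat[i, FirstMatchingID := dat$e[hits[1]]]
}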
The result:
        a      b      c         d e                 All FirstMatchingID
1:    cat feline kitten shorthair 1   cat feline kitten              NA
2: canine  puppy barker    collie 5 canine puppy barker              NA
3: feline meower  kitty           3 feline meower kitty               1
4:    dog   wolf canine           8     dog wolf canine              NA
5: feline kitten    cat           4   feline kitten cat               1
6:  puppy    dog   wolf           6      puppy dog wolf               5
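The loop above only checks column a and matches on substrings. As a rough sketch, the same idea could probably be extended to compare every ID column with exact matching while skipping empty strings. It uses the question's original four-row data in base R. The names id_cols, ids and group are made up for illustration, and two groups that only turn out to be connected by a later row are not merged here.

```r
# Example data from the question (IDs in a-d, payload in e)
dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog'),
                  b = c('feline', 'puppy', 'meower', 'wolf'),
                  c = c('kitten', 'barker', 'kitty', 'canine'),
                  d = c('shorthair', 'collie', '', ''),
                  e = c(1, 5, 3, 8),
                  stringsAsFactors = FALSE)

id_cols <- c('a', 'b', 'c', 'd')  # assumption: these columns hold the IDs

# Non-empty IDs of each row, as a list of character vectors
ids <- lapply(seq_len(nrow(dat)), function(i) {
  v <- unlist(dat[i, id_cols], use.names = FALSE)
  v[nzchar(trimws(v))]
})

# group[i] is the group of the earliest earlier row sharing an ID with row i;
# rows with no earlier match keep their own row number
group <- seq_len(nrow(dat))
for (i in seq_len(nrow(dat))[-1]) {
  for (j in seq_len(i - 1)) {
    if (length(intersect(ids[[i]], ids[[j]])) > 0) {
      group[i] <- group[j]
      break
    }
  }
}
dat$group <- group
print(dat$group)  # 1 2 1 2 for this example
```

Because the empty strings are dropped before comparing, rows 3 and 4 are not linked just because both have an empty d.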
You then have to find out how you want to combine the rows to get your desired result, but hopefully this helps!
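For the combining step, one possible approach is base R's reshape(). The sketch below assumes a grouping has already been worked out and hardcodes it as a hypothetical group column (rows 1 and 3 together, rows 2 and 4 together). The pos column is likewise just an illustrative helper that numbers the rows within each group.

```r
# Example data from the question
dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog'),
                  b = c('feline', 'puppy', 'meower', 'wolf'),
                  c = c('kitten', 'barker', 'kitty', 'canine'),
                  d = c('shorthair', 'collie', '', ''),
                  e = c(1, 5, 3, 8),
                  stringsAsFactors = FALSE)

# Assumption: the grouping is already known (hardcoded for illustration)
dat$group <- c(1, 2, 1, 2)

# Position of each row within its group: 1 for the first row, 2 for the second
dat$pos <- ave(seq_len(nrow(dat)), dat$group, FUN = seq_along)

# Spread each group's rows side by side; columns get suffixes .1, .2, ...
wide <- reshape(dat, idvar = 'group', timevar = 'pos', direction = 'wide')
print(wide)
```

This should give one row per group, with the first row's values in a.1 to e.1 and the second row's values in a.2 to e.2, plus the group column. Groups with more than two rows would simply get .3, .4 and so on.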