Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

remove duplicates from values in the rows

I have a df with dimension 58000*900 which contains replicates in row values, I want to traverse through every row and remove them. An example will make it more clear.

df
IDs Name    col1    col2    col3
123 AB.C    1.3,1.3,1.3,1.3,1.3 0,0,0,0,0   5,5,5,5,5
234 CD-E    2,2,2,2,2   0.3,0.3,0.3,0.3,0.3 1,1,1,1,1
568 GHJ 123456      123456              123456
345 FGH 9,9,9,9,9   54,54,54,54,54  0,0,0,0,0

Apparently every value is replicated 5 times and in some cases their is a problem that there is no . or , separating the values. What I want is drop those lines which does not contain either . or , and for the rest remove the duplicate values. So, the output will be:

IDs Name    col1    col2    col3
123 AB.C    1.3 0   5
234 CD-E    2   0.3 1
345 FGH 9   54  0

dput(df)
structure(list(IDs = c(123L, 234L, 568L, 345L), Name = structure(c(1L, 
2L, 4L, 3L), .Label = c("ABC", "CDE", "FGH", "GHJ"), class = "factor"), 
    col1 = structure(c(2L, 3L, 1L, 4L), .Label = c("123456", 
    "1.3,1.3,1.3,1.3,1.3", "2,2,2,2,2", "9,9,9,9,9"), class = "factor"), 
    col2 = structure(1:4, .Label = c("0,0,0,0,0", "0.3,0.3,0.3,0.3,0.3", 
    "123456", "54,54,54,54,54"), class = "factor"), col3 = structure(c(4L, 
    2L, 3L, 1L), .Label = c("0,0,0,0,0", "1,1,1,1,1", "123456", 
    "5,5,5,5,5"), class = "factor")), .Names = c("IDs", "Name", 
"col1", "col2", "col3"), class = "data.frame", row.names = c(NA, 
-4L))
like image 358
Newbie Avatar asked Mar 31 '26 20:03

Newbie


1 Answers

First, we restructure your data in a long format using gather(), then we filter() for value with no , using grepl(). We then split the string in value into a list using strsplit() and make each element of the list it's own row using unnest(). We remove duplicated rows using distinct() and spread() back the key and values to columns.

library(dplyr)
library(tidyr)

df %>%
  gather(key, value, -(IDs:Name)) %>%
  filter(grepl(",", value)) %>%
  mutate(value = strsplit(value, ",")) %>%
  unnest(value) %>%
  distinct %>%
  spread(key, value)

Which gives:

#Source: local data frame [3 x 5]
#
#    IDs   Name  col1  col2  col3
#  (int) (fctr) (chr) (chr) (chr)
#1   123   AB.C   1.3     0     5
#2   234   CD-E     2   0.3     1
#3   345    FGH     9    54     0

Another idea would be to use cSplit from splitstackshape:

df %>%
  cSplit(., c("col1", "col2", "col3"), direction = "long", sep = ",") %>%
  group_by(Name) %>%
  filter(!any(is.na(.))) %>%
  distinct

Which gives:

#Source: local data table [3 x 5]
#Groups: Name
#
#    IDs   Name  col1  col2  col3
#  (int) (fctr) (dbl) (dbl) (int)
#1   123   AB.C   1.3   0.0     5
#2   234   CD-E   2.0   0.3     1
#3   345    FGH   9.0  54.0     0
like image 58
Steven Beaupré Avatar answered Apr 03 '26 08:04

Steven Beaupré