
Remove duplicate element within a row in a specific column

Tags: regex, dataframe, r, dplyr, subset

I have a dataframe such as:

COL1  COL2
A,A,A 2
B     1
C,C   4
D,D,D 1
A     4
F     2
C,C   1

And I would like to first remove the duplicates within each COL1 value and get:

COL1  COL2
A     2
B     1
C     4
D     1
A     4
F     2
C     1

and then sum the COL2 values for each COL1 letter and get:

COL1  COL2
A     6
B     1
C     5
D     1
F     2

Does someone have an idea, please? Here is the dataframe, if it helps:

structure(list(COL1 = structure(c(2L, 3L, 4L, 5L, 1L, 6L, 4L), .Label = c("A", 
"A,A,A", "B", "C,C", "D,D,D", "F"), class = "factor"), COL2 = c(2, 
1, 4, 1, 4, 2, 1)), class = "data.frame", row.names = c(NA, -7L
))


asked Jul 25 '21 08:07

chippycentra

3 Answers

A base R option

aggregate(
  COL2 ~ .,
  transform(
    df,
    COL1 = gsub(",.*", "", COL1)
  ),
  sum
)

gives

  COL1 COL2
1    A    6
2    B    1
3    C    5
4    D    1
5    F    2
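Note that gsub(",.*", "", COL1) keeps only the text before the first comma, so it relies on every item in a cell being the same letter, as in the question's data. As a minimal sketch of what would happen otherwise, the snippet below compares it with a hypothetical helper, dedupe_cell (not part of the answer), on an invented mixed cell "A,B,A":

```r
# Hypothetical helper (not from the answer): de-duplicate the comma-separated
# items inside each cell, keeping them in first-seen order.
dedupe_cell <- function(x) {
  parts <- strsplit(as.character(x), ",\\s*")
  vapply(parts, function(p) paste(unique(p), collapse = ","), character(1))
}

# "A,B,A" is an invented mixed cell; the question's data has no such value.
cells <- c("A,A,A", "B", "C,C", "A,B,A")

first_only <- gsub(",.*", "", cells)  # text before the first comma only
deduped    <- dedupe_cell(cells)      # every distinct item in the cell

print(first_only)  # "A" "B" "C" "A"
print(deduped)     # "A" "B" "C" "A,B"
```

For the data in the question both give the same result; they differ only when a cell mixes different letters.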


answered Nov 15 '22 01:11

ThomasIsCoding

An option with trimws

library(dplyr)
df %>%
     group_by(COL1 = trimws(COL1, whitespace = ",.*")) %>% 
     summarise(COL2 = sum(COL2), .groups = 'drop')
# A tibble: 5 x 2
  COL1   COL2
  <chr> <dbl>
1 A         6
2 B         1
3 C         5
4 D         1
5 F         2

answered Nov 14 '22 23:11

akrun

You can use separate_rows to split COL1 on commas into separate rows, keep only the unique values within each original row, and then aggregate.

library(dplyr)
library(tidyr)

df %>%
  mutate(row = row_number()) %>%
  separate_rows(COL1, sep = ',\\s*') %>%
  distinct(row, COL1, .keep_all = TRUE) %>%
  group_by(COL1) %>%
  summarise(COL2 = sum(COL2, na.rm = TRUE))

#  COL1   COL2
#  <chr> <dbl>
#1 A         6
#2 B         1
#3 C         5
#4 D         1
#5 F         2
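For readers without tidyr, the same split, de-duplicate and sum steps can be sketched in base R. This is only an illustration of the idea, not the answer's code: the data frame is rebuilt by hand from the question's values, and the names parts, n, long and result are made up for the example.

```r
# The question's data, rebuilt with COL1 as character for simplicity.
df <- data.frame(
  COL1 = c("A,A,A", "B", "C,C", "D,D,D", "A", "F", "C,C"),
  COL2 = c(2, 1, 4, 1, 4, 2, 1),
  stringsAsFactors = FALSE
)

# Split each cell on commas and repeat the row id and COL2 once per item.
parts <- strsplit(df$COL1, ",\\s*")
n <- lengths(parts)
long <- data.frame(
  row  = rep(seq_along(parts), n),
  COL1 = unlist(parts),
  COL2 = rep(df$COL2, n),
  stringsAsFactors = FALSE
)

# Drop repeated items within the same original row only; the row id stops
# the later "A" and "C,C" rows from being treated as duplicates.
long <- long[!duplicated(long[c("row", "COL1")]), ]

# Sum COL2 per letter.
result <- aggregate(COL2 ~ COL1, data = long, FUN = sum)
print(result)
```

The row column plays the same role as in the distinct(row, COL1, .keep_all = TRUE) step above: without it, the second A row and the second C,C row would be discarded and their COL2 values lost from the sums.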

answered Nov 15 '22 00:11

Ronak Shah
