I have a dataframe such as:
COL1 COL2
A,A,A 2
B 1
C,C 4
D,D,D 1
A 4
F 2
C,C 1
I would first like to remove the duplicates within COL1
and get:
COL1 COL2
A 2
B 1
C 4
D 1
A 4
F 2
C 1
and then sum the COL2 values
for each COL1 letter
and get:
COL1 COL2
A 6
B 1
C 5
D 1
F 2
Does anyone have an idea, please? Here is the dataframe in case it helps:
structure(list(COL1 = structure(c(2L, 3L, 4L, 5L, 1L, 6L, 4L), .Label = c("A",
"A,A,A", "B", "C,C", "D,D,D", "F"), class = "factor"), COL2 = c(2,
1, 4, 1, 4, 2, 1)), class = "data.frame", row.names = c(NA, -7L
))
A base R option: strip everything from the first comma onward in COL1, then aggregate COL2 by the cleaned COL1.
aggregate(
  COL2 ~ .,                        # sum COL2 within each value of the remaining column (COL1)
  transform(
    df,
    COL1 = gsub(",.*", "", COL1)   # drop everything from the first comma, e.g. "A,A,A" -> "A"
  ),
  sum
)
gives
COL1 COL2
1 A 6
2 B 1
3 C 5
4 D 1
5 F 2
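For reference, the transform() call on its own already reproduces the intermediate table from the question (assuming the dput output above is stored as df), since gsub(",.*", "", x) removes everything from the first comma onward and coerces COL1 to character:
transform(df, COL1 = gsub(",.*", "", COL1))
#   COL1 COL2
# 1    A    2
# 2    B    1
# 3    C    4
# 4    D    1
# 5    A    4
# 6    F    2
# 7    C    1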
An option with trimws (here the question's data is assumed to be stored as df1):
library(dplyr)
df1 %>%
  group_by(COL1 = trimws(COL1, whitespace = ",.*")) %>%   # strip ",..." so "C,C" becomes "C", then group
  summarise(COL2 = sum(COL2), .groups = 'drop')           # sum COL2 within each cleaned COL1
# A tibble: 5 x 2
COL1 COL2
<chr> <dbl>
1 A 6
2 B 1
3 C 5
4 D 1
5 F 2
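If the trimws() call looks unusual: since R 3.6.0, trimws() has a whitespace argument, and passing the pattern ",.*" makes it strip that pattern from the ends of each string, so everything from the first comma onward is removed:
trimws("A,A,A", whitespace = ",.*")
# [1] "A"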
You can use separate_rows
to split the data on commas into different rows, keep only the unique values within each original row, and then aggregate.
library(dplyr)
library(tidyr)
df %>%
  mutate(row = row_number()) %>%               # remember the original row of each value
  separate_rows(COL1, sep = ',\\s*') %>%       # split "A,A,A" into one row per letter
  distinct(row, COL1, .keep_all = TRUE) %>%    # keep each letter only once per original row
  group_by(COL1) %>%
  summarise(COL2 = sum(COL2, na.rm = TRUE))    # sum COL2 for each letter
# COL1 COL2
# <chr> <dbl>
#1 A 6
#2 B 1
#3 C 5
#4 D 1
#5 F 2
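For reference, the first three steps alone (before group_by/summarise) reproduce the 7-row intermediate table from the question, one distinct letter per original row, plus the helper row column:
df %>%
  mutate(row = row_number()) %>%
  separate_rows(COL1, sep = ',\\s*') %>%
  distinct(row, COL1, .keep_all = TRUE)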