The first two columns of my data frame form a composite key, and there is a character column that contains comma-separated integers. My objective is to add a column that contains the count of unique integers in each string.
I know the approach of splitting the string into columns with str_split_fixed and then counting the unique values, but because the strings are long, a large number of columns get added and everything lags. Is there another method?
The actual data set contains 500k rows and 53 columns.
Sample dataset:
df
c1 c2 c3
aa 11 1,13,4,5,4,7,9
bb 22 2,5,2,4,5,7,11,
cc 33 11,14,3,1,
dd 44 1,1,2,4,5,6,15,
ee 55 4,3,3,1,14,17,
Desired output:
c1 | c2 | c3 | c4
------ | ------ | ------ | -----
aa | 11 | 1,13,4,5,4,7,9 | 6
bb | 22 | 2,5,2,4,5,7,11, | 5
cc | 33 | 11,14,3,1, | 4
dd | 44 | 1,1,2,4,5,6,15, | 6
ee | 55 | 4,3,3,1,14,17, | 5
Any help would be appreciated!
Using strsplit with uniqueN from the data.table package:
library(data.table)
df$c4 <- sapply(strsplit(df$c3, ','), uniqueN)
which gives:
> df
c1 c2 c3 c4
1 aa 11 1,13,4,5,4,7,9 6
2 bb 22 2,5,2,4,5,7,11, 5
3 cc 33 11,14,3,1, 4
4 dd 44 1,1,2,4,5,6,15, 6
5 ee 55 4,3,3,1,14,17, 5
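If you would rather not load data.table just for uniqueN, the same idea can be sketched in base R. This is a minimal, self-contained version: the data.frame() call is my reconstruction of the sample data from the question, and length(unique(x)) stands in for uniqueN.

```r
# Sample data reconstructed from the question; stringsAsFactors = FALSE
# keeps c3 as a character column (the default since R 4.0)
df <- data.frame(
  c1 = c("aa", "bb", "cc", "dd", "ee"),
  c2 = c(11, 22, 33, 44, 55),
  c3 = c("1,13,4,5,4,7,9", "2,5,2,4,5,7,11,", "11,14,3,1,",
         "1,1,2,4,5,6,15,", "4,3,3,1,14,17,"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats "," as a literal string rather than a regex.
# strsplit() does not return an empty piece for a trailing separator.
parts <- strsplit(df$c3, ",", fixed = TRUE)

# vapply() is a type-checked sapply(); count the distinct pieces per row
df$c4 <- vapply(parts, function(x) length(unique(x)), integer(1))

print(df$c4)
```

Note that the pieces are compared as strings, so "01" and "1" (or "1" and " 1") would count as different values. If that can happen in your data, convert the pieces with as.integer() before calling unique().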
NOTE: if df$c3 is a factor variable, wrap it in as.character: sapply(strsplit(as.character(df$c3), ','), uniqueN)
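To see why the conversion matters, here is a small sketch with a hypothetical factor vector (c3_factor is made up for illustration). strsplit() only accepts character vectors, so passing a factor directly raises an error, while the as.character() version works.

```r
# Hypothetical factor column, e.g. from a data frame read with
# stringsAsFactors = TRUE
c3_factor <- factor(c("1,13,4,5,4,7,9", "11,14,3,1,"))

# Passing the factor directly fails; capture the error message instead
# of stopping the script
result <- tryCatch(
  strsplit(c3_factor, ","),
  error = function(e) conditionMessage(e)
)
print(result)

# Converting to character first gives the expected per-row counts
counts <- sapply(strsplit(as.character(c3_factor), ","),
                 function(x) length(unique(x)))
print(counts)
```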
Another base R alternative for creating df$c4:
sapply(regmatches(df$c3, gregexpr('\\d+', df$c3)), function(x) length(unique(x)))
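To make the regmatches one-liner easier to follow, here is the same logic applied step by step to a single string taken from the sample data. The variable names are only illustrative.

```r
x <- "2,5,2,4,5,7,11,"

# gregexpr() finds the position of every run of digits in the string
m <- gregexpr("\\d+", x)

# regmatches() extracts the matched substrings; [[1]] because the
# result is a list with one element per input string
tokens <- regmatches(x, m)[[1]]
print(tokens)

# Count the distinct tokens
n_unique <- length(unique(tokens))
print(n_unique)
```

Because only digit runs are extracted, stray spaces and the trailing comma are ignored automatically. The trade-off is that a minus sign would be dropped, so this pattern assumes non-negative integers.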
A tidyverse alternative:
library(dplyr)
library(tidyr)
df %>%
separate_rows(c3) %>%
filter(c3 != '') %>%
# group by both key columns, since c1 and c2 together form the key
group_by(c1, c2) %>%
summarise(c4 = n_distinct(c3)) %>%
left_join(df, ., by = c('c1', 'c2'))
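Since the question mentions 500k rows, it may be worth timing the row-wise split on data of that size before picking an approach. The sketch below uses synthetic strings (7 random integers between 1 and 20 per row, which is an assumption and not the real data) and base R only. Actual timings will depend on your machine and on how long the real strings are.

```r
set.seed(1)
n <- 5e5

# Synthetic stand-in for the c3 column: 7 random integers per row,
# pasted together column-wise so the construction itself is vectorised
vals <- matrix(sample(1:20, n * 7, replace = TRUE), ncol = 7)
c3 <- do.call(paste, c(as.data.frame(vals), sep = ","))

# Time the split-and-count step on all rows
timing <- system.time(
  c4 <- vapply(strsplit(c3, ",", fixed = TRUE),
               function(x) length(unique(x)), integer(1))
)
print(timing)
```

Unlike str_split_fixed, this never materialises one column per value, so the cost should not depend on the widest string in the data.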