Reduce string length by removing contiguous duplicates

Question

I have an R data frame with 2 fields:

ID             WORD
1           AAAAABBBBB
2           ABCAAABBBDDD
3           ...

I'd like to simplify words that contain repeated letters by collapsing each run of identical consecutive letters to a single letter.

For example, AAAAABBBBB should give me AB, and ABCAAABBBDDD should give me ABCABD.

Does anyone have an idea how to do this?

Blue Magister · Accepted Answer

Here's a solution with regex:

x <- c('AAAAABBBBB', 'ABCAAABBBDDD')
gsub("([A-Za-z])\\1+","\\1",x)
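
To apply the same substitution to the data frame from the question, here is a minimal sketch. The data frame name `df` is an assumption (the question does not name it), and WORD is assumed to be a character column rather than a factor.

```r
# Hypothetical data frame mirroring the question; the name `df` is assumed
df <- data.frame(ID = 1:2,
                 WORD = c("AAAAABBBBB", "ABCAAABBBDDD"),
                 stringsAsFactors = FALSE)

# ([A-Za-z]) captures one letter; \\1+ matches one or more repeats of that
# same letter, and the whole run is replaced by the single captured letter
df$WORD <- gsub("([A-Za-z])\\1+", "\\1", df$WORD)
df$WORD
## [1] "AB"     "ABCABD"
```

Because gsub is vectorised over its input, no explicit loop over the rows is needed.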

EDIT: By request, some benchmarking. I added the pattern from Matthew Lundberg's comment, which matches any character. In the timings below, gsub is more than an order of magnitude faster than the sapply/rle approach (roughly 30 to 45 times by median), and matching any character is somewhat faster than matching only letters.

library(microbenchmark)
set.seed(1)
##create sample dataset
x <- apply(
  replicate(100,sample(c(LETTERS[1:3],""),10,replace=TRUE))
,2,paste0,collapse="")
##benchmark
xm <- microbenchmark(
    SAPPLY = sapply(strsplit(x, ''), function(x) paste0(rle(x)$values, collapse=''))
    ,GSUB.LETTER = gsub("([A-Za-z])\\1+","\\1",x)
    ,GSUB.ANY = gsub("(.)\\1+","\\1",x)
)
##print results
print(xm)
# Unit: milliseconds
#          expr       min        lq    median        uq       max
# 1    GSUB.ANY  1.433873  1.509215  1.562193  1.664664  3.324195
# 2 GSUB.LETTER  1.940916  2.059521  2.108831  2.227435  3.118152
# 3      SAPPLY 64.786782 67.519976 68.929285 71.164052 77.261952

##boxplot of times
boxplot(xm)
##plot with ggplot2
library(ggplot2)
qplot(y=time, data=xm, colour=expr) + scale_y_log10()

Matthew Lundberg · Answer

x <- c('AAAAABBBBB', 'ABCAAABBBDDD')
sapply(strsplit(x, ''), function(x) paste0(rle(x)$values, collapse=''))
## [1] "AB"     "ABCABD"
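
For readers unfamiliar with rle, the sketch below breaks the one-liner into its intermediate steps for a single string. The variable names `chars` and `runs` are illustrative and not part of the original answer.

```r
# Split one string into a vector of single characters
chars <- strsplit("ABCAAABBBDDD", "")[[1]]

# rle() run-length encodes the vector: one entry per run of equal values
runs <- rle(chars)
runs$values
## [1] "A" "B" "C" "A" "B" "D"
runs$lengths
## [1] 1 1 1 3 3 3

# Keeping only the run values drops the contiguous duplicates
result <- paste0(runs$values, collapse = "")
result
## [1] "ABCABD"
```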
