Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove repeated elements in a string with R

Tags:

string

regex

r

I plan to remove repeated elements (each containing two or more characters) from strings. For example, from "aaa" I expect "aaa", from "aaaa" I expect "aa", from "abababcdcd" I epxect "abcd", from "cdababcdcd" I expect "cdabcd".

I tried gsub("(.{2,})\\1+","\\1",str). It works in cases 1-3, but fails in case 4. How to solve this problem?

like image 870
Ming Avatar asked Feb 20 '19 14:02

Ming


Video Answer


1 Answers

SOLUTION

The solution is to rely on the PCRE or ICU regex engines, rather than TRE.

Use either base R gsub with perl=TRUE (it uses PCRE regex engine) and "(?s)(.{2,})\\1+" pattern, or a stringr::str_replace_all() (it uses ICU regex engine) with the same pattern:

> x <- "cdababcdcd"
> gsub("(?s)(.{2,})\\1+", "\\1", x, perl=TRUE)
[1] "cdabcd"
> library(stringr)
> str_replace_all(x, "(?s)(.{2,})\\1+", "\\1")
[1] "cdabcd"

The (?s) flag is necessary for . to match any char including line break chars (in TRE regex, . matches all chars by default).

DETAILS

TRE regex is not good at handling "pathological" cases that are mostly related to backtracking, which directly involves quantifiers (I bolded some parts):

The matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression. In other words, the time complexity of the algorithm is O(M2N), where M is the length of the regular expression and N is the length of the text. The used space is also quadratic on the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only on pathological cases which are probably very rare in practice.

Predictable matching speed
Because of the matching algorithm used in TRE, the maximum time consumed by any regexec() call is always directly proportional to the length of the searched string. There is one exception: if back references are used, the matching may take time that grows exponentially with the length of the string. This is because matching back references is an NP complete problem, and almost certainly requires exponential time to match in the worst case.

In those cases when TRE has trouble calculating all possibilities of matching a string it does not return any match, the string is returned as is. Hence, there is no changes in the gsub call.

like image 154
Wiktor Stribiżew Avatar answered Sep 29 '22 16:09

Wiktor Stribiżew