
Trouble with gsub and regex in R

Tags: regex, r, gsub

I am using gsub in R to insert text into the middle of a string. It works perfectly, but when the insertion position gets too large it throws an error. The code is below:

gsub(paste0('^(.{', as.integer(loc[1])-1, '})(.+)$'), new_cols, sql)
Error in gsub(paste0("^(.{273})(.+)$"), new_cols, sql) :  invalid
  regular expression '^(.{273})(.+)$', reason 'Invalid contents of {}'

This code works fine when the number in the braces (273 in this case) is smaller, but not when it is this large.


This produces the error:

sql <- "The cat with the bat went to town. He ate the fat mat and wouldn't stop til the sun came up. He was a fat cat that lived with a rat who owned many hats.The cat with the bat went to town. He ate the fat mat and wouldn't stop til the sun came up. He was a fat cat that lived with a rat who owned many hats."  
new_cols <- "happy" 
gsub('^(.{125})(.+)$', new_cols, sql)  # works
gsub('^(.{273})(.+)$', new_cols, sql) 
Error in gsub("^(.{273})(.+)$", new_cols, sql) :    invalid regular
  expression '^(.{273})(.+)$', reason 'Invalid contents of {}'
Soxman asked May 19 '16 12:05



1 Answer

Background

R's gsub uses the TRE regex library by default. In TRE, the bounds of a limiting quantifier must lie between 0 and RE_DUP_MAX, a constant defined in the TRE code. The TRE reference states:

A bound is one of the following, where n and m are unsigned decimal integers between 0 and RE_DUP_MAX

RE_DUP_MAX appears to be set to 255 (the TRE source file contains #define RE_DUP_MAX 255), so a {n,m} limiting quantifier cannot use a bound above 255. That is why {125} works and {273} fails.
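
To see where the limit falls on a given R build, a minimal sketch along these lines may help. The helper name compiles_with_tre and the test string are made up for illustration; the boundary it reports is only expected to be 255 if that build's TRE really defines RE_DUP_MAX as 255.

```r
# Hypothetical helper (not part of R): TRUE if the default (TRE) engine
# accepts a pattern with the bound n, FALSE if gsub() raises an error.
compiles_with_tre <- function(n) {
  pattern <- paste0("^(.{", n, "})(.+)$")
  subject <- strrep("a", 300)  # any string longer than n characters will do
  tryCatch({
    suppressWarnings(gsub(pattern, "", subject))
    TRUE
  }, error = function(e) FALSE)
}

compiles_with_tre(255)  # expected TRUE if RE_DUP_MAX is 255
compiles_with_tre(256)  # expected FALSE if RE_DUP_MAX is 255
```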

Solution

Switch to the PCRE regex flavor by adding perl = TRUE, and the pattern will compile.

R demo:

> sql <- "The cat with the bat went to town. He ate the fat mat and wouldn't stop til the sun came up. He was a fat cat that lived with a rat who owned many hats.The cat with the bat went to town. He ate the fat mat and wouldn't stop til the sun came up. He was a fat cat that lived with a rat who owned many hats."
> new_cols <- "happy"
> gsub('^(.{273})(.+)$', new_cols, sql, perl=TRUE)
[1] "happy"
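
Note that the demo returns just "happy" because the replacement contains no backreferences, so the whole match is replaced. Since the question is about inserting text in the middle of a string, here is a minimal sketch of that, assuming the insertion position is held in a variable pos (a hypothetical name; in the question it is computed as as.integer(loc[1])-1) and using a made-up 300-character string. A regex-free variant with substr is shown alongside it for comparison.

```r
sql <- strrep("abcdefghij", 30)  # illustrative 300-character string
new_cols <- "happy"
pos <- 273                       # hypothetical insertion position

# PCRE variant: keep both captured groups and put new_cols between them.
pattern <- paste0("^(.{", pos, "})(.+)$")
inserted <- gsub(pattern, paste0("\\1", new_cols, "\\2"), sql, perl = TRUE)

# Regex-free variant: split the string at pos and paste the pieces together.
alt <- paste0(substr(sql, 1, pos), new_cols, substr(sql, pos + 1, nchar(sql)))

identical(inserted, alt)
```

The substr variant avoids the quantifier limit entirely and does not treat backslashes in new_cols as special, which may matter if the inserted text comes from user input.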
Wiktor Stribiżew answered Oct 18 '22 18:10