Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In regex, mystery Error: assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634

Assume 900+ company names pasted together to form a regex pattern using the pipe separator -- "firm.pat".

firm.pat <- str_c(firms$firm, collapse = "|")

With a data frame called "bio" that has a large character variable (250 rows each with 100+ words) named "comment", I would like to replace all the company names with blanks. Both a gsub call and a str_replace_all call return the same mysterious error.

bio$comment <- gsub(pattern = firm.pat, x = bio$comment, replacement = "")
Error in gsub(pattern = firm.pat, x = bio$comment, replacement = "") : 
  assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634

library(stringr)
bio$comment <- str_replace_all(bio$comment, firm.pat,  "")
Error: assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634

traceback() did not enlighten me.

> traceback()
4: gsub("aaronson rappaport|adams reese|adelson testan|adler pollock|ahlers cooney|ahmuty demers|akerman|akin gump|allen kopet|allen matkins|alston bird|alston hunt|alvarado smith|anderson kill|andrews kurth|archer 

# hundreds of lines of company names omitted here

lties in all 50 states and washington, dc. results are compiled through a peer-review survey in which thousands of lawyers in the u.s. confidentially evaluate their professional peers."
       ), fixed = FALSE, ignore.case = FALSE, perl = FALSE)
3: do.call(f, compact(args))
2: re_call("gsub", string, pattern, replacement)
1: str_replace_all(bio$comment, firm.pat, "")

Three other posts have mentioned the cryptic error on SO, a passing reference and cites two other oblique references, but with no discussion.

I know this question lacks reproducible code, but even so, how do I find out what the error is explaining? Even better, how do I avoid throwing the error? The error does not seem to occur with smaller numbers of companies but I can't detect a pattern or threshold. I am running Windows 8, RStudio, updated versions of every package.

Thank you.

like image 966
lawyeR Avatar asked Feb 23 '15 22:02

lawyeR


1 Answers

I had the same problem with pattern consisiting of hundreds of manufacters names. As I can suggest the pattern is too long, so I split it in two or more patterns and it works well.

  ml<-length(firms$firm)
  xyz<-gsub(sprintf("(*UCP)\\b(%s)\\b", paste(head(firms$firm,n=ml/2), collapse = "|")), "", bio$comment, perl=TRUE)
  xyz<-gsub(sprintf("(*UCP)\\b(%s)\\b", paste(tail(firms$firm,n=ml/2), collapse = "|")), "", xyz, perl=TRUE)
like image 68
Aleksandr Meshkov Avatar answered Sep 30 '22 11:09

Aleksandr Meshkov