
How to exclude one set of words but include another in qregexp?

Tags: regex, qregexp

I am trying to write a QRegExp expression that includes one group of words but excludes another, and I am having trouble figuring it out.

Here are some of the things I tried (this example included all of the words):

(words|I|want|to|include)(?!the|ones|that|should|not|match)

So I tried this (which returned nothing):

^(words|I|want|to|include)(?:(?!the|ones|that|should|not|match).)*$

Am I missing something?

Edit: I need such an unusual regex (include/exclude) because I want to search through a series of articles and filter the ones that contain the included words, but not if they also contain the excluded words.

So for example if article A is:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

and article B is:

Vivamus fermentum semper porta.

Then a regex that includes lorem would filter article A but not B. But if ipsum is a word that I'm excluding, I do not want article A to be filtered.

I considered doing a regex to filter out the articles with the words that I want and then run a second regex excluding articles from the first set that I do not want, but unfortunately the software I am using does not allow me to do this. I can only run one regular expression.

thequerist asked Feb 09 '23 07:02


2 Answers

I don't think you need a tempered greedy token here. Put the excluded words as alternatives inside a negative lookahead anchored at the start of the string. Let me walk you through it.

You say you have Lorem ipsum dolor sit amet, consectetur adipiscing elit., and you want it to match because it contains the word lorem. The regex is \\blorem\\b (with case-insensitive matching enabled, e.g. setCaseSensitivity(Qt::CaseInsensitive) on the QRegExp), where \b forces whole-word matching. To prevent a match when the string also contains the word ipsum, put a negative lookahead at the very beginning of the string:

^(?!.*\\bipsum\\b).*\\blorem\\b

Now, it does not match the string in question.
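To check the behaviour quickly outside Qt, here is a minimal sketch in Python. It assumes Python's re module as a stand-in for QRegExp (the lookahead syntax is the same for this pattern), and article_c is a hypothetical third article that is not in the question:

```python
import re

# Raw string, so single backslashes are enough here (unlike a C++ literal).
PATTERN = re.compile(r"^(?!.*\bipsum\b).*\blorem\b", re.IGNORECASE)

article_a = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
article_b = "Vivamus fermentum semper porta."
article_c = "Lorem dolor sit amet."  # hypothetical: has lorem, no ipsum


def is_selected(text: str) -> bool:
    # True only when 'lorem' is present and 'ipsum' is absent.
    return PATTERN.search(text) is not None


for name, text in [("A", article_a), ("B", article_b), ("C", article_c)]:
    print(name, is_selected(text))
```

Article A is rejected because of ipsum, article B because it has no lorem, and only the hypothetical article C passes.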

To add more alternatives, use the alternation operator |, like this: ^(?!.*\\b(?:words|to|exclude)\\b).*\\b(?:words|to|include)\\b. Note the non-capturing groups (?:...): they do not store the matched text, which can improve performance compared with capturing groups that save the matched text in a buffer.

Thus, you get

^(?!.*\\b(?:the|ones|that|should|not|match)\\b).*\\b(?:words|I|want|to|include)\\b

See demo
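If the word lists change often, the pattern can be assembled from two lists rather than typed by hand. This is only a sketch in Python (again standing in for QRegExp); build_filter is a hypothetical helper name and the sample sentences are made up:

```python
import re


def build_filter(include, exclude):
    # Assemble: ^(?!.*\b(?:excluded)\b).*\b(?:included)\b
    # re.escape guards against words that contain regex metacharacters.
    inc = "|".join(re.escape(word) for word in include)
    exc = "|".join(re.escape(word) for word in exclude)
    return re.compile(rf"^(?!.*\b(?:{exc})\b).*\b(?:{inc})\b", re.IGNORECASE)


article_filter = build_filter(
    include=["words", "I", "want", "to", "include"],
    exclude=["the", "ones", "that", "should", "not", "match"],
)

for sample in ["I want tea", "I want the tea", "notable words"]:
    print(repr(sample), article_filter.search(sample) is not None)
```

The last sample shows why \b matters: "notable" contains "not" but is not the whole word, so it does not trigger the exclusion.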

Two remarks:

  1. The demo website needs single backslashes; I am doubling them here because a QRegExp pattern written in a C++ string literal needs each backslash escaped.
  2. In Qt, . in the pattern matches any character, including a newline. On the demo website, the dot does not match newline characters. You can replace it with [^\n] if you need the same behavior in Qt, but I don't think it is necessary.
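The second remark matters as soon as an article spans several lines. A small Python sketch of the difference, assuming Python's default dot behaviour resembles the demo site's and re.DOTALL approximates what is described for Qt above (the two-line text is made up):

```python
import re

PATTERN = r"^(?!.*\bipsum\b).*\blorem\b"
text = "Lorem dolor sit amet\nipsum on the second line"

# Default: '.' stops at the newline, so the lookahead never reaches 'ipsum'.
line_bound = re.search(PATTERN, text, re.IGNORECASE)

# DOTALL: '.' also matches '\n', so the lookahead scans the whole text.
whole_text = re.search(PATTERN, text, re.IGNORECASE | re.DOTALL)

print(line_bound is not None)  # the article is (wrongly) selected
print(whole_text is not None)  # the article is rejected
```

So when the dot matches newlines, an excluded word anywhere in the article blocks the match, which is what the filter is supposed to do.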
Wiktor Stribiżew answered Apr 27 '23 23:04


^(?:(?!\b(?:the|ones|that|should|not|match)\b).)*\b(?:words|I|want|to|include)\b(?:(?!\b(?:the|ones|that|should|not|match)\b).)*$

You need to apply the negative lookahead to the text on both sides of the word that should match. See demo.

https://regex101.com/r/bK9wF1/3
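To make the structure of this pattern easier to read, here is a sketch that builds it from named pieces. It uses Python's re module as a stand-in for QRegExp; the names EXCLUDED, INCLUDED and TEMPERED and the sample sentences are only illustrative:

```python
import re

EXCLUDED = r"\b(?:the|ones|that|should|not|match)\b"
INCLUDED = r"\b(?:words|I|want|to|include)\b"

# Consume one character at a time, but only where no excluded word starts.
TEMPERED = rf"(?:(?!{EXCLUDED}).)*"

tempered_pattern = re.compile(rf"^{TEMPERED}{INCLUDED}{TEMPERED}$")

for sample in ["I want tea", "I want the tea", "the words"]:
    print(repr(sample), tempered_pattern.search(sample) is not None)
```

The lookahead is rechecked at every character on both sides of the included word, so an excluded word before or after it prevents the match.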

or

^(?!.*\b(?:the|ones|that|should|not|match)\b)(?=.*\b(?:words|I|want|to|include)\b).*$

Here both conditions are expressed as lookaheads at the start of the string. See demo.

https://regex101.com/r/uF4oY4/60
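A quick way to try the two-lookahead version, sketched in Python (standing in for QRegExp; the sample sentences are made up). The pattern is split across lines only so that each condition can carry a comment:

```python
import re

lookahead_pattern = re.compile(
    r"^(?!.*\b(?:the|ones|that|should|not|match)\b)"  # no excluded word anywhere
    r"(?=.*\b(?:words|I|want|to|include)\b)"          # at least one included word
    r".*$"                                            # then consume the whole line
)

for sample in ["words only", "the words", "nothing relevant"]:
    match = lookahead_pattern.search(sample)
    print(repr(sample), match.group(0) if match else None)
```

Because both checks start from position 0, the order of included and excluded words in the text does not matter, and a successful match covers the entire line.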

vks answered Apr 27 '23 22:04