
How to use word break, asterisk, word break in Regex with Perl?

Tags: regex, perl

I have a complex precompiled regular expression in Perl. In most cases the regex is fine: it matches everything it should and nothing it shouldn't, with one exception.

Basically my regex looks like:

my $regexp = qr/\b(FOO|BAR|\*)\b/;

Unfortunately, m/\b\*\b/ won't match the asterisk in a string such as "example, *". Only m/\*/ does, but I can't use that because it produces false positives. Is there a workaround?

From the comments: the false positives are **, example*, and exam*ple.

What is the regex intended for? It should extract keywords (one of which is a single asterisk) that coworkers have entered into product data. The goal is to move this information out of a free-text field into an atomic one.
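A minimal script that reproduces the behaviour described above may make the problem easier to see. The sample strings come from the question; the variable names are only illustrative. Because * is not a word character, \b next to it succeeds only where a word character sits on the other side, so the boundary version matches exam*ple and nothing else.

```perl
use strict;
use warnings;

# Strings from the question: the first should match, the rest are false positives.
my @samples = ('example, *', '**', 'example*', 'exam*ple');

my $with_boundaries = qr/\b\*\b/;   # asterisk wrapped in word boundaries
my $bare_asterisk   = qr/\*/;       # asterisk on its own

my (%boundary_hit, %bare_hit);
for my $s (@samples) {
    $boundary_hit{$s} = $s =~ $with_boundaries ? 1 : 0;
    $bare_hit{$s}     = $s =~ $bare_asterisk   ? 1 : 0;
    printf "%-12s  \\b\\*\\b: %d  \\*: %d\n", $s, $boundary_hit{$s}, $bare_hit{$s};
}
# \b\*\b matches only 'exam*ple'; \* matches all four strings.
```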

asked Feb 04 '14 15:02 by burnersk



1 Answer

It sounds like you want to treat * as a word character.

\b

is equivalent to

(?x: (?<!\w)(?=\w) | (?<=\w)(?!\w) )

so you want

(?x: (?<![\w*])(?=[\w*]) | (?<=[\w*])(?![\w*]) )
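To see what widening the character class changes, the sketch below lists the positions at which each assertion succeeds in the question's example string. The helper boundary_positions and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $builtin  = qr/\b/;
my $expanded = qr/(?x: (?<!\w)(?=\w) | (?<=\w)(?!\w) )/;
my $widened  = qr/(?x: (?<![\w*])(?=[\w*]) | (?<=[\w*])(?![\w*]) )/;

# Return every offset in $text at which the zero-width pattern $re matches.
sub boundary_positions {
    my ($re, $text) = @_;
    my @positions;
    while ($text =~ /$re/g) {
        push @positions, pos($text);
    }
    return @positions;
}

my $text = 'example, *';
print "builtin:  @{[ boundary_positions($builtin,  $text) ]}\n";   # 0 7
print "expanded: @{[ boundary_positions($expanded, $text) ]}\n";   # 0 7
print "widened:  @{[ boundary_positions($widened,  $text) ]}\n";   # 0 7 9 10
```

The widened form adds boundaries at offsets 9 and 10, on either side of the asterisk, which is what lets the keyword match there.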

Applied, you get the following:

qr/
    (?: (?<![\w*])(?=[\w*]) | (?<=[\w*])(?![\w*]) )
    (FOO|BAR|\*)
    (?: (?<![\w*])(?=[\w*]) | (?<=[\w*])(?![\w*]) )
/x

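But the middle expression always begins and ends with a character matched by [\w*]. In front of it, (?=[\w*]) therefore always succeeds and the second alternative can never match; behind it, (?<=[\w*]) always succeeds and the first alternative can never match. Only the negative lookbehind in front and the negative lookahead behind do any work, so that can be simplified to the following: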

qr/(?<![\w*])(FOO|BAR|\*)(?![\w*])/
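As a quick check, here is a small sketch that runs the simplified pattern over the strings from the question. The inputs FOO, "a BAR b", "* FOO" and FOOBAR are hypothetical additions, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# The simplified pattern: '*' counts as a word character when deciding
# where a keyword may start and end.
my $regexp = qr/(?<![\w*])(FOO|BAR|\*)(?![\w*])/;

# The first four inputs should yield keywords, the last four should not.
my @samples = ('example, *', 'FOO', 'a BAR b', '* FOO',
               '**', 'example*', 'exam*ple', 'FOOBAR');

my %found;
for my $s (@samples) {
    my @keywords = $s =~ /$regexp/g;   # list context with /g returns every capture
    $found{$s} = \@keywords;
    printf "%-12s => [%s]\n", $s, join(', ', @keywords);
}
```

With these inputs, "example, *" yields the asterisk and "* FOO" yields both keywords, while **, example*, exam*ple and FOOBAR yield nothing.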
answered Sep 24 '22 21:09 by ikegami