
Trying NOT to match a Japanese word using RegEx negative lookbehind

The target structure looks like the following:

検索結果:100,000件

I use the following regex pattern:

((?<!検索結果:)(?<!次の)(((〇|一|二|三|四|五|六|七|八|九|十|百|千|万|億|兆|京+|[0-90-9]))(,|,|、)?).+((〇|一|二|三|四|五|六|七|八|九|十|百|千|万|億|兆|京|[0-90-9]).+)件)(?!表示)

As you can see, I want to exclude any number that is preceded by "検索結果:" or "次の". The number itself may be written in Arabic numerals or in Japanese kanji (Chinese character) numerals. However, the exclusion only works for numbers of up to 4 digits; longer numbers are still partially matched.

In other words,

次の1000件

works (meaning it doesn't match anything), but

次の5,0000件

gives a partial match ("0000件")

I want to know why the exclusion only works for up to 4 digits, and ultimately I want a way to NOT match anything in these cases with this regex. I know this regex is a bit messy. Thanks in advance for your feedback!
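The behaviour described above can be reproduced with a short script. This is a minimal sketch that assumes Python's re module (the question does not name a regex engine); the pattern is copied as posted, and the helper name first_match is made up for illustration.

```python
import re

# Pattern copied verbatim from the question (split across two strings for readability).
PATTERN = re.compile(
    r"((?<!検索結果:)(?<!次の)(((〇|一|二|三|四|五|六|七|八|九|十|百|千|万|億|兆|京+|[0-90-9]))(,|,|、)?)"
    r".+((〇|一|二|三|四|五|六|七|八|九|十|百|千|万|億|兆|京|[0-90-9]).+)件)(?!表示)"
)

def first_match(text):
    """Return (start index, matched text) for the first match, or None."""
    m = PATTERN.search(text)
    return (m.start(), m.group(0)) if m else None

for sample in ["次の1000件", "次の5,0000件"]:
    print(sample, "->", first_match(sample))
```

For the second sample the match starts at index 4, in the middle of the number, where the two characters before the match are "5," rather than "次の".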

Michael asked Jan 15 '19 07:01




1 Answer

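A lookbehind is only checked at the position where a match attempt starts. When the attempt right after 次の is rejected, the engine moves on and tries to start a match in the middle of the number, where neither lookbehind applies. Your pattern needs at least four characters before 件, so 次の1000件 leaves no room for such a match, while longer numbers do. To prevent this, also forbid starting a match right after a digit or after a digit plus a separator: add (?<![0-90-9])(?<![0-90-9][,,、]) right after (?<!次の):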

(?<!検索結果:)(?<!次の)(?<![0-90-9])(?<![0-90-9][,,、])(?:[〇一二三四五六七八九十百千万億兆0-90-9]|京+)[,,、]?.+[〇一二三四五六七八九十百千万億兆京0-90-9].+件
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

See the regex demo.
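The effect of the added lookbehinds can be checked with a small script. This is a sketch under the assumption that Python's re module is used (all lookbehinds here are fixed-width, which that module requires); the helper name find and the sample 合計1,000件 are hypothetical and not from the question.

```python
import re

# Pattern copied from the answer above (split across strings for readability).
FIXED = re.compile(
    r"(?<!検索結果:)(?<!次の)(?<![0-90-9])(?<![0-90-9][,,、])"
    r"(?:[〇一二三四五六七八九十百千万億兆0-90-9]|京+)[,,、]?"
    r".+[〇一二三四五六七八九十百千万億兆京0-90-9].+件"
)

def find(text):
    """Return the first matched substring, or None when nothing matches."""
    m = FIXED.search(text)
    return m.group(0) if m else None

# Samples from the question, plus one hypothetical sample that should still match.
for sample in ["検索結果:100,000件", "次の1000件", "次の5,0000件", "合計1,000件"]:
    print(sample, "->", find(sample))
```

The three samples from the question no longer match at any position, while a number that is not preceded by either excluded prefix is still matched in full.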

Wiktor Stribiżew answered Oct 24 '22 00:10