Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Match correct addition of two binary numbers with PCRE regex

Tags:

regex

pcre

Is it possible to match an addition in form of (?<a>[01]+)\s*\+\s*(?<b>[01]+)\s*=\s*(?<c>[01]+), where a + b == c (as in binary addition) must hold?

These should match:

0 + 0 = 0 0 + 1 = 1 1 + 10 = 11 10 + 111 = 1001 001 + 010 = 0011 1111 + 1 = 10000 1111 + 10 = 10010 

These should not match:

0 + 0 = 10 1 + 1 = 0 111 + 1 = 000 1 + 111 = 000 1010 + 110 = 1000 110 + 1010 = 1000 
like image 422
bwoebi Avatar asked Sep 18 '16 02:09

bwoebi


People also ask

What does regex 0 * 1 * 0 * 1 * Mean?

Basically (0+1)* mathes any sequence of ones and zeroes. So, in your example (0+1)*1(0+1)* should match any sequence that has 1. It would not match 000 , but it would match 010 , 1 , 111 etc. (0+1) means 0 OR 1.

How do I match a number in regex?

To match any number from 0 to 9 we use \d in regex. It will match any single digit number from 0 to 9. \d means [0-9] or match any number from 0 to 9. Instead of writing 0123456789 the shorthand version is [0-9] where [] is used for character range.

What is digit in regex?

In regex, the uppercase metacharacter is always the inverse of the lowercase counterpart. \d (digit) matches any single digit (same as [0-9] ). The uppercase counterpart \D (non-digit) matches any single character that is not a digit (same as [^0-9] ).


1 Answers

TL;DR: Yes, it indeed is possible (use with Jx flags):

(?(DEFINE) (?<add> \s*\+\s* ) (?<eq> \s*=\s* ) (?<carry> (?(cl)(?(cr)|\d0)|(?(cr)\d0|(*F))) ) (?<digitadd> (?(?= (?(?= (?(l1)(?(r1)|(*F))|(?(r1)(*F))) )(?&carry)|(?!(?&carry))) )1|0) ) (?<recursedigit>   (?&add) 0*+ (?:\d*(?:0|1(?<r1>)))? (?(ro)|(?=(?<cr>1)?))\k<r> (?&eq) \d*(?&digitadd)\k<f>\b | (?=\d* (?&add) 0*+ (?:\k<r>(?<ro>)|\d*(?<r>\d\k<r>)) (?&eq) \d*(?<f>\d\k<f>)\b) \d(?&recursedigit) ) (?<checkdigit> (?:0|1(?<l1>)) (?=(?<cl>1)?) (?<r>) (?<f>) (?&recursedigit) ) (?<carryoverflow>   (?<x>\d+) 0 (?<y> \k<r> (?&eq) 0*\k<x>1 | 1(?&y)0 ) | (?<z> 1\k<r> (?&eq) 0*10 | 1(?&z)0 ) ) (?<recurseoverflow>   (?&add) 0*+ (?(rlast) \k<r> (?&eq) 0*+(?(ro)(?:1(?=0))?|1)\k<f>\b                 | (?:(?<remaining>\d+)(?=0\d* (?&eq) \d*(?=1)\k<f>\b)\k<r> (?&eq) (*PRUNE) 0*\k<remaining>\k<f>\b                    | (?&carryoverflow)\k<f>\b)) | (?=\d* (?&add) 0*+ (?:\k<r>(?<ro>)|(?=(?:\d\k<r>(?&eq)(?<rlast>))?)\d*(?<r>\d\k<r>)) (?&eq) \d*(?<f>\d\k<f>)\b)   \d(?&recurseoverflow) ) (?<s>   (| 0*? (?<arg>[01]+) (?&add) 0+ | 0+ (?&add) 0*? (?<arg>[01]+)) (?&eq) (*PRUNE) 0* \k<arg> | 0*+   (?=(?<iteratedigits> (?=(?&checkdigit))\d (?:\b|(?&iteratedigits)) ))   (?=[01]+ (?&add) [01]+ (?&eq) [01]+ \b)   (?<r>) (?<f>) (?&recurseoverflow) ) ) \b(?&s)\b 

Live demo: https://regex101.com/r/yD1kL7/26

[Update: Due to a bug in PCRE, the code only works for all cases with PCRE JIT active; thanks to Qwerp-Derp for noticing; without JIT e.g. 100 + 101 = 1001 fails to match.]

That's quite a monster regex. So, let's build it step by step to understand what's going on.

Hint: For easier memorization and following through the explanation, let me explain the names of the single or two digit capturing group names (all except the first two are flags [see below]):

r => right; it contains the part of the right operand right to a given offset f => final; it contains the part of the final operand (the result) right to a given offset cl => carry left; the preceding digit of the left operand was 1 cr => carry right; the preceding digit of the right operand was 1 l1 => left (is) 1; the current digit of the left operand is 1 r1 => right (is) 1; the current digit of the right operand is 1 ro => right overflow; the right operand is shorter than the current offset rlast => right last; the right operand is at most as long as the current offset 

For a more readable + and = with possible leading and trailing spaces, there are two capturing groups (?<add> \s*\+\s*) and (?<eq> \s*=\s*).

We are performing an addition. As it is regex, we need to verify each digit at once. So, have a look at the math behind this:

Checking addition of a single digit

current digit = left operand digit + right operand digit + carry of last addition 

How do we know the carry?

We can simply look at the last digit:

carry =    left operand digit == 1 && right operand digit == 1         || (left operand digit == 1 || right operand digit == 1) && result digit == 0 

This logic is provided by the capturing group carry, defined as follows:

(?<carry> (?(cl)(?(cr)|\d0)|(?(cr)\d0|(*F))) ) 

with cl being whether the last left operand digit was 1 or not and cr whether the last right operand digit was 1 or not; \d0 is to check whether the last digit in the result was 0.

Note: (?(cl) ... | ...) is a conditional construct checking whether a capturing group has been defined or not. Due to capturing groups being scoped to each level of recursion, this is easily usable as a mechanism to set a boolean flag (can be set with (?<cl>) anywhere) which can later be conditionally acted upon.

Then the actual addition is a simple:

is_one = ((left operand digit == 1) ^ (right operand digit == 1)) ^ carry 

expressed by the digitadd capturing group (using a ^ b == (a && !b) || (!a && b), using l1 being whether the digit of the left operand is equal to 1 and r1 equivalently for the right digit:

(?<digitadd> (?(?= (?(?= (?(l1)(?(r1)|(*F))|(?(r1)(*F))) )(?&carry)|(?!(?&carry))) )1|0) ) 

Checking addition at a given offset

Now, that we can verify, given a defined cr, cl, l1 and r1 whether the digit at the result is correct by simply invoking (?&digitadd) at that offset.

... at that offset. That's the next challenge, finding said offset.

The fundamental issue is, given three strings with a known separator in between, how to find the correct offset from the right.

For example 1***+****0***=****1*** (the separators are + and = here, and * denoting any arbitrary digit).

Or even, more fundamentally: 1***+****0***=1.

This can be matched with:

(?<fundamentalrecursedigit>   \+ \d*(?:1(?<r1>)|0)\k<r> = (?(r1) (?(l1) 0 | 1) | (?(l1) 1 | 0) ) \b | (?=\d* + \d*(?<r>\d\k<r>) =) \d (?&fundamentalrecursedigit) ) (?<fundamentalcheckdigit>   # Note: (?<r>) is necessary to initialize the capturing group to an empty string   (?:1(?<l1>)|0) (?<r>) (?&fundamentalrecursedigit) ) 

(Many thanks here to nhahdth for his solution to this issue — altered here a little to fit the example)

First we're storing information about the digit at the current offset ((?:1(?<l1>)|0) - set the flag (i.e. capturing group which can be checked against with (?(flag) ... | ...)) l1 when the current digit is 1.

Then we build the string to the right of the searched offset of the right operand recursively with (?=\d* + \d*(?<r>\d\k<r>) =) \d (?&fundamentalrecursedigit), which advances by one digit (on the left operand) on each level of recursion and adds one digit more to the right part of the right operand: (?<r> \d \k<r>) re-defines the r capturing group and adds another digit to the already existing capture (referenced with \k<r>).

Thus, as this recurses as long as there are digits on the left operand and exactly one digit is added to the r capturing group per level of recursion, that capturing group will contain exactly as many characters as there were digits on the left operand.

Now, at the end of the recursion (i.e. when the separator + is reached), we can go straight to the correct offset via \d*(?:1(?<r1>)|0)\k<r>, as the searched digit will now be exactly the digit before what the capturing group r matched.

Having now also the r1 flag conditionally set, we can go to the end, checking the result for correctness with simple conditionals: (?(r1) (?(l1) 0 | 1) | (?(l1) 1 | 0).

Given this, it is trivial to extend this to 1***+****0***=****1***:

(?<fundamentalrecursedigit>   \+ \d*(?:1(?<r1>)|0)\k<r> = \d*(?(r1) (?(l1) 0 | 1) | (?(l1) 1 | 0) )\k<f> \b | (?=\d* + \d*(?<r>\d\k<r>) = \d*(?<f>\d\k<f>)\b) \d (?&fundamentalrecursedigit) ) (?<fundamentalcheckdigit>   (?:1(?<l1>)|0) (?<r>) (?<f>) (?&fundamentalrecursedigit) ) 

by using the exactly same trick, collecting also the right part of the result in an f capturing group and accessing the offset right before what this capturing group f matched.

Let's add support for a carry, which really is just setting the cr and cl flags whether the next digit was 1 via (?=(?<cr/cl>1)?) after the current digits of the left and right operands:

(?<carryrecursedigit>   \+ \d* (?:1(?<r1>)|0) (?=(?<cr>1)?) \k<r> = \d* (?&digitadd) \k<f> \b | (?=\d* + \d*(?<r>\d\k<r>) = \d*(?<f>\d\k<f>)\b) \d (?&carryrecursedigit) ) (?<carrycheckdigit>   (?:1(?<l1>)|0) (?=(?<cl>1)?) (?<r>) (?<f>) (?&carryrecursedigit) ) 

Checking inputs of equal lengths

Now, we could be done here if we were to pad all the inputs with enough zeroes:

\b (?=(?<iteratedigits> (?=(?&carrycheckdigit)) \d (?:\b|(?&iteratedigits)) )) [01]+ (?&add) [01]+ (?&eq) [01]+ \b 

i.e. assert for each digit of the left operand recursively that the addition can be performed and then succeed.

But obviously, we aren't done yet. What about:

  1. the left operand being longer than the right?
  2. the right operand being longer than the left?
  3. the left operand being as long or longer than the right operand and the result having a carry at the most significant digit (i.e. needs a leading 1)?

Handle a left operand longer than the right one

That one is pretty trivial, just stop trying to add digits to the r capturing group when we have captured it completely, set a flag (here: ro) to not consider it eligible for carry anymore and make a digit leading r optional (by (?:\d* (?:1(?<r1>)|0))?):

(?<recursedigit>   \+ (?:\d* (?:1(?<r1>)|0))? (?(ro)|(?=(?<cr>1)?)) \k<r> = \d* (?&digitadd) \k<f> \b | (?=\d* + (?:\k<r>(?<ro>)|\d*(?<r>\d\k<r>)) = \d*(?<f>\d\k<f>)\b) \d (?&recursedigit) ) (?<checkdigit>   (?:1(?<l1>)|0) (?=(?<cl>1)?) (?<r>) (?<f>) (?&recursedigit) ) 

This is now handling the right operand as if it were zero-padded; r1 and cr now just never will be set after that point. Which is all what we need.

It might be easy to be confused here why we can set the ro flag immediately when exceeding the right operators length and immediately ignore the carry; the reason is checkdigit already consuming the first digit at the current position, thus we are actually already more than one digit past the end of the right operand.

The right operand happens to be longer than the left

This is now a bit harder. We can't cram it into recursedigit as that one will just iterate as often as there are digits in the left operand. Thus, we need a separate match for that.

There are now a few cases to consider:

  1. There is a carry from the addition of the most significant digit of the left operand
  2. There is no carry.

For the former case we want to match 10 + 110 = 1000, 11 + 101 = 1000; for the latter case we want to match 1 + 10 = 11 or 1 + 1000 = 1001.

To simplify our task, we're going to ignore leading zeroes for now. Then we know that the most significant digit is 1. There is now no carry only and only if:

  • the digit at the current offset (i.e. the offset of the most significant digit of the left operand) in the right operand is 0
  • and there was no carry from the previous offset, meaning the digit at the current in the result is 1.

This translates into the following assertion:

\d+(?=0)\k<r> (?&eq) \d*(?=1)\k<f>\b 

In that case, we can capture the first \d+ with (?<remaining>\d+) and require it to be in front of \k<f> (the part on the right of the current offset of the result):

(?<remaining>\d+)\k<r> (?&eq) \k<remaining>\k<f>\b 

Otherwise, if there is a carry, we need to increment the left part of the right operand:

(?<carryoverflow>   (?<x>\d+) 0 (?<y> \k<r> (?&eq) \k<x>1 | 1(?&y)0 ) | (?<z> 1\k<r> (?&eq) 10 | 1(?&z)0 ) ) 

and use it as:

(?&carryoverflow)\k<f>\b 

This carryoverflow capturing group works by copying the left part of the right operand, finding the last zero there and replacing all ones less significant than that zero by zeroes and the zero by an one. In case there is no zero in that part, the ones are simply all replaced by a zero and a leading one is added.

Or to express it less wordily (with n being arbitrary, so that it fits):

  (?<x>\d+) 0 1^n \k<r> (?&eq) \k<x> 1 0^n \k<f>\b | 1^n \k<r> (?&eq) 1 0^n \k<f>\b 

So, let's apply our usual technique to figure out the parts on the right of the operands:

(?<recurselongleft>   (?&add) (?:(?<remaining>\d+)(?=(?=0)\k<r> (?&eq) \d*(?=1)\k<f>\b)\k<r> (?&eq) (*PRUNE) \k<remaining>\k<f>\b             | (?&carryoverflow)\k<f>\b) | (?=\d* (?&add) \d*(?<r>\d\k<r>) (?&eq) \d*(?<f>\d\k<f>)\b) \d(?&recurselongleft) ) 

Note that I've added a (*PRUNE) after (?&eq) in the first branch in order to avoid backtracking into the second branch with carryoverflow, in case there happens to be no carry and the result not matching.

Note: We do not do any checks against the \k<f> part here, this is managed by the carrycheckdigit capturing group from above.

The case of the leading 1

We sure do not want 1 + 1 = 0 to be matched. Which it is though if we go by checkdigit alone. So, there are different circumstances when that leading 1 is necessary (if not covered yet by the previous case of the right operand being longer):

  • Both operands (being without leading zeroes) are of equal length (i.e. they both have an 1 at their most significant digit, which, added together, leaves a carry)
  • The left operand is longer and there is a carry at the most significant digit or both strings are just as long.

The former case handles inputs like 10 + 10 = 100, the second case handles 110 + 10 = 1000 as well as 1101 + 11 = 10100 and the last one trivially 111 + 10 = 1001.

The first case can be handled by setting a flag ro whether the left operand is longer than the right one, which then can be checked against at the end of recursion:

(?<recurseequallength>   (?&add) \k<r> (?&eq) (?(ro)|1)\k<f>\b | (?=\d* (?&add) (?:\k<r>(?<ro>) | \d*(?<r>\d\k<r>)) (?&eq) \d*(?<f>\d\k<f>)\b) \d(?&recurseequallength) ) 

The second case means we just need to check for the existence of a carry in case of ro (i.e. the right operand being shorter). The carry can as usually be checked (as the most significant digit is always 1 and the right operands digit is implicitly 0 then) with a trivial (?:1(?=0))?\k<f>\b - if there was a carry the digit at the current offset in the result will be 0.

That's easily possible, as, after all, all the other digits up to the current offset will all be verified by checkdigit before. Hence we could just check the carry locally here.

Thus we can add this to the first branch of the recurseequallength alternation:

(?<recurseoverflow>   (?&add) (?(rlast) \k<r> (?&eq) (?(ro)(?:1(?=0))?|1)\k<f>\b                 | (?:(?<remaining>\d+)(?=0\d* (?&eq) \d*(?=1)\k<f>\b)\k<r> (?&eq) (*PRUNE) \k<remaining>\k<f>\b                    | (?&carryoverflow)\k<f>\b)) | (?=\d* (?&add) (?:\k<r>(?<ro>)|(?=(?:\d\k<r>(?&eq)(?<rlast>))?)\d*(?<r>\d\k<r>)) (?&eq) \d*(?<f>\d\k<f>)\b)   \d(?&recurseoverflow) ) 

Then to wire everything together: first check checkdigit for each individual digit (same as for the simple zero-padded case from before) and then initialize the different capturing groups used by recurseoverflow:

\b (?=(?<iteratedigits> (?=(?&checkdigit))\d (?:\b|(?&iteratedigits)) )) (?=[01]+ (?&add) [01]+ (?&eq) [01]+ \b) (?<r>) (?<f>) (?&recurseoverflow) \b 

What about the zeroes?

0 + x = x and x + 0 = x are still unhandled and will fail.

Instead of hacking out big capturing groups to handle it uglily, we resort to handling them manually:

(0*? (?<arg>[01]+) (?&add) 0+ | 0+ (?&add) 0*? (?<arg>[01]+)) (?&eq) 0* \k<arg> 

Note: When using in alternation with the main branch, we need to put a (*PRUNE) after the (?&eq) in order to avoid jumping into that main branch when any operand is zero and the match failing.

Now, we also always assumed there were no leading zeroes in the inputs for simplifying our expressions. If you look at the initial regex, you will find many occurrences of 0* and 0*+ (possessive to avoid backtracking into it and … unexpected things happening) in order to skip the leading zeroes as we are assuming at some places that the leftmost digit is an 1.

Conclusion

And that's it. We achieved matching only correct additions of binary numbers.

A small note about the relatively new J flag: it allows to redefine capturing groups. This is in first place important in order to initialize the capturing groups to an empty value. Additionally it simplifies some conditionals (like addone) as we only have to check for one value instead of two. Compare (?(a) ... | ...) vs. (?(?=(?(a)|(?(b)|(*F)))) ... | ...). Also, it is not possible to reorder the multiply defined capturing groups arbitrarily inside the (?(DEFINE) ...) construct.

Final note: Binary addition is not a Chomsky type 3 (i.e. regular) language. This is a PCRE specific answer, using PCRE specific features. [Other regex flavors like .NET may be able to solve it too, but not all may.]

If there are any questions, please leave a comment, I'll then try to clarify this in this answer.

like image 75
bwoebi Avatar answered Sep 20 '22 20:09

bwoebi