Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

nul bytes in regexp MATLAB

Tags:

Can someone explain what MATLAB is doing with nul bytes (x00) in regular expressions?

Examples:

>> regexp(char([0 0 0 0 0 0 0 1 0 0 10 0 0 0]),char([0 0 0 0 46 0 0 10]))
ans =
      1  % current
      4  % expected

>> regexp(char([0 0 0 1 0 0 0 1 0 0 10 0 0 0]),char([1 0 0 0 46 0 0 10]))
ans =
      4  % current
      4  % expected

>> regexp(char([0 0 0 1 0 0 0 1 0 0 10 0 0 0]),char([0 0 0 0 46 0 0 10]))
ans =
      [] % current
      [] % expected

>> regexp(char([0 0 0 0 10 0 0 1 0 0 10 0 0 0]),char([0 0 0 0 46 0 0 10]))
ans =
      1  % current
      [] % expected

>> regexp(char([0 0 0 0 0 0 0 1 0 0 10 0 0 0]),char([1 0 0 0 46 0 0 10]))
ans =
      [] % current
      [] % expected

The answer might simply be, MATLAB regular expression isn't meant to handle non printable characters, but I would assume it would error if this was the case.

EDIT: The 46 is expected to be '.' as in the regex wildcard.

EDIT2:

>> regexp(char([0 0 0 0 50 0 0 100 0 0 90 0 0 0]),char([0 0 46 0 0 90]))
ans =
     1    9

I realized it could have been 10 being a special character so this one has only printable and nul bytes. I would expect this one to only match 9 because the fifth character 50 does not match 0.

like image 480
horriblyUnpythonic Avatar asked Mar 23 '15 17:03

horriblyUnpythonic


People also ask

What does Regexp do in Matlab?

Description. startIndex = regexp( str , expression ) returns the starting index of each substring of str that matches the character patterns specified by the regular expression.

How do you denote special characters in regex?

To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (back-slash).

What is a regex token?

A regex is a specification of a pattern to be matched in the searched text. This pattern consists of a sequence of tokens, each being able to match a single character or a sequence of characters in the text, or assert that a specific position within the text has been reached (the latter is called an anchor.)


1 Answers

this bug is probably already fixed. I tested your example from Matlab Central in several versions:

in R2013b:

>> regexp(char([0 0 1 0  41 41 41 41 41 41]),char([0 '.' 0  40 40 40 40]))    
ans =

     2

in R2015a:

>> regexp(char([0 0 1 0  41 41 41 41 41 41]),char([0 '.' 0  40 40 40 40]))   
ans =  

     2

in R2016a:

>> regexp(char([0 0 1 0  41 41 41 41 41 41]),char([0 '.' 0  40 40 40 40]))
ans = 

     []
like image 51
tvo Avatar answered Oct 17 '22 07:10

tvo