
How to validate a formatSpec string in MATLAB?

Many MATLAB functions for writing or reading strings (to name a few: sscanf, sprintf, textscan), like their counterparts in other languages with a C-derived scanf/printf, rely on the user supplying a valid formatSpec string, which tells the function the structure of the string to build or the string to parse. I'm looking for a way to validate such a formatSpec string before using it in a call to sprintf.

In the case of sprintf, the structure of formatSpec is described in the documentation and is as follows:

[Figure: structure of MATLAB's sprintf formatSpec]

Specifically I'd like to point out two aspects of formatSpec:

  • (✓) A formatting operator starts with a percent sign, %, and ends with a conversion character.
  • (x) formatSpec can also include additional text before a percent sign, %, or after a conversion character.

The solution I was thinking about involves using a regular expression to test the passed-in string. What I have so far is an expression that seems to be able to match everything between the initial % and the conversion character, but not the "additional text" that may appear.

(%{1}(\d+\$)?[-+\s0#]*(\d+|\*)?(\.\d+)?[bt]?[diuoxXfeEgGcs]+)+

I wanted to also add the ability to capture "any printable text characters besides %, ' and \, unless these characters appear exactly twice". This needs to be captured both before the initial % and after the conversion character.

  • any printable character: [ -~]
  • besides %, ' and \: (?![\\%'])
  • these characters appear exactly twice: ( §§§§ |'{2}|\\{2}|%{2}) (§ = placeholder)

I am having a problem with the "unless", that is, getting the negative look-ahead to discard single occurrences but allow double occurrences of the specified characters.

I have two questions:

  1. Is there a better way to validate formatSpec strings (i.e. w/o regex, or using a better one)?
  2. How do I fix my regex so that it works as described?

Disambiguation note: in the case of a formatSpec string that has "free text" on both sides of formatting operators, the text should be considered a part of the next formatting operator, unless there are none left. Below is an example of how a formatSpec string should be split using the regex (where | is the first char of each match):

Color %s, number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.
|       |           |             |        |            |                   |
Dev-iL asked Jul 03 '16 11:07


2 Answers

I've spent a bit of time on this, and I think I'm close, so I'll write up my current progress in an answer. I'm fairly certain it could still be improved.

First, the code, using the nice example string by Ro Yo Mi:

% valid input
sample_good = 'Color %s, we are looking for %%02droids %% number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.';
% invalid input: "%02 droids" has a single percent sign which is not part of an operator
sample_bad = 'Color %s, we are looking for %02 droids %% number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.';

group_from = '(';
group_to = ')';
printable = '([ -$&-\[\]-~]|%%|\\\\)*';
atomic_op = '(?<!%)%(\d+\$)?[ +#-]*(\d+|\*)?(\.\d*)?[bt]?[diuoxXfeEgGcs]';

% pattern for full validation
full_patt = ['^' group_from printable atomic_op group_to '*' printable '$'];
% pattern for splitting valid strings
part_patt = [printable atomic_op];

% examples
matches_full_bad = regexp(sample_bad,full_patt);            % no match
matches_full_good = regexp(sample_good,full_patt);          % match
matches_parts_good = regexp(sample_good,part_patt,'match'); % sliced matches

The first example string is valid; the second is broken because it contains %02 droids. I defined a few auxiliary patterns; note that most of these already contain groups. The printable pattern accepts every printable ASCII character except % and \, plus the doubled forms %% and \\. Note that matching a double backslash takes four backslashes (two escaped backslashes) in the search expression.
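The alternation inside printable is what gives the "single occurrence forbidden, doubled occurrence allowed" behaviour without any lookahead. Here is a quick check, as a sketch only: text_only is a made-up anchored pattern for operator-free text, and the three test strings are my own.

```matlab
% printable as defined in the answer's code
printable = '([ -$&-\[\]-~]|%%|\\\\)*';
% hypothetical helper pattern: the whole input must be operator-free text
text_only = ['^' printable '$'];

ok_percent   = ~isempty(regexp('100%% sure', text_only, 'once'));  % true: %% is allowed
ok_backslash = ~isempty(regexp('C:\\temp',   text_only, 'once'));  % true: \\ is allowed
ok_single    = ~isempty(regexp('100% sure',  text_only, 'once'));  % false: lone % is rejected
```

The character class simply leaves out % and \, and the doubled forms are added back as separate alternatives, so a single % or \ can never be consumed.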

What I call atomic_op is a pattern that matches format operators starting with a percent sign and ending with a conversion character. It uses a negative lookbehind to avoid matching fake format operators that start with %%. I took some shortcuts out of laziness (for instance, the subtype is not tied to the conversion character, so te is accepted in my version). It should be quite functional for not-too-evil inputs.
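The te shortcut could be closed by tying the subtype to the conversion characters it belongs with. A minimal sketch, assuming (as I read the sprintf documentation) that the b and t subtypes are only meant to precede o, u, x and X; strict_op is a made-up name and not part of the code above:

```matlab
% Hypothetical stricter variant of atomic_op: a subtype (b or t) is only
% accepted directly in front of o, u, x or X; any other conversion
% character has to appear without a subtype.
strict_op = '(?<!%)%(\d+\$)?[ +#-]*(\d+|\*)?(\.\d*)?([bt][ouxX]|[diuoxXfeEgGcs])';

rejects_te = isempty(regexp('%te', strict_op, 'once'));     % true
accepts_bx = ~isempty(regexp('%bx', strict_op, 'once'));    % true
accepts_f  = ~isempty(regexp('%5.2f', strict_op, 'once'));  % true
```

Swapping strict_op in for atomic_op in full_patt and part_patt should work unchanged, since only the tail of the operator pattern differs.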

The most important parts are full_patt and part_patt. The former tries to match a full format spec in order to determine whether it is valid. Unfortunately, for nested groups MATLAB only stores tokens for the outermost level, which is not useful in our case. This is where part_patt comes into play: it only matches "printable_string format_operator". Used together with full_patt, it can slice the full string into meaningful pieces. Note that part_patt will often match inside an invalid string too, at its locally valid positions, so the two really have to be used together.
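To illustrate why part_patt cannot be trusted on its own, here is a small sketch with a made-up invalid input (the string bad is my own, the patterns are the ones defined above):

```matlab
printable = '([ -$&-\[\]-~]|%%|\\\\)*';
atomic_op = '(?<!%)%(\d+\$)?[ +#-]*(\d+|\*)?(\.\d*)?[bt]?[diuoxXfeEgGcs]';
full_patt = ['^(' printable atomic_op ')*' printable '$'];
part_patt = [printable atomic_op];

% made-up invalid input: "%02 droids" is not a complete operator
bad = 'Color %s, we have %02 droids and %d left';

is_valid = ~isempty(regexp(bad, full_patt, 'once'));   % false: rejected as a whole
pieces   = regexp(bad, part_patt, 'match');            % still returns two pieces:
% {'Color %s', '02 droids and %d'}
```

The second piece silently skips the stray %, which is exactly the kind of locally valid match that only the full_patt check catches.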

Consider the specific example above:

>> matches_full_bad

matches_full_bad =

     []

>> matches_full_good

matches_full_good =

     1

>> matches_parts_good{:}

ans =

Color %s


ans =

, we are looking for %%02droids %% number1 %d


ans =

, number2 %05d


ans =

, hex %#x


ans =

, float %5.2f


ans =

, unsigned value %u

Let's analyse the results. With the full pattern, the "bad" sample returns a (falsy) empty vector, while the "good" sample returns a (truthy) 1. The partial pattern then correctly returns each relevant subpattern of the input. Note, however, that the final period at the end of the sentence is missing from the result, since we matched blocks of the form printable atomic_op. Since we know that we're working with a valid string, the rest of the string (after the final match) should be assigned either to a new match or to the final one, depending on your preference.
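One possible way to pick up that trailing text is to also ask regexp for the end index of each match. This is only a sketch: fmt is a shortened example string of my own and is assumed to have passed the full_patt check already.

```matlab
printable = '([ -$&-\[\]-~]|%%|\\\\)*';
atomic_op = '(?<!%)%(\d+\$)?[ +#-]*(\d+|\*)?(\.\d*)?[bt]?[diuoxXfeEgGcs]';
part_patt = [printable atomic_op];

fmt = 'Color %s, float %5.2f, unsigned value %u.';

% 'match' gives the pieces, 'end' gives the index of each piece's last character
[pieces, last] = regexp(fmt, part_patt, 'match', 'end');

if isempty(last)
    tail = fmt;                    % no operators at all: everything is plain text
else
    tail = fmt(last(end)+1:end);   % whatever follows the final operator
end
if ~isempty(tail)
    pieces{end+1} = tail;          % alternative: pieces{end} = [pieces{end} tail];
end
```

Here the tail becomes a separate fourth piece ('.'); the commented alternative glues it onto the last operator instead, which matches the splitting convention described in the question.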

Just for clarity, here's how I imagine this to work:

for sample = {sample_bad, sample_good}
    fmt = sample{1};   % unwrap the 1x1 cell that the loop variable holds
    if ~isempty(regexp(fmt, full_patt, 'once'))
        disp('Match found!');
        matches = regexp(fmt, part_patt, 'match');
        for k = 1:numel(matches)
            fprintf('Format substring #%d: %s\n', k, matches{k});
        end
        %TODO: treat final printable part of the string
    else
        disp('Uh-oh, no match!')
    end
end

Description

((?:[ -$&(-[\]-~]|([%'\\])\2)*(%(\d+\$)?[-+\s0#]?(\d+|\*)?(\.\d+)?[bt]?[diuoxXfeEgGcs]+)+(?:(?!(?:[ -$&(-[\]-~]|([%'\\])\7)*(?:%(?:\d+\$)?[-+\s0#]?(?:\d+|\*)?(?:\.\d+)?[bt]?[diuoxXfeEgGcs]+)+)(?:[ -$&(-[\]-~]|([%'\\])\8)*)?)

[Figure: regular expression visualization]

This regular expression will do the following:

  • (?:[ -$&(-[\]-~]|([%'\\])\2)* will match all printable characters from space to ~, except %, \, ' unless they appear exactly twice
  • (%(\d+\$)?[-+\s0#]?(\d+|\*)?(\.\d+)?[bt]?[diuoxXfeEgGcs]+)+ is your expression, except that the flags class is followed by ? here instead of *
  • (?: starts the non-capture group
    • (?!(?:[ -$&(-[\]-~]|([%'\\])\7)*(?:%(?:\d+\$)?[-+\s0#]?(?:\d+|\*)?(?:\.\d+)?[bt]?[diuoxXfeEgGcs]+)+) looks ahead to see if there are more format strings
    • (?:[ -$&(-[\]-~]|([%'\\])\8)*)? if there weren't more format strings above, then this will capture the remaining printable characters
    • ) closes the outermost capture group (group 1)

Example

Live Demo

https://regex101.com/r/sV4eX3/2

Sample text

Color %s, we are looking for %%02droids %% number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.

Sample Matches

MATCH 1
1.  [0-8]   `Color %s`
3.  [6-8]   `%s`

MATCH 2
1.  [8-53]  `, we are looking for %%02droids %% number1 %d`
2.  [40-41] `%`
3.  [51-53] `%d`

MATCH 3
1.  [53-67] `, number2 %05d`
3.  [63-67] `%05d`
5.  [65-66] `5`

MATCH 4
1.  [67-76] `, hex %#x`
3.  [73-76] `%#x`

MATCH 5
1.  [76-89] `, float %5.2f`
3.  [84-89] `%5.2f`
5.  [85-86] `5`
6.  [86-88] `.2`

MATCH 6
1.  [89-109]    `, unsigned value %u.`
3.  [106-108]   `%u`
Ro Yo Mi answered Nov 19 '22 13:11