Many of the functions in MATLAB (and in other languages that have a C-derived scanf/printf) used for writing or reading strings (to name a few: sscanf, sprintf, textscan) rely on the user supplying a valid formatSpec string, which tells the function the structure of the string to build or the string to parse. I'm looking for a way to validate such a formatSpec string before using it in a call to sprintf.
In the case of sprintf, the structure of formatSpec is described in the documentation: a formatting operator is made up of %, an optional identifier, optional flags, an optional field width, an optional precision, an optional subtype, and a conversion character, in that order.
Specifically I'd like to point out two aspects of formatSpec:
- (✓) A formatting operator starts with a percent sign, %, and ends with a conversion character.
- (x) formatSpec can also include additional text before a percent sign, %, or after a conversion character.
The solution I was thinking about involves using a regular expression to test the passed-in string. What I have so far is an expression that seems to be able to match everything between the initial % and the conversion character, but not the "additional text" that may appear.
(%{1}(\d+\$)?[-+\s0#]*(\d+|\*)?(\.\d+)?[bt]?[diuoxXfeEgGcs]+)+
I also wanted to add the ability to capture "any printable text characters besides %, ' and \, unless these characters appear exactly twice". This needs to be captured both before the initial % and after the conversion character.
- Any printable character: [ -~]
- Excluding %, ' and \: (?![\\%'])
- Unless doubled: ( §§§§ |'{2}|\\{2}|%{2}) (§ = placeholder)
I am having a problem with the "unless", that is, getting the negative look-ahead to discard single occurrences but allow double occurrences of the specified characters.
Is there a better way to validate formatSpec strings (i.e. without regex, or using a better one)?
Disambiguation note: in the case of a formatSpec string that has "free text" on both sides of formatting operators, the text should be considered a part of the next formatting operator, unless there are none left. Below is an example of how a formatSpec string should be split using the regex (where | marks the first char of each match):
Color %s, number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.
| | | | | | |
I've spent a bit of time on this, and I think I'm close, so I'll write up my current progress in an answer. I'm fairly certain it could still be improved.
First, the code, using the nice example string by Ro Yo Mi:
% valid input
sample_good = 'Color %s, we are looking for %%02droids %% number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.';
% invalid input: "%02 droids" has a single percent sign which is not part of an operator
sample_bad = 'Color %s, we are looking for %02 droids %% number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.';
group_from = '(';
group_to = ')';
printable = '([ -$&-\[\]-~]|%%|\\\\)*';
atomic_op = '(?<!%)%(\d+\$)?[ +#-]*(\d+|\*)?(\.\d*)?[bt]?[diuoxXfeEgGcs]';
% pattern for full validation
full_patt = ['^' group_from printable atomic_op group_to '*' printable '$'];
% pattern for splitting valid strings
part_patt = [printable atomic_op];
% examples
matches_full_bad = regexp(sample_bad,full_patt); % no match
matches_full_good = regexp(sample_good,full_patt); % match
matches_parts_good = regexp(sample_good,part_patt,'match'); % sliced matches
The first example string is valid; the second is broken because %02 droids is part of the string. I defined a few auxiliary patterns; note that most of these already contain groups. The printable pattern accepts every printable ASCII character except % and \, plus the doubled forms %% and \\. Note that in order to match a double backslash, we need four backslashes (two escaped backslashes) in the search expression.
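To make the escaping concrete, here is a small sketch that anchors the printable pattern on its own and tries it on a few strings. The helper name is_text and the test strings are made up for illustration; only the pattern itself comes from the code above.

```matlab
% Printable ASCII except % and \, plus the doubled forms %% and \\
% (same pattern as above, anchored so the whole string must match).
printable = '([ -$&-\[\]-~]|%%|\\\\)*';
anchored  = ['^' printable '$'];

% MATLAB char literals do not process backslash escapes, so 'C:\\temp'
% holds two literal backslashes; the regex needs \\\\ to match them.
% is_text is a hypothetical helper, not part of the original answer.
is_text = @(s) ~isempty(regexp(s, anchored, 'once'));

ok_percent    = is_text('100%% sure');   % doubled percent sign: accepted
ok_backslash  = is_text('C:\\temp');     % doubled backslash: accepted
bad_percent   = is_text('100% sure');    % single percent sign: rejected
bad_backslash = is_text('C:\temp');      % single backslash: rejected
```

A single quote is still accepted here, since this pattern only special-cases % and \.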
What I call atomic_op is a pattern that matches format operators starting with a percent sign and ending with a conversion character. It uses a negative lookbehind to avoid matching fake format operators starting with %%. I took some shortcuts due to laziness (such as te being accepted in my version). It should be quite functional for not-too-evil inputs.
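If you want to see which part of an operator ended up where, one option is a variant of atomic_op with named tokens. This is only a sketch: the names op_patt, identifier, flags, width, precision, subtype and conversion are my own labels, and the lookbehind is dropped because the pattern is applied to a single isolated operator.

```matlab
% Hypothetical variant of atomic_op with named tokens, so each optional
% part of a single operator can be inspected separately.
op_patt = ['%(?<identifier>\d+\$)?(?<flags>[ +#0-]*)(?<width>\d+|\*)?' ...
           '(?<precision>\.\d*)?(?<subtype>[bt])?(?<conversion>[diuoxXfeEgGcs])'];

% 'names' returns a struct with one field per named token;
% 'once' restricts the search to the first match.
parts = regexp('%-8.3f', op_patt, 'names', 'once');

disp(parts.flags)        % flag characters
disp(parts.width)        % field width
disp(parts.precision)    % precision, including the leading dot
disp(parts.conversion)   % conversion character
```

Like the original, this accepts combinations the documentation does not allow (a subtype together with e, for instance), so it is useful for inspection rather than strict validation.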
The most important parts are full_patt and part_patt. The former tries to match a full format spec, in order to determine whether it's valid. Unfortunately, in the case of nested groups MATLAB only stores tokens for the outermost level; in our case this would not be useful. This is where part_patt comes into play: it only matches "printable_string format_operator". Used together with full_patt, it can slice the full string into meaningful pieces. Note that part_patt will often match an invalid string too, at its locally valid positions, so the two really have to be used together.
Consider the specific example above:
>> matches_full_bad
matches_full_bad =
[]
>> matches_full_good
matches_full_good =
1
>> matches_parts_good{:}
ans =
Color %s
ans =
, we are looking for %%02droids %% number1 %d
ans =
, number2 %05d
ans =
, hex %#x
ans =
, float %5.2f
ans =
, unsigned value %u
Let's analyse the results. For the full pattern, the "bad" string returns a (falsy) empty vector, while the "good" string returns a (truthy) 1. The partial pattern then correctly returns each relevant substring of the input. Note, however, that the final period at the end of the sentence is missing from the result, since we matched blocks of "printable atomic_op". Since we know that we're working with a valid string, the rest of the string (after the final match) should be assigned either to a new match or to the final one, depending on your preference.
Just for clarity, here's how I imagine this to work:
for sample={sample_bad,sample_good},
    if regexp(sample{1},full_patt)
        disp('Match found!');
        matches = regexp(sample,part_patt,'match');
        matches = matches{1}; % strip outermost singleton cell dimension
        for k=1:length(matches)
            fprintf('Format substring #%d: %s\n',k, matches{k});
        end
        %TODO: treat final printable part of the string
    else
        disp('Uh-oh, no match!')
    end
end
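The trailing text left as a TODO in the loop can be handled in either of the two ways mentioned earlier. The sketch below shows both on a shortened format string that is assumed to have already passed full_patt. The variable names (fmt, tail, tail_patt and so on) are illustrative, and the second option is my own small extension of part_patt rather than something from the code above.

```matlab
printable = '([ -$&-\[\]-~]|%%|\\\\)*';
atomic_op = '(?<!%)%(\d+\$)?[ +#-]*(\d+|\*)?(\.\d*)?[bt]?[diuoxXfeEgGcs]';
part_patt = [printable atomic_op];
fmt = 'Color %s, float %5.2f, unsigned value %u.';  % assumed already validated

% Option 1: keep the trailing text as a separate, final piece.
[pieces, last_idx] = regexp(fmt, part_patt, 'match', 'end');
if isempty(last_idx)
    tail = fmt;                        % no operators at all
else
    tail = fmt(last_idx(end)+1:end);   % text after the last operator
end
if isempty(tail)
    pieces_separate = pieces;
else
    pieces_separate = [pieces {tail}];
end

% Option 2: let the last operator absorb the trailing text, using an
% optional group that only succeeds if printable text reaches the end.
tail_patt = [part_patt '(?:' printable '$)?'];
pieces_attached = regexp(fmt, tail_patt, 'match');
```

With option 2 the pieces concatenate back to the original string, which is a convenient sanity check after splitting.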
((?:[ -$&(-[\]-~]|([%'\\])\2)*(%(\d+\$)?[-+\s0#]?(\d+|\*)?(\.\d+)?[bt]?[diuoxXfeEgGcs]+)+(?:(?!(?:[ -$&(-[\]-~]|([%'\\])\7)*(?:%(?:\d+\$)?[-+\s0#]?(?:\d+|\*)?(?:\.\d+)?[bt]?[diuoxXfeEgGcs]+)+)(?:[ -$&(-[\]-~]|([%'\\])\8)*)?)
This regular expression will do the following:
- (?:[ -$&(-[\]-~]|([%'\\])\2)* will match all printable characters from space to ~, except %, \, ' unless they appear exactly twice
- (%(\d+\$)?[-+\s0#]?(\d+|\*)?(\.\d+)?[bt]?[diuoxXfeEgGcs]+)+ is your expression
- (?: starts the non-capture group
- (?!(?:[ -$&(-[\]-~]|([%'\\])\7)*(?:%(?:\d+\$)?[-+\s0#]?(?:\d+|\*)?(?:\.\d+)?[bt]?[diuoxXfeEgGcs]+)+) looks ahead to see if there are more format strings
- (?:[ -$&(-[\]-~]|([%'\\])\8)*)? if there weren't more format strings above, then this will capture the remaining printable characters
- ) end of the capture group
Live Demo
https://regex101.com/r/sV4eX3/2
Sample text
Color %s, we are looking for %%02droids %% number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.
Sample Matches
MATCH 1
1. [0-8] `Color %s`
3. [6-8] `%s`
MATCH 2
1. [8-53] `, we are looking for %%02droids %% number1 %d`
2. [40-41] `%`
3. [51-53] `%d`
MATCH 3
1. [53-67] `, number2 %05d`
3. [63-67] `%05d`
5. [65-66] `5`
MATCH 4
1. [67-76] `, hex %#x`
3. [73-76] `%#x`
MATCH 5
1. [76-89] `, float %5.2f`
3. [84-89] `%5.2f`
5. [85-86] `5`
6. [86-88] `.2`
MATCH 6
1. [89-109] `, unsigned value %u.`
3. [106-108] `%u`