
How to remove trailing comments via regexp?

For non-MATLAB-savvy readers: I'm not sure what family they belong to, but MATLAB's regexes are described here in full detail. MATLAB's comment character is % (percent) and its string delimiter is ' (apostrophe). A string delimiter inside a string is written as a doubled apostrophe ('this is how you write "it''s" in a string.'). To complicate matters further, the matrix transpose operators also use apostrophes: A' (Hermitian, i.e. complex conjugate transpose) and A.' (regular, non-conjugate transpose).

Now, for dark reasons (that I will not elaborate on :), I'm trying to interpret MATLAB code in MATLAB's own language.

Currently I'm trying to remove all trailing comments in a cell-array of strings, each containing a line of MATLAB code. At first glance, this might seem simple:

>> str = 'simpleCommand(); % simple trailing comment';
>> regexprep(str, '%.*$', '')
ans =
    simpleCommand(); 

But of course, something like this might come along:

>> str = ' fprintf(''%d%*c%3.0f\n'', value, args{:}); % Let''s do this! ';
>> regexprep(str, '%.*$', '') 
ans = 
    fprintf('        %//   <-- WRONG!

Obviously, we need to exclude from the match all comment characters that reside inside strings, while also taking into account that a single apostrophe (or a dot-apostrophe) directly following a statement is an operator, not a string delimiter.

Based on the assumption that the number of string opening/closing characters before the comment character must be even (which I know is incomplete, because of the matrix-transpose operator), I conjured up the following dynamic regex to handle this sort of case:

>> str = {
       'myFun( {''test'' ''%''}); % let''s '                 
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % it''s '        
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % let''s '       
       'sprintf(str, ''%*8.0f%*s%c%3d\n'');  '
       'A = A.'';%tight trailing comment'
   };
>> 
>> C = regexprep(str, '(^.*)(?@mod(sum(\1==''''''''),2)==0;)(%.*$)', '$1')

However,

C = 
    'myFun( {'test' '%'}); '              %// success
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '  %// success
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '  %// success
    'sprintf(str, '%*8.0f%*s%c'           %// FAIL
    'A = A.';'                            %// success (although I'm not sure why)

so I'm almost there, but not quite yet :)
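
The even-count assumption stated above can also be written as a plain function, which sidesteps the dynamic expression entirely. The sketch below is not from the original post: stripByQuoteParity is a hypothetical helper name, and it inherits the same known weakness (a transpose apostrophe flips the count). It may also be worth noting that, according to the MATLAB regexp documentation, (?@cmd) executes the command but discards its output, so the dynamic expression above may not be constraining the match at all; the greedy (^.*) would then simply cut at the last % on the line, which is consistent with the failing fourth row.

```matlab
function out = stripByQuoteParity(line)
% Hypothetical helper (assumption, not from the post): cut the line at the
% first '%' that is preceded by an even number of apostrophes.
% Known weakness: a transpose such as A' or A.' flips the parity.
out = line;
for p = find(line == '%')              % candidate comment positions, left to right
    nQuotes = nnz(line(1:p-1) == '''');  % apostrophes before this '%'
    if mod(nQuotes, 2) == 0            % even count: assume we are outside a string
        out = line(1:p-1);
        return
    end
end
end
```

Saved as stripByQuoteParity.m, it could be applied to the cell array with:

>> C = cellfun(@stripByQuoteParity, str, 'UniformOutput', false)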

Unfortunately I've exhausted the amount of time I can spend thinking about this and need to continue with other things, so perhaps someone else who has more time is friendly enough to think about these questions:

  1. Are comment characters inside strings the only exception I need to look out for?
  2. What is the correct and/or more efficient way to do this?
Rody Oldenhuis asked Jun 28 '13 07:06



2 Answers

How do you feel about using undocumented features? If you don't object, you can use the mtree function to parse the code and strip the comments. No regexps involved, and we all know that we shouldn't try to parse context-free grammars using regular expressions.

This function is a full parser of MATLAB code written in pure M-code. As far as I can tell, it is an experimental implementation, but it's already used by MathWorks in a few places (it is the same function used by MATLAB Cody and Contests to measure code length), and it can be used for other useful things.

If the input is a cell array of strings, we do:

>> str = {..};
>> C = deblank(cellfun(@(s) tree2str(mtree(s)), str, 'UniformOutput',false))
C = 
    'myFun( { 'test', '%' } );'
    'sprintf( str, '%*8.0f%*s%c%3d\n' );'
    'sprintf( str, '%*8.0f%*s%c%3d\n' );'
    'sprintf( str, '%*8.0f%*s%c%3d\n' );'
    'A = A.';'

If you already have an M-file stored on disk, you can strip the comments simply as:

s = tree2str(mtree('myfile.m', '-file'))

If you want the comments kept in the output, add the '-comments' option: mtree(.., '-comments')

Amro answered Oct 28 '22 18:10



This handles the transpose case by checking which characters are allowed directly before a transpose apostrophe:

  1. Numbers 2'
  2. Letters A'
  3. Dot A.'
  4. Left parenthesis, brace and bracket A(1)', A{1}' and [1 2 3]'

These are the only cases I can think of now.

C = regexprep(str, '^(([^'']*''[^'']*''|[^'']*[\.a-zA-Z0-9\)\}\]]''[^'']*)*[^'']*)%.*$', '$1')

On your example it returns

>> C = regexprep(str, '^(([^'']*''[^'']*''|[^'']*[\.a-zA-Z0-9\)\}\]]''[^'']*)*[^'']*)%.*$', '$1')

C = 

    'myFun( {'test' '%'}); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n');  '
    'A = A.';'
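
The same "what may precede a transpose apostrophe" rule can also be sketched as a small character scanner instead of a single regex, which some may find easier to extend. This is only an illustration under stated assumptions: stripTrailingComment is a hypothetical helper (saved as stripTrailingComment.m), it looks at one line at a time, and it does not handle block comments (%{ ... %}), line continuations (...), command syntax, or double-quoted strings.

```matlab
function out = stripTrailingComment(line)
% Hypothetical helper (assumption, not from the answer): scan one line of
% MATLAB code and cut it at the first '%' that lies outside a string.
out = line;
inString = false;
n = numel(line);
k = 1;
while k <= n
    c = line(k);
    if inString
        if c == ''''
            if k < n && line(k+1) == ''''
                k = k + 1;            % doubled apostrophe: escaped quote, still in string
            else
                inString = false;     % closing string delimiter
            end
        end
    elseif c == ''''
        % Directly after a digit, letter, underscore, dot, closing bracket or
        % another transpose, the apostrophe is treated as a transpose operator.
        isTranspose = k > 1 && ...
            ~isempty(regexp(line(k-1), '[\w.)\]}'']', 'once'));
        inString = ~isTranspose;      % otherwise it opens a string
    elseif c == '%'
        out = line(1:k-1);            % comment starts here: drop the rest
        return
    end
    k = k + 1;
end
end
```

It could be applied to the example with:

>> C = cellfun(@stripTrailingComment, str, 'UniformOutput', false)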
Mohsen Nosratinia answered Oct 28 '22 18:10
