
Matlab Regexp for nested groups and captured tokens

Is there a way to capture tokens inside a non-captured group in Matlab regular expressions? Here is the specific problem:

InputString = 'Identifiers: 10  12  1   3   8   6   4   2'

Expression = 'Identifiers:\s(?:(\d*)\t?)+'
regexp(InputString, Expression, 'tokens')

I need to find the numbers after 'Identifiers:'. The string InputString is part of a large character array with lines before and after this line, separated by \r\n characters. The character after the colon is a whitespace character, and the numbers are separated by tabs. The last number has no trailing tab. The count of numbers can vary, but there is always at least one, and they are all integers with one or more digits.

My idea for Expression was this: identify the line with Identifiers:\s, match each number of one or more digits as a captured token with an optional trailing tab using (\d*)\t?, and repeat this one or more times with +. To repeat the digit part of the expression, I need to put it in a group. I don't want to capture the token of the outer group (?:(\d*)\t?), only the token of the inner group (\d*). That's why I used ?:. When I remove ?:, only one token containing 1012138642 is returned.

Isn't it possible to capture tokens inside a non-capturing group? Do you have any solution to return the numbers in a single statement?

In my current solution I find the line by

Expression = 'Identifiers:.+?\r\n'
Line = regexp(InputString, Expression, 'match')

and get the digits with

regexp(Line, '(\d+)\t+', 'tokens')

I have spent so much time looking for a single-statement solution that I now really need to know whether it is possible or not. I am not sure if I am thinking about this wrong, my head is not working as intended, or it's just impossible.

Lenwo asked Nov 17 '25 06:11


1 Answer

MATLAB doesn't support nested tokens, even if you mark the outer group as non-capturing.
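Since a repeated inner group won't give you one token per number, one workaround is to capture the rest of the line as a single token and parse that token afterwards. The following is only a sketch of that idea, not a single-regexp solution: the 'Header' and 'Footer' lines and the variable names txt, tok and nums are made up for illustration, and the input assumes the layout described in the question (CRLF line endings, tab-separated integers).

```matlab
% Hypothetical sample input: CRLF-separated lines, tab-separated numbers.
txt = sprintf('Header 99\r\nIdentifiers: 10\t12\t1\t3\r\nFooter 7\r\n');

% Capture everything after the label up to the end of the line as ONE token.
% With 'once', regexp returns a cell array holding that single token.
tok = regexp(txt, 'Identifiers:\s([^\r\n]*)', 'tokens', 'once');

% Parse the whitespace-separated integers. sscanf returns a column vector,
% so transpose it to get a row vector.
nums = sscanf(tok{1}, '%d').';
disp(nums)
```

Because sscanf treats any whitespace as a separator, this should behave the same whether the numbers are separated by tabs or by spaces.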

Starting in R2016b, there are some new text manipulation functions that make this easier:

str = "Identifiers: 10  12  1   3   8   6   4   2" + newline + "Blah";

str = str.extractBetween("Identifiers: ",newline).split

str = 

  8×1 string array

    "10"
    "12"
    "1"
    "3"
    "8"
    "6"
    "4"
    "2"

If your goal is one statement with regexp, using split might get you closer.

str = regexp(str,'(?<=Identifiers[^\n]*)\s*(?=[^\n]*)','split')

str = 

  1×10 string array

    "Identifiers:"    "10"    "12"    "1"    "3"    "8"    "6"    "4"    "2"    "Blah"
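If you want only the numbers, without the label and the text from other lines, a variation on the same lookbehind idea is to use 'match' instead of 'split'. This is a sketch under the assumption that the lookbehind may contain [^\r\n]*, as the split expression above already relies on with [^\n]*. The sample input and the names txt, numStr and nums are hypothetical.

```matlab
% Hypothetical sample input: CRLF-separated lines, tab-separated numbers.
txt = sprintf('Header 99\r\nIdentifiers: 10\t12\t1\t3\r\nFooter 7\r\n');

% Match each run of digits that is preceded, on the same line, by the label.
% [^\r\n]* keeps the lookbehind from crossing a line break, so digits on
% other lines are not matched.
numStr = regexp(txt, '(?<=Identifiers:[^\r\n]*)\d+', 'match');

% Convert the cell array of digit strings to a numeric row vector.
nums = str2double(numStr);
disp(nums)
```

The regexp call is a single statement, and the result is a cell array of character vectors that contains only the identifiers.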
matlabbit answered Nov 18 '25 20:11


