Regex Group in Perl: how to capture elements into array from regex group that matches unknown number of/multiple/variable occurrences from a string?

Tags: regex, match, perl, grouping

In Perl, how can I use one regex grouping to capture more than one occurrence that matches it, into several array elements?

For example, for a string:

var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello

to process this with code:

$string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

my @array = $string =~ <regular expression here>

for ( my $i = 0; $i < scalar( @array ); $i++ ) {
  print $i.": ".$array[$i]."\n";
}

I would like to see as output:

0: var1=100
1: var2=90
2: var5=hello
3: var3="a, b, c"
4: var7=test
5: var3=hello

What would I use as a regex?

The commonality between things I want to match here is an assignment string pattern, so something like:

my @array = $string =~ m/(\w+=[\w\"\,\s]+)*/;

Where I intend the * to mean repeated occurrences matching the group (strictly, * matches zero or more occurrences; + matches one or more).

(I discounted using a split() as some matches contain spaces within themselves (i.e. var3...) and would therefore not give desired results.)

With the above regex, I only get:

0: var1=100 var2

Is this possible with a single regex, or is additional code required?

I have already looked at existing answers by searching for "perl regex multiple group", but they did not give me enough clues:

Dealing with multiple capture groups in multiple records
Multiple matches within a regex group?
Regex: Repeated capturing groups
Regex match and grouping
How do I regex match with grouping with unknown number of groups
awk extract multiple groups from each line
Matching multiple regex groups and removing them
Perl: Deleting multiple reccuring lines where a certain criterion is met
Regex matching into multiple groups per line?
PHP RegEx Grouping Multiple Matches
How to find multiple occurrences with regex groups?

asked Aug 11 '10 14:08

therobyouknow

2 Answers

my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

while($string =~ /(?:^|\s+)(\S+)\s*=\s*("[^"]*"|\S*)/g) {
        print "<$1> => <$2>\n";
}

Prints:

<var1> => <100>
<var2> => <90>
<var5> => <hello>
<var3> => <"a, b, c">
<var7> => <test>
<var3> => <hello>

Explanation:

Last piece first: the g flag at the end means that you can apply the regex to the string multiple times. The second time it will continue matching where the last match ended in the string.
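
As a minimal sketch of that resume-where-you-left-off behavior, the loop below uses a shortened, hypothetical input and Perl's built-in pos() to report the offset just past each match. The variable names are illustrative and do not come from the answer.

```perl
use strict;
use warnings;

# Hypothetical, shortened input for illustration.
my $text = "a=1 b=2 c=3";
my @seen;

# In scalar context, each successful /g match resumes where the previous one ended.
while ($text =~ /(\w+)=(\d+)/g) {
    # pos() returns the offset just past the end of the current match.
    push @seen, "$1=$2 ends at " . pos($text);
}

print "$_\n" for @seen;
```

Each pass through the loop sees the next pair, and the loop ends once no further match can be found.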

Now for the regex: (?:^|\s+) matches either the beginning of the string or a run of one or more whitespace characters. This is needed so that when the regex is applied again, it skips the spaces between the key/value pairs. The ?: means that the contents of the parentheses won't be captured as a group (we don't need the spaces, only the key and value). \S+ matches the variable name. Then we skip any amount of whitespace around the equals sign. Finally, ("[^"]*"|\S*) matches the value: either two double quotes with any number of non-quote characters in between, or any number of non-space characters. Note that the quote matching is pretty fragile and won't handle escaped quotes properly, e.g. "\"quoted\"" would result in "\".
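
The same pattern can be spread over several lines with the /x modifier, which may make the pieces easier to follow. This is only a sketch: the shortened input and the @pairs array are assumptions for illustration, and the regex itself is unchanged apart from the added whitespace and comments.

```perl
use strict;
use warnings;

# Shortened, hypothetical input for illustration.
my $string = 'var1=100 var3="a, b, c" var7=test';

my @pairs;
while ($string =~ /
        (?:^|\s+)          # start of string, or the spaces between pairs
        (\S+)              # capture 1: the variable name
        \s*=\s*            # an equals sign with optional surrounding spaces
        ( "[^"]*" | \S* )  # capture 2: a quoted value, or a run of non-spaces
    /gx) {
    push @pairs, [$1, $2];
}

print "<$_->[0]> => <$_->[1]>\n" for @pairs;
```

Under /x, unescaped whitespace and # comments inside the pattern are ignored, so the matching behavior should be the same as in the compact form.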

EDIT:

Since you really want to get the whole assignment, and not the single keys/values, here's a one-liner that extracts those:

my @list = $string =~ /(?:^|\s+)((?:\S+)\s*=\s*(?:"[^"]*"|\S*))/g;
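
The one-liner inherits the escaped-quote limitation mentioned in the explanation. The sketch below compares it with a variant that tolerates backslash-escaped quotes. The variant is an assumption, not part of the answer, and the input string is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input containing backslash-escaped quotes inside a value.
my $string = 'var1=100 var3="say \"hi\"" var7=test';

# The one-liner from the answer, unchanged.
my @simple = $string =~ /(?:^|\s+)((?:\S+)\s*=\s*(?:"[^"]*"|\S*))/g;

# Assumed variant: inside the quotes, accept a backslash plus any character,
# or any character that is neither a quote nor a backslash.
my @tolerant = $string =~ /(?:^|\s+)((?:\S+)\s*=\s*(?:"(?:\\.|[^"\\])*"|\S*))/g;

print "simple:   $_\n" for @simple;
print "tolerant: $_\n" for @tolerant;
```

With this input, the original pattern cuts the quoted value at the first escaped quote, while the variant keeps the whole quoted value together.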

answered Sep 20 '22 05:09

jkramer

With regular expressions, use a technique that I like to call tack-and-stretch: anchor on features you know will be there (tack) and then grab what's between (stretch).

In this case, you know that a single assignment matches

\b\w+=.+

and you have many of these repeated in $string. Remember that \b means word boundary:

A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.
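
To see where those boundaries fall, one option is to insert a visible marker at every \b position. This is a small illustrative sketch; the input and the $marked variable are assumptions and not part of the answer.

```perl
use strict;
use warnings;

# Hypothetical, shortened input for illustration.
my $text = "var1=100 var2=90";

# \b is zero-width, so the substitution inserts a marker without consuming text.
(my $marked = $text) =~ s/\b/|/g;

print "$marked\n";   # |var1|=|100| |var2|=|90|
```

A marker appears on both sides of each name and each value, because = and the space are \W characters.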

The values in the assignments can be a little tricky to describe with a regular expression, but you also know that each value will terminate with whitespace—although not necessarily the first whitespace encountered!—followed by either another assignment or end-of-string.

To avoid repeating the assignment pattern, compile it once with qr// and reuse it in your pattern along with a look-ahead assertion (?=...) to stretch the match just far enough to capture the entire value while also preventing it from spilling into the next variable name.

Matching against your pattern in list context with m//g gives the following behavior:

The /g modifier specifies global pattern matching—that is, matching as many times as possible within the string. How it behaves depends on the context. In list context, it returns a list of the substrings matched by any capturing parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern.
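
A short sketch of the two list-context cases described in that passage, using a simplified, hypothetical input. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical, shortened input for illustration.
my $text = "a=1 b=2 c=3";

# No capturing parentheses: each element is one whole match.
my @whole = $text =~ /\w+=\d+/g;      # ("a=1", "b=2", "c=3")

# Two capturing groups: the captures of every match are returned as one flat list.
my @parts = $text =~ /(\w+)=(\d+)/g;  # ("a", "1", "b", "2", "c", "3")

# A flat key/value list can initialise a hash directly.
my %value_of = @parts;

print "whole: @whole\n";
print "parts: @parts\n";
print "b is $value_of{b}\n";
```

For the question's goal of one array element per assignment, the first form (or a single capturing group around the whole assignment) is the one that applies.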

The pattern $assignment uses non-greedy .+? to cut off the value as soon as the look-ahead sees another assignment or the end of the string. Remember that the match returns the substrings from all capturing subpatterns, so the look-ahead's alternation uses non-capturing (?:...). An interpolated qr// object is likewise wrapped in a non-capturing group, so the overall pattern contains no capturing parentheses and m//g returns each whole match, which is exactly one assignment.

#! /usr/bin/perl

use warnings;
use strict;

my $string = <<'EOF';
var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello
EOF

my $assignment = qr/\b\w+ = .+?/x;
my @array = $string =~ /$assignment (?= \s+ (?: $ | $assignment))/gx;

for ( my $i = 0; $i < scalar( @array ); $i++ ) {
  print $i.": ".$array[$i]."\n";
}

Output:

0: var1=100
1: var2=90
2: var5=hello
3: var3="a, b, c"
4: var7=test
5: var3=hello

answered Sep 17 '22 05:09

Greg Bacon

