
Perl - Regex to extract only the comma-separated strings

I have a question I am hoping someone could help with...

I have a variable that contains the content from a webpage (scraped using WWW::Mechanize).

The variable contains data such as these:

$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"

The only bits I am interested in from the above examples are:

@array = ("cat_dog","horse","rabbit","chicken-pig")
@array = ("elephant","MOUSE_RAT","spider","lion-tiger") 
@array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")

The problem I am having:

I am trying to extract only the comma-separated strings from the variables and then store these in an array for use later on.

But what is the best way to make sure that I get the strings at the start (i.e. cat_dog) and end (i.e. chicken-pig) of the comma-separated list of animals, given that they are not prefixed/suffixed with a comma?

Also, as the variables will contain webpage content, there will inevitably be instances where a comma is immediately followed by a space and then another word, as that is the correct way of using commas in paragraphs and sentences...

For example:

Saturn was long thought to be the only ringed planet, however, this is now known not to be the case. 
                                                     ^        ^
                                                     |        |
                                    note the spaces here and here

I am not interested in any cases where the comma is followed by a space (as shown above).

I am only interested in cases where the comma DOES NOT have a space after it (i.e. cat_dog,horse,rabbit,chicken-pig).

I have tried a number of ways of doing this but cannot work out the best way to go about constructing the regular expression.

yonetpkbji asked Dec 26 '22 04:12


2 Answers

How about

[^,\s]+(,[^,\s]+)+

which will match one or more characters that are neither whitespace nor a comma ([^,\s]+), followed one or more times by a comma and another run of one or more such characters.
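
As a quick check against the sample strings from the question, here is a minimal sketch. The helper name extract_list is made up for illustration, and it returns only the first comma-separated run found in each string. The inner group is written as (?:...) so that $1 holds the whole run.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer): return the items of the
# first comma-separated run in $text, or an empty list if there is none.
sub extract_list {
    my ($text) = @_;
    # (?:...) groups without capturing; the outer (...) captures the whole run.
    return $text =~ /([^,\s]+(?:,[^,\s]+)+)/ ? split(/,/, $1) : ();
}

my @samples = (
    "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
    "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
    "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
    "Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
);

for my $sample (@samples) {
    my @items = extract_list($sample);
    print scalar(@items), " item(s): @items\n";
}
```

The Saturn sentence yields no items, because each of its commas is followed by a space and so can never be followed by [^,\s]+.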

Further to the comments:

To match more than one sequence, add the g modifier for global matching.
The following splits each match ($&) on commas and pushes the results onto @matches.

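my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my @matches;

# With /g in a while loop, each iteration finds the next comma-separated run;
# $& holds the text of the current match.
while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
    push(@matches, split(/,/, $&));
}

print join("\n",@matches),"\n";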
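
A variant of the same loop, offered as a sketch rather than a correction: wrapping the whole pattern in a capturing group and reading $1 avoids $&, which perlvar documents as slowing regex matching across the whole program on Perl versions before 5.20. The inner group is made non-capturing with (?:...) so that $1 holds the full run.

```perl
use strict;
use warnings;

my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my @matches;

# Each iteration finds the next comma-separated run and leaves it in $1.
while ($str =~ /([^,\s]+(?:,[^,\s]+)+)/g) {
    push @matches, split /,/, $1;
}

print join("\n", @matches), "\n";
```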
MikeM answered Jan 15 '23 03:01


Though you can probably construct a single regex, a combination of regexes, split, grep and map reads quite well:

my @array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split;

Going from right to left:

  1. Split the line on whitespace (split with no arguments works on $_)
  2. Keep only the elements that have no comma at either end but have one inside (grep)
  3. Split each such element into its parts (map and split)

That way you can easily adjust each part, e.g. to reject elements containing two consecutive commas, add && !/,,/ inside the grep block.
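
To try the one-liner on a concrete string, here is a minimal sketch. The bare split works on $_, so the example aliases $_ to the line with a for block. The sample string is invented for illustration, and the grep includes the extra !/,,/ test.

```perl
use strict;
use warnings;

# Invented sample: one good list, a token with a trailing comma,
# a token with a leading comma, and a token with a doubled comma.
my $line = "foo cat_dog,horse,rabbit planet, ,stray a,,b bar";
my @array;

for ($line) {    # aliases $_ to $line so the bare split sees it
    @array = map  { split /,/ }
             grep { !/^,/ && !/,$/ && !/,,/ && /,/ }
             split;
}

print join("\n", @array), "\n";
```

Only cat_dog,horse,rabbit survives the grep, so @array ends up holding those three items.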

pwes answered Jan 15 '23 05:01