Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I extract substrings from a string in Perl?

Tags:

string

regex

perl

Consider the following strings:

1) Scheme ID: abc-456-hu5t10 (High priority) *****

2) Scheme ID: frt-78f-hj542w (Balanced)

3) Scheme ID: 23f-f974-nm54w (super formula run) *****

and so on in the above format - the parts in bold are changes across the strings.

==> Imagine I've many strings of format Shown above. I want to pick 3 substrings (As shown in BOLD below) from the each of the above strings.

  • 1st substring containing the alphanumeric value (in eg above it's "abc-456-hu5t10")
  • 2nd substring containing the word (in eg above it's "High priority")
  • 3rd substring containing * (IF * is present at the end of the string ELSE leave it )

How do I pick these 3 substrings from each string shown above? I know it can be done using regular expressions in Perl... Can you help with this?

like image 671
stack_pointer is EXTINCT Avatar asked Sep 18 '09 11:09

stack_pointer is EXTINCT


People also ask

How do I extract a substring in Perl?

Perl | substr() function This function by default returns the remaining part of the string starting from the given index if the length is not specified. A replacement string can also be passed to the substr() function if you want to replace that part of the string with some other substring.

How do you get a substring from a string?

To extract a substring that begins at a specified character position and ends before the end of the string, call the Substring(Int32, Int32) method. This method does not modify the value of the current instance. Instead, it returns a new string that begins at the startIndex position in the current string.

How do I remove the first and last character from a string in Perl?

We can use the chop() method to remove the last character of a string in Perl. This method removes the last character present in a string. It returns the character that is removed from the string, as the return value.


4 Answers

You could do something like this:

my $data = <<END;
1) Scheme ID: abc-456-hu5t10 (High priority) *
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *
END

foreach (split(/\n/,$data)) {
  $_ =~ /Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?/ || next;
  my ($id,$word,$star) = ($1,$2,$3);
  print "$id $word $star\n";
}

The key thing is the Regular expression:

Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?

Which breaks up as follows.

The fixed String "Scheme ID: ":

Scheme ID: 

Followed by one or more of the characters a-z, 0-9 or -. We use the brackets to capture it as $1:

([a-z0-9-]+)

Followed by one or more whitespace characters:

\s+

Followed by an opening bracket (which we escape) followed by any number of characters which aren't a close bracket, and then a closing bracket (escaped). We use unescaped brackets to capture the words as $2:

\(([^)]+)\)

Followed by some spaces any maybe a *, captured as $3:

\s*(\*)?
like image 159
Dave Webb Avatar answered Sep 20 '22 22:09

Dave Webb


You could use a regular expression such as the following:

/([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/

So for example:

$s = "abc-456-hu5t10 (High priority) *";
$s =~ /([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/;
print "$1\n$2\n$3\n";

prints

abc-456-hu5t10
High priority
*
like image 38
Greg Hewgill Avatar answered Sep 22 '22 22:09

Greg Hewgill


(\S*)\s*\((.*?)\)\s*(\*?)


(\S*)    picks up anything which is NOT whitespace
\s*      0 or more whitespace characters
\(       a literal open parenthesis
(.*?)    anything, non-greedy so stops on first occurrence of...
\)       a literal close parenthesis
\s*      0 or more whitespace characters
(\*?)    0 or 1 occurances of literal *
like image 23
Xetius Avatar answered Sep 20 '22 22:09

Xetius


Well, a one liner here:

perl -lne 'm|Scheme ID:\s+(.*?)\s+\((.*?)\)\s?(\*)?|g&&print "$1:$2:$3"' file.txt

Expanded to a simple script to explain things a bit better:

#!/usr/bin/perl -ln              

#-w : warnings                   
#-l : print newline after every print                               
#-n : apply script body to stdin or files listed at commandline, dont print $_           

use strict; #always do this.     

my $regex = qr{  # precompile regex                                 
  Scheme\ ID:      # to match beginning of line.                      
  \s+              # 1 or more whitespace                             
  (.*?)            # Non greedy match of all characters up to         
  \s+              # 1 or more whitespace                             
  \(               # parenthesis literal                              
    (.*?)            # non-greedy match to the next                     
  \)               # closing literal parenthesis                      
  \s*              # 0 or more whitespace (trailing * is optional)    
  (\*)?            # 0 or 1 literal *s                                
}x;  #x switch allows whitespace in regex to allow documentation.   

#values trapped in $1 $2 $3, so do whatever you need to:            
#Perl lets you use any characters as delimiters, i like pipes because                    
#they reduce the amount of escaping when using file paths           
m|$regex| && print "$1 : $2 : $3";

#alternatively if(m|$regex|) {doOne($1); doTwo($2) ... }     

Though if it were anything other than formatting, I would implement a main loop to handle files and flesh out the body of the script rather than rely ing on the commandline switches for the looping.

like image 29
liam Avatar answered Sep 19 '22 22:09

liam