Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to extract the text between two patterns using REGEX perl

Tags:

regex

perl

In the following lines how can I store the lines between "Description:" and "Tag:" in a variable using REGEX PERL and what would be a good datatype to use, string or list or something else?

(I am trying to write a program in Perl to extract the information of a text file with Debian package information and convert it into a RDF(OWL) file(ontology).)

Description: library for decoding ATSC A/52 streams (development) liba52 is a free library for decoding ATSC A/52 streams. The A/52 standard is used in a variety of applications, including digital television and DVD. It is also known as AC-3.

This package contains the development files. Homepage: http://liba52.sourceforge.net/

Tag: devel::library, role::devel-lib

The code I have written so far is:

#!/usr/bin/perl
open(DEB,"Packages");
open(ONT,">>debianmodelling.txt");

$i=0;
while(my $line = <DEB>)
{

    if($line =~ /Package/)
    {
        $line =~ s/Package: //;
        print ONT '  <package rdf:ID="instance'.$i.'">';
        print ONT    '    <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</name>'."\n";
    }
elsif($line =~ /Priority/)
{
    $line =~ s/Priority: //;
    print ONT '    <priority rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</priority>'."\n";
}

elsif($line =~ /Section/)
{
    $line =~ s/Section: //;
    print ONT '    <Section rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Section>'."\n";
}

elsif($line =~ /Maintainer/)
{
    $line =~ s/Maintainer: //;
    print ONT '    <maintainer rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</maintainer>'."\n";
}

elsif($line =~ /Architecture/)
{
    $line =~ s/Architecture: //;
    print ONT '    <architecture rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</architecture>'."\n";
}
elsif($line =~ /Version/)
{
    $line =~ s/Version: //;
    print ONT '    <version rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</version>'."\n";
}
elsif($line =~ /Provides/)
{
    $line =~ s/Provides: //;
    print ONT '    <provides rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</provides>'."\n";
}
elsif($line =~ /Depends/)
{
    $line =~ s/Depends: //;
    print ONT '    <depends rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</depends>'."\n";
}
elsif($line =~ /Suggests/)
{
    $line =~ s/Suggests: //;
    print ONT '    <suggests rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</suggests>'."\n";
}

elsif($line =~ /Description/)
{
    $line =~ s/Description: //;
    print ONT '    <Description rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Description>'."\n";
}
elsif($line =~ /Tag/)
{
    $line =~ s/Tag: //;
    print ONT '    <Tag rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Tag>'."\n";
    print ONT '  </Package>'."\n\n";
}
$i=$i+1;
}
like image 428
codious Avatar asked Jun 04 '11 16:06

codious


1 Answers

my $desc = "Description:";
my $tag  = "Tag:";

$line =~ /$desc(.*?)$tag/;
my $matched = $1;
print $matched;

or


my $desc = "Description:";
my $tag  = "Tag:";

my @matched = $line =~ /$desc(.*?)$tag/;
print $matched[0];

or


my $desc = "Description:";
my $tag  = "Tag:";

(my $matched = $line) =~ s/$desc(.*?)$tag/$1/;
print $matched;

Additional


If your Description and Tag may be on separate lines, you may need to use the /s modifier, to treat it as a single line, so the \n won't wreck it. Example:

$_=qq{Description:foo 
      more description on 
      new line Tag: some
      tag};
s/Description:(.*?)Tag:/$1/s; #notice the trailing slash
print;
like image 196
vol7ron Avatar answered Oct 05 '22 12:10

vol7ron