
Extract substrings from a string using regex in Perl?

Tags:

regex

perl

I am trying to extract the substrings that match a pattern in a string. For example, I have text like the one below:

[ Pierre/NNP Vinken/NNP ]
,/, 
[ 61/CD years/NNS ]
old/JJ ,/, will/MD join/VB 
[ the/DT board/NN ]
as/IN 
[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ]
./. 
[ Mr./NNP Vinken/NNP ]
is/VBZ 
[ chairman/NN ]
of/IN 

I want to extract whatever comes before the slash (/) and whatever comes after it, but somehow my regex extracts only the first substring and ignores the rest of the substrings in the line.

My output looks like this:

tag:Pierre/NNP Vinken - word:Pierre/NNP Vinken/NNP ->1
tag:, - word:,/, ->1
tag:61/CD years - word:61/CD years/NNS ->1
tag:old/JJ ,/, will/MD join - word:old/JJ ,/, will/MD join/VB ->1
tag:the/DT board - word:the/DT board/NN ->1
tag:as - word:as/IN ->1
tag:a/DT nonexecutive/JJ director/NN Nov./NNP 29 - word:a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ->1
tag:. - word:./. ->1
tag:Mr./NNP Vinken - word:Mr./NNP Vinken/NNP ->1
tag:is - word:is/VBZ ->1
tag:chairman - word:chairman/NN ->1
tag:of - word:of/IN ->1

But what I actually want is something like this:

tag:NNP  - word:Pierre ->1
tag:NNP  - word:Vinken ->1
tag:,    - word:,      ->1
tag:CD   - word:61     ->1
.
.
etc.

The code I used:

    while (my $line = <$fh>) {
        chomp $line;
        #remove square brackets
        $line=~s/[\[\]]//;

        while($line =~m/((\s*(.*))\/((.*)\s+))/gi)
        {
            $word=$1;
            $tag=$2;
            #remove whitespace from left and right of string
            $word=~ s/^\s+|\s+$//g;
            $tag=~ s/^\s+|\s+$//g;
            $tags{$tag}++;
            $tagHash{$tag}{$word}++;
        }

    }
    foreach my $str (sort keys %tagHash)
    {
        foreach my $s (keys %{$tagHash{$str}} )
        {
            print "tags:$str - word: $s-> $tagHash{$str}{$s}\n";
        }
    }

Any idea why my regex does not behave as it should?

EDIT:

The text files I am parsing also contain special characters and punctuation, which means the files will have tokens like these: ''/'' "/" ,/, ./. ?/? !/! ... etc.

So I want to capture all of these, not only alphabetic and numeric characters.

kero asked Oct 18 '22 15:10


1 Answer

It looks like each token has the form word/tag, where both the word and the tag may contain any character except ], [ and whitespace (\s):

    \s*([^\[\]\s]+?)\/([^\[\]\s]+)\s*
        ^^^^^^^^^1

This regex is similar to your original pattern. (See DEMO)
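To see why the original pattern returns one long match per line, here is a small sketch comparing the two approaches on one sample line. The greedy pattern below is a simplified form of the one in the question (the nested groups are flattened), and the variable names are only for illustration. A greedy `.*` runs to the end of the line and backtracks only as far as the last slash, whereas a character class that excludes brackets and whitespace has to stop at each token boundary.

```perl
use strict;
use warnings;

my $line = '[ Pierre/NNP Vinken/NNP ]';

# Simplified form of the original pattern: the greedy .* before the slash
# swallows everything up to the LAST slash, so the line yields one match.
my @greedy = $line =~ m{\s*(.*)/(.*)\s+}g;

# Character-class pattern: neither side may contain brackets or whitespace,
# so every word/tag token is matched on its own.
my @tokens = $line =~ m{([^\[\]\s]+?)/([^\[\]\s]+)}g;

print "greedy: $_\n" for @greedy;   # two captures from a single match
print "token:  $_\n" for @tokens;   # word, tag, word, tag
```

Note also that in the question's code `$1` is the outermost group (the whole match) and `$2` is the part before the slash, which is why the "word" and "tag" columns in the output look swapped and overlapping.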

Description:

1- This capturing group matches one or more characters, each of which is not [, ] or whitespace (\s).
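Putting it together, a minimal sketch of the counting loop could look like the following. The helper name `extract_pairs` and the in-memory `@sample` lines are assumptions made for the example; in the real script the lines would come from the file handle as in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [word, tag] pairs found in one line.
sub extract_pairs {
    my ($line) = @_;
    my @pairs;
    # No separate bracket removal is needed: the character classes
    # exclude [ and ], so brackets are simply skipped between matches.
    while ($line =~ m{([^\[\]\s]+?)/([^\[\]\s]+)}g) {
        push @pairs, [$1, $2];   # $1 = word (before /), $2 = tag (after /)
    }
    return @pairs;
}

# Sample lines taken from the question, standing in for <$fh>.
my @sample = (
    '[ Pierre/NNP Vinken/NNP ]',
    ',/, ',
    '[ 61/CD years/NNS ]',
);

my %tagHash;
for my $line (@sample) {
    for my $pair (extract_pairs($line)) {
        my ($word, $tag) = @$pair;
        $tagHash{$tag}{$word}++;
    }
}

# Print one line per tag/word combination, padded into columns.
for my $tag (sort keys %tagHash) {
    for my $word (sort keys %{ $tagHash{$tag} }) {
        printf "tag:%-4s - word:%-6s ->%d\n", $tag, $word, $tagHash{$tag}{$word};
    }
}
```

Punctuation tokens such as `,/,` or `?/?` are handled the same way as words, since the classes only exclude brackets and whitespace. One caveat: a token whose word itself contains a slash would be split at the first slash, so that case would need a different pattern.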

MohaMad answered Oct 21 '22 00:10