I am trying to extract all substrings that match a pattern in a string. For example, I have text like the one below:
[ Pierre/NNP Vinken/NNP ]
,/,
[ 61/CD years/NNS ]
old/JJ ,/, will/MD join/VB
[ the/DT board/NN ]
as/IN
[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ]
./.
[ Mr./NNP Vinken/NNP ]
is/VBZ
[ chairman/NN ]
of/IN
I want to extract whatever comes before the slash (/) and whatever comes after it, but my regex only extracts the first substring and ignores the remaining substrings on the line.
My output looks like this:
tag:Pierre/NNP Vinken - word:Pierre/NNP Vinken/NNP ->1
tag:, - word:,/, ->1
tag:61/CD years - word:61/CD years/NNS ->1
tag:old/JJ ,/, will/MD join - word:old/JJ ,/, will/MD join/VB ->1
tag:the/DT board - word:the/DT board/NN ->1
tag:as - word:as/IN ->1
tag:a/DT nonexecutive/JJ director/NN Nov./NNP 29 - word:a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ->1
tag:. - word:./. ->1
tag:Mr./NNP Vinken - word:Mr./NNP Vinken/NNP ->1
tag:is - word:is/VBZ ->1
tag:chairman - word:chairman/NN ->1
tag:of - word:of/IN ->1
But what I actually want is something like this:
tag:NNP - word:Pierre ->1
tag:NNP - word:Vinken ->1
tag:, - word:, ->1
tag:CD - word:61 ->1
.
.
etc.
The code I used:
while (my $line = <$fh>) {
    chomp $line;
    #remove square brackets
    $line=~s/[\[\]]//;
    while($line =~m/((\s*(.*))\/((.*)\s+))/gi)
    {
        $word=$1;
        $tag=$2;
        #remove whitespace from left and right of string
        $word=~ s/^\s+|\s+$//g;
        $tag=~ s/^\s+|\s+$//g;
        $tags{$tag}++;
        $tagHash{$tag}{$word}++;
    }
}
foreach my $str (sort keys %tagHash)
{
    foreach my $s (keys %{$tagHash{$str}} )
    {
        print "tags:$str - word: $s-> $tagHash{$str}{$s}\n";
    }
}
Any idea why my regex does not behave as it should?
EDIT:
The text files I am parsing also contain special characters and punctuation, which means they will have tokens like this: ''/'' "/" ,/, ./. ?/? !/! etc.
So I want to capture all of these, not only alphabetic and numeric characters.
I think you have word/tag pairs in which the word and the tag may contain any character except ], [ and whitespace (\s):
\s*([^\[\]\s]+?)\/([^\[\]\s]+)\s*
^^^^^^^^^1
This regex is similar to your original pattern. (See DEMO)
Description:
1- This capturing group matches one or more characters, none of which is [, ] or \s.
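To show how this pattern could be used in the original loop, here is a minimal sketch. The helper name extract_pairs and the hard-coded sample lines are assumptions made for illustration; the hash names are taken from the question. It also reflects two things visible in the question's code: the greedy .* runs up to the last slash on the line, so only one match per line is found, and $1 there is the outermost group (the whole match) rather than the word.

```perl
use strict;
use warnings;

# Hypothetical helper: return every word/tag pair found in one line
# as [word, tag] array references.
sub extract_pairs {
    my ($line) = @_;
    my @pairs;
    # Same character classes as the pattern above: neither side may contain
    # '[', ']' or whitespace, so each match stays inside a single token.
    # The /g flag lets the while loop walk through all matches on the line.
    while ($line =~ m{([^\[\]\s]+?)/([^\[\]\s]+)}g) {
        push @pairs, [$1, $2];    # $1 is the word, $2 is the tag
    }
    return @pairs;
}

my (%tags, %tagHash);

# Sample lines copied from the question; a real script would read from $fh.
my @sample = (
    '[ Pierre/NNP Vinken/NNP ]',
    ',/,',
    '[ 61/CD years/NNS ]',
);

for my $line (@sample) {
    for my $pair (extract_pairs($line)) {
        my ($word, $tag) = @$pair;
        $tags{$tag}++;
        $tagHash{$tag}{$word}++;
    }
}

for my $tag (sort keys %tagHash) {
    for my $word (sort keys %{ $tagHash{$tag} }) {
        print "tag:$tag - word:$word ->$tagHash{$tag}{$word}\n";
    }
}
```

Because the character classes already exclude brackets and whitespace, the separate bracket-removal and trimming steps are not needed in this sketch. One caveat: a token whose word itself contains a slash would be split at the first slash.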