In a tab-delimited text file, I would like to match only lines containing the value "1" right after the 24th tab.
Right now, the regex I have seems to match what I want, but it breaks when the line doesn't match.
Could you help me improve it?
/(?:.+?\t){24}1/
INT E_63 0 0 u Le Le DET:ART DET le ?? ADJ SENT DET:ART NOM ADV SENT DET NOM 1 ?? ?? ?? ?? ?? 0 0 0 0 0 1 ?? ?? ?? ?? ?? ??
INT E_63 0 0 u Le Le DET:ART DET le ?? ADJ SENT DET:ART NOM ADV SENT DET NOM 1 ?? ?? ?? ?? ?? 0 0 0 0 0 0 ?? ?? ?? ?? ?? ??
(The first line should match, the second should not.)
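Your regex fails on lines that do not match because of catastrophic backtracking: . also matches a tab character, so each .+? can swallow tabs, and the engine has to try a huge number of ways to divide the line among the 24 repetitions before it can give up. Two things make this worse: there is a further subpattern (the 1) after the group with nested quantifiers, which forces the engine back into the group whenever it fails, and there is no ^ anchor, so the whole search is repeated from every later position in the string.
What you need is a negated character class, [^\t], and an anchor at the start of the string: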
/^(?:[^\t]*\t){24}1/
See the regex demo.
NOTE: To match the 1 as a whole word, you might consider adding \b after it, or a lookahead (?!\S).
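As a minimal sketch of how the anchored pattern might be used in Perl, with the (?!\S) lookahead from the note above: the helper name matches_after_24th_tab and the sample lines are made up for illustration and are not taken from the question's file.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the field right after the 24th tab is "1".
# The (?!\S) lookahead rejects longer values such as "10".
sub matches_after_24th_tab {
    my ($line) = @_;
    return $line =~ /^(?:[^\t]*\t){24}1(?!\S)/ ? 1 : 0;
}

# Illustrative prefix: 24 filler fields, each followed by a tab.
my $prefix = join("\t", ("x") x 24) . "\t";

for my $value ("1", "0", "10") {
    my $line = $prefix . $value . "\t??";
    printf "%-2s => %s\n", $value,
        matches_after_24th_tab($line) ? "match" : "no match";
}
```

With these sample lines, only the value 1 is reported as a match. Because [^\t]* cannot cross a tab, a non-matching line fails quickly instead of triggering heavy backtracking.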
Details:
^ - start of a string
(?:[^\t]*\t){24} - 24 sequences of:
    [^\t]* - 0+ chars other than a tab char
    \t - a tab char
1 - a 1 char.

Instead of using a regex, you could just split the line on tabs, check the 25th column (index 24, i.e. the field right after the 24th tab), and then use conditionals.

#!/usr/bin/perl
use strict;
use warnings;

open(my $fh, "<", '/path/to/tab_delem_file') or die "Could not open file: $!";
while (<$fh>) {
    chomp;
    my @line = split /\t/, $_;    # split on tab
    # The field right after the 24th tab has index 24 (zero-based).
    # String comparison avoids "isn't numeric" warnings for values like "??".
    if (defined $line[24] && $line[24] eq '1') {
        # do something
    }
    else {
        # do something else
    }
}
close($fh);
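The split approach can also be tried on in-memory data. The following is a small sketch, assuming the target is the field right after the 24th tab; the sub name field_is_one and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the zero-based field $index of a
# tab-delimited line is exactly the string "1".
sub field_is_one {
    my ($line, $index) = @_;
    # A limit of -1 keeps trailing empty fields instead of dropping them.
    my @fields = split /\t/, $line, -1;
    return (defined $fields[$index] && $fields[$index] eq '1') ? 1 : 0;
}

my @filler  = ("x") x 24;    # fields 0..23
my @samples = (
    join("\t", @filler, "1", "??"),    # field 24 is "1"
    join("\t", @filler, "0", "??"),    # field 24 is "0"
    join("\t", "a", "b"),              # too short: there is no field 24
);

for my $line (@samples) {
    print field_is_one($line, 24) ? "match\n" : "no match\n";
}
```

Unlike the regex without a trailing \b or (?!\S), this check requires the whole field to equal 1, so a value such as 10 does not count as a match.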