In a tab-delimited text file, I would like to match only lines containing the "1" value right after the 24th tab.
Right now, the regex I have seems to match what I want, but breaks when the line doesn't match.
Could you help me improve it?
/(?:.+?\t){24}1/
INT E_63 0 0 u Le Le DET:ART DET le ?? ADJ SENT DET:ART NOM ADV SENT DET NOM 1 ?? ?? ?? ?? ?? 0 0 0 0 0 1 ?? ?? ?? ?? ?? ??
INT E_63 0 0 u Le Le DET:ART DET le ?? ADJ SENT DET:ART NOM ADV SENT DET NOM 1 ?? ?? ?? ?? ?? 0 0 0 0 0 0 ?? ?? ?? ?? ?? ??
(The first line should match, the second should not.)
Your regex does not work when there is no match because of catastrophic backtracking, as the dot (.) also matches a tab character. Coupled with the fact that there are more subpatterns after the group with nested quantifiers, and the absence of the ^ anchor, catastrophic backtracking is inevitable.
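A minimal sketch of why the dot is the problem, using a synthetic 30-field line (the field values and variable names are made up for illustration, not taken from the real file). Because . can consume tabs, one repetition of the group can swallow several fields, so the original pattern can also report a "1" that sits in a later column:

```perl
use strict;
use warnings;

# The dot matches a tab; the negated class does not.
my $dot_matches_tab   = "\t" =~ /\A.\z/     ? 1 : 0;
my $class_matches_tab = "\t" =~ /\A[^\t]\z/ ? 1 : 0;

# Synthetic line (assumption): 30 fields of 'x', with a '1' only at index 26,
# so the field right after the 24th tab (index 24) is NOT '1'.
my @fields = ('x') x 30;
$fields[26] = '1';
my $line = join "\t", @fields;

# The original pattern can stretch one group over several fields.
my $loose_hit  = $line =~ /(?:.+?\t){24}1/     ? 1 : 0;
# With [^\t] and ^ each group is exactly one field, so this fails at once.
my $strict_hit = $line =~ /^(?:[^\t]*\t){24}1/ ? 1 : 0;

print "dot matches tab: $dot_matches_tab, class matches tab: $class_matches_tab\n";
print "loose pattern: $loose_hit, strict pattern: $strict_hit\n";
```

On a line with no "1" at all, the same freedom is what forces the engine to try a very large number of ways of splitting the line before giving up.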
What you need is a negated character class, [^\t], with the pattern anchored at the start of the string:
/^(?:[^\t]*\t){24}1/
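As a rough illustration, the anchored pattern can be used to filter lines held in memory. The helper make_line and the sample values are hypothetical; they only build lines with a known value right after the 24th tab:

```perl
use strict;
use warnings;

my $re = qr/^(?:[^\t]*\t){24}1/;

# Hypothetical helper: a 30-field line whose 25th field (index 24) is $value.
sub make_line {
    my ($value) = @_;
    my @fields = ('x') x 30;
    $fields[24] = $value;
    return join "\t", @fields;
}

my @lines = (
    make_line('1'),     # should match
    make_line('0'),     # should not match
    "too\tshort\t1",    # fewer than 24 tabs, should not match
);

# Keep only the lines where the field after the 24th tab starts with 1.
my @matched = grep { $_ =~ $re } @lines;
print scalar(@matched), " of ", scalar(@lines), " lines matched\n";
```

Note that the pattern only requires the field to start with 1, so a value such as 10 would also pass.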
See the regex demo.
NOTE: To match the 1 as a whole word, you might consider adding \b after it, or a lookahead (?!\S).
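The two options in the note behave differently on values such as 1.5. A small sketch, again with a hypothetical make_line helper and made-up field values:

```perl
use strict;
use warnings;

my $word_re  = qr/^(?:[^\t]*\t){24}1\b/;       # 1 followed by a word boundary
my $exact_re = qr/^(?:[^\t]*\t){24}1(?!\S)/;   # 1 not followed by a non-space char

# Hypothetical helper: a 30-field line whose 25th field (index 24) is $value.
sub make_line {
    my ($value) = @_;
    my @fields = ('x') x 30;
    $fields[24] = $value;
    return join "\t", @fields;
}

my %result;
for my $value ('1', '10', '1.5') {
    my $line = make_line($value);
    $result{$value} = {
        word  => ($line =~ $word_re  ? 1 : 0),
        exact => ($line =~ $exact_re ? 1 : 0),
    };
    printf "%-4s \\b: %d  (?!\\S): %d\n",
        $value, $result{$value}{word}, $result{$value}{exact};
}
```

Both variants reject 10, but \b still accepts 1.5 because there is a word boundary between the 1 and the dot, whereas (?!\S) requires whitespace or the end of the string after the 1.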
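Details:
^ - start of a string
(?:[^\t]*\t){24} - 24 sequences of
    [^\t]* - 0+ chars other than a tab char
    \t - a tab char
1 - a 1 char.

Instead of using a regex, you could just split the line, check the 25th column (index 24, since the value follows the 24th tab), and then use conditionals.
#!/usr/bin/perl
use strict;
use warnings;

open(my $fh, '<', '/path/to/tab_delem_file') or die "Could not open file: $!";
while (my $row = <$fh>) {
    chomp $row;
    my @line = split /\t/, $row, -1;    # split on tab; -1 keeps trailing empty fields
    # String comparison avoids "isn't numeric" warnings on values like "??";
    # the defined check covers lines with fewer than 25 fields.
    if (defined $line[24] && $line[24] eq '1') {
        # do something
    }
    else {
        # do something else
    }
}
close $fh;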