In a tab-delimited text file, I would like to match only the lines that have the value "1" right after the 24th tab.
Right now, the regex I have seems to match what I want, but it breaks when the line doesn't match.
Could you help me improve it?
/(?:.+?\t){24}1/  
INT E_63    0   0   u   Le  Le  DET:ART DET le  ??  ADJ SENT DET:ART NOM ADV    SENT DET NOM    1   ??  ??  ??  ??  ??  0   0   0   0   0   1   ??  ??  ??  ??  ??  ??  
INT E_63    0   0   u   Le  Le  DET:ART DET le  ??  ADJ SENT DET:ART NOM ADV    SENT DET NOM    1   ??  ??  ??  ??  ??  0   0   0   0   0   0   ??  ??  ??  ??  ??  ??  
(The first line should match, the second should not.)
Your regex breaks on non-matching lines because of catastrophic backtracking. Since . also matches a tab character, .+?\t can stretch across several fields, so when the final 1 is not found the engine tries every possible way of dividing the line into 24 groups. The missing ^ anchor makes it worse, because the whole search is then retried from every position in the line.
What you need is a negated character class, [^\t], so that each group cannot cross a tab, with the pattern anchored at the start of the string:
/^(?:[^\t]*\t){24}1/
See the regex demo.
NOTE: To match the 1 as a whole word (so that a field such as 10 does not match), you might consider adding \b after it, or a lookahead (?!\S).
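The anchored pattern and the whole-field idea from the NOTE can be tried with a small script. This is only a sketch: it uses a (?=\t|$) lookahead, which is a variant of the lookahead suggested above rather than the exact pattern from the answer, and make_line is a hypothetical helper that builds a 31-field test line instead of reading the real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Anchored pattern from the answer, plus a lookahead so that the field
# after the 24th tab must be exactly "1" (followed by a tab or end of line).
my $re = qr/^(?:[^\t]*\t){24}1(?=\t|$)/;

# Hypothetical helper: build a 31-field line whose 25th field is $value.
sub make_line {
    my ($value) = @_;
    my @fields = ('x') x 31;
    $fields[24] = $value;    # index 24 is the field after the 24th tab
    return join "\t", @fields;
}

for my $value ('1', '0', '10') {
    my $line = make_line($value);
    printf "%-2s => %s\n", $value, ($line =~ $re ? 'match' : 'no match');
}
```

Only the line built with the value 1 matches. The lines with 0 and 10 are rejected quickly, because [^\t]* cannot cross a tab and there is only one way to consume the first 24 fields.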
Details:
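^ - start of a string
(?:[^\t]*\t){24} - 24 sequences of
  [^\t]* - 0+ chars other than a tab char
  \t - a tab char
1 - a 1 char.

Instead of using a regex, you could just split the line on tabs, check the 25th column (index 24, which is the field right after the 24th tab), and then use conditionals.
#!/usr/bin/perl
use strict;
use warnings;
open(my $fh, "<", '/path/to/tab_delem_file') or die "Could not open file: $!";
while (<$fh>) {
  chomp;
  my @line = split /\t/, $_, -1; # split on tab; -1 keeps trailing empty fields
  # The field after the 24th tab is the 25th column, i.e. index 24.
  # Compare as a string, since fields such as "??" are not numeric.
  if (defined $line[24] && $line[24] eq '1') {
      #do something
  }
  else {
      #do something else
  }
}
close($fh);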
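If only one field is needed, the split can stop early instead of breaking up the whole line. The following is a minimal sketch, not part of the original answer: field_after_tab is a hypothetical helper, and the test line is built in memory rather than read from the file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the field that follows the $n-th tab
# (zero-based index $n), or undef if the line has too few fields.
sub field_after_tab {
    my ($line, $n) = @_;
    chomp $line;
    # A limit of $n + 2 stops splitting once the wanted field is isolated;
    # everything after it stays together in the last element.
    my @fields = split /\t/, $line, $n + 2;
    return $fields[$n];
}

# Sample line with 31 fields; the 25th field (index 24) is "1".
my $line  = join "\t", (('x') x 24, '1', ('y') x 6);
my $field = field_after_tab($line, 24);
print "matched\n" if defined $field && $field eq '1';
```

Checking defined before comparing avoids an "uninitialized value" warning on lines that have fewer than 25 fields.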