Remove lines from output in Bash that match a huge number of possibilities

Tags: grep, sed, awk

I'm trying to filter the lines of a big text file (around 10 GB) based on the prefix of the called number, but only when the direction column equals 2.

This is the format of the data that I'm getting from a pipe (from a different script):

caller_number=34234234324, clear_number=982545345435, direction=1, ...
caller_number=83479234234, clear_number=348347384533, direction=2, ...

Of course this is just example data; the actual file contains many other columns, but since I only want to filter on the clear_number column based on direction, this is enough.

I want to remove lines whose clear_number does not start with one of a list of prefixes. For example, here I would do that with grep as follows:

grep -vP 'clear_number=(?!(2207891|22034418|22074450|220201677|220240574|220272183|220722988|220723276|220751152|220774457|220794227|220799141|2202000425|2202000939|2202000967)).*direction=2'

This works beautifully. The only problem is that I sometimes get around 10K-50K prefixes, and with that many, grep fails with "grep: regular expression is too large".

Any ideas how else to resolve it using Bash commands?

Update

For example, let's say I have the following:

caller_number=34234234324,     clear_number=982545345435, direction=1
caller_number=83479234234,     clear_number=348347384533, direction=2
caller_number=2342334324,      clear_number=5555345435,   direction=1
caller_number=034082394234324, clear_number=33335345435,  direction=1
caller_number=83479234234,     clear_number=348347384533, direction=2
caller_number=83479234234,     clear_number=444447384533, direction=2
caller_number=83479234234,     clear_number=64237384533, direction=2

and my list.txt contains:

642
3333
534234235

So it should only return the line:

caller_number=83479234234,     clear_number=64237384533, direction=2

since its clear_number starts with 642 and direction=2. In my actual case it will go over the 10 GB text file and return at least 100K results.

Another update

I'm sorry, I wasn't clear about one more thing: I get the lines from a pipe command, so I need to do | awk ... on the output I receive from previous commands.

asked by ufk, Jun 08 '21




5 Answers

With your shown samples, please try the following. Since the OP changed the samples, the code has been updated to match.

awk '
FNR==NR{
  arr[$0]
  next
}
match($0,/clear_number=[^,]*/){
  val=substr($0,RSTART+13,RLENGTH-13)
  for(i in arr){
    if(index(val,i)==1 && $NF=="direction=2,"){
      print
      next
    }
  }
}
' list.txt  Input_file

Explanation: a detailed, annotated version of the above.

awk '                  ##Starting awk program from here.
FNR==NR{               ##Checking condition if FNR==NR which will be TRUE when list.txt is being read.
  arr[$0]              ##Creating arr array with index of current line.
  next                 ##next will skip all further statements from here.
}
match($0,/clear_number=[^,]*/){  ##Using match to match the regex clear_number= up to the 1st occurrence of a comma.
  val=substr($0,RSTART+13,RLENGTH-13)  ##Creating val as the substring of the match after clear_number=.
  for(i in arr){       ##Traversing through arr here.
    if(index(val,i)==1 && $NF=="direction=2,"){ ##Checking condition of index AND last field is direction=2 then do following.
      print            ##Printing current line here.
      next             ##next will skip all further statements from here.
    }
  }
}
' list.txt  Input_file ##Mentioning Input_file names here.
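
Since the question's update says the data arrives from a pipe rather than a file, you can save the program to a file and replace Input_file with - so awk reads stdin after list.txt (a sketch; filter.awk is a hypothetical file name holding the program above, and your_command stands for whatever produces the data):

your_command | awk -f filter.awk list.txt -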
answered by RavinderSingh13, Nov 15 '22


You may try this awk too. Prepending = to each prefix anchors the match immediately after clear_number=, so index($2, s) only succeeds when the prefix starts the number:

your_command |
awk '
FNR == NR {
   rexp["=" $1]
   next
}
$3 == "direction=2" {
   for (s in rexp)
      if (index($2, s)) {
         print
         next
      }
}' list.txt -

caller_number=83479234234,     clear_number=64237384533, direction=2
answered by anubhava, Nov 15 '22


You can use awk to read in the prefixes and then filter the lines:

... | awk -F'[,=[:space:]]+' 'FNR==NR {hash[$0]; next} $6 == 2 {for (key in hash) { if (index($4, key) == 1) { print; next } }}' list.txt - > outputfile

The [,=[:space:]]+ is the field delimiter regex that matches one or more commas, equal signs and whitespace chars.
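
For instance, splitting one of the sample lines with that delimiter shows the field numbering the command relies on (a quick standalone check, not part of the solution):

echo 'caller_number=83479234234,     clear_number=348347384533, direction=2' |
awk -F'[,=[:space:]]+' '{ for (i = 1; i <= NF; i++) print i, $i }'

which prints caller_number as $1, 83479234234 as $2, clear_number as $3, 348347384533 as $4, direction as $5, and 2 as $6.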

The FNR==NR {hash[$0]; next} part reads in the contents of list.txt, which holds the prefixes, one per line.

The $6 == 2 requires Field 6 (direction) to be equal to 2.

Then, {for (key in hash) { if (index($4, key) == 1) { print; next } }} tries to find a hash key that is a prefix of the current Field 4 (the clear number); if one is found, the line is printed and processing proceeds to the next line.
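
Here index(s, t) returns the 1-based position of t within s (0 if absent), so index($4, key) == 1 is true exactly when key is a prefix of the clear number. A quick illustration:

awk 'BEGIN { print index("64237384533", "642"), index("64237384533", "423") }'

prints 1 2: 642 is a true prefix, while 423 only occurs at position 2 and is therefore rejected.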

answered by Wiktor Stribiżew, Nov 15 '22


Closer to what you were originally doing (to be clear, this approach is probably not the best for such large datasets, but someone with smaller files might benefit):

Edit your list.txt to contain patterns instead of bare prefix strings.
If I use

clear_number=123.*direction=2
clear_number=03408.*direction=2
clear_number=4567890.*direction=2

and

caller_number=34234234321,     clear_number=982545345435, direction=1
caller_number=83479234232,     clear_number=123347384533, direction=2
caller_number=2342334323,      clear_number=5555345435,   direction=1
caller_number=834792394234324, clear_number=03408345435,  direction=1
caller_number=56779234235,     clear_number=348347384533, direction=2
caller_number=83479234236,     clear_number=456789084533, direction=2
caller_number=83479234237,     clear_number=64237384533,  direction=2

Then I get this:

$: grep -f list.txt x
caller_number=83479234232,     clear_number=123347384533, direction=2
caller_number=83479234236,     clear_number=456789084533, direction=2

So reversing the match -

$: grep -vf list.txt x
caller_number=34234234321,     clear_number=982545345435, direction=1
caller_number=2342334323,      clear_number=5555345435,   direction=1
caller_number=834792394234324, clear_number=03408345435,  direction=1
caller_number=56779234235,     clear_number=348347384533, direction=2
caller_number=83479234237,     clear_number=64237384533,  direction=2

Converting list.txt from

642
3333
534234235

to

clear_number=642.*direction=2
clear_number=3333.*direction=2
clear_number=534234235.*direction=2

only takes

$: sed -i.bak 's/^/clear_number=/; s/$/.*direction=2/;' list.txt

which will make a backup, too.
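
Putting the two steps together for the piped case described in the question (a sketch; patterns.txt and outputfile are placeholder names, and grep -f with tens of thousands of regex patterns can still be slow, as noted above):

sed 's/^/clear_number=/; s/$/.*direction=2/' list.txt > patterns.txt
your_command | grep -f patterns.txt > outputfile

Use grep -vf instead if you want the complement, as shown earlier.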

answered by Paul Hodges, Nov 15 '22


Here's a much faster solution that changes how the inner loop works. It also borrows from RavinderSingh13's and Wiktor Stribiżew's answers.

FNR==NR{ arr[$0]; next }

$3=="direction=2,"{
    val=substr($2,14)
    for(i=1; i<length(val); i++)
        if(substr(val,1,i) in arr){
            print
            next
        }
}
  • The inner loop goes over the digits of clear_number instead of looping over every key in arr. So, instead of looping 10K-50K times, you loop only up to the number of digits, which is about 12 max as per the given samples (see the trace sketch after this list).
    • The first time around, the loop tests one character from the start, the next time two characters from the start, and so on.
    • i<length(val) is used instead of i<=length(val) since the last character will be ,.
  • $3=="direction=2," is compared first (this saves all the looping if not matched)
  • match($0,/clear_number=[^,]*/) isn't needed because $2 already has this string
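
To make the prefix lookup concrete, here is a small standalone trace (hypothetical, using the asker's sample list) of the inner loop for val="64237384533,":

awk 'BEGIN {
    arr["642"]; arr["3333"]; arr["534234235"]    # the prefix set from list.txt
    val = "64237384533,"
    for (i = 1; i < length(val); i++)            # i=1 tests "6", i=2 tests "64", i=3 tests "642" ...
        if (substr(val, 1, i) in arr) {
            print "matched prefix:", substr(val, 1, i)
            exit
        }
}'

This prints matched prefix: 642 after only three membership tests, regardless of how many prefixes the set holds.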

Save the filter code shown at the top of this answer as script.awk and use it as:

... | mawk -f script.awk list.txt -

Note that I've used mawk in the above command. This awk implementation has fewer features than, say, GNU awk, but gives better performance. I checked the results with version 1.3.4 and it gave the same result as GNU awk.

If you don't have mawk, then you can use LC_ALL=C awk instead of mawk in the above command. See What does LC_ALL=C do? for details.
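
In other words, run the same command with the locale forced to C, which makes awk handle strings byte-wise rather than via locale-aware comparisons and is usually noticeably faster:

... | LC_ALL=C awk -f script.awk list.txt -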


Here's a sample timing result (using mawk):

$ wc data.txt
500000  1500000 36000000 data.txt
$ wc list.txt
12000 12000 73382 list.txt
  • 0m57.477s --> anubhava's solution, but with index($2,s) instead of $2 ~ s
  • 0m59.975s --> RavinderSingh13's solution, but with $NF=="direction=2," compared first
  • 1m1.578s --> Wiktor Stribiżew's solution
  • 0m0.271s --> this solution
answered by Sundeep, Nov 15 '22