I'm trying to filter lines of a big text file (around 10 GB) based on the prefix of the called number, but only when the direction column equals 2.
This is the format of the file that I'm getting from a pipe (from a different script):
caller_number=34234234324, clear_number=982545345435, direction=1, ...
caller_number=83479234234, clear_number=348347384533, direction=2, ...
Of course this is just example data; the actual file contains many other columns, but since I only want to filter the clear_number column based on direction, this is enough.
I want to remove lines whose number does not start with one of a list of prefixes. For example, here I would do that with grep as follows:
grep -vP 'clear_number=(?!(2207891|22034418|22074450|220201677|220240574|220272183|220722988|220723276|220751152|220774457|220794227|220799141|2202000425|2202000939|2202000967)).*direction=2'
This works beautifully. The only problem is that the list of prefixes I get is sometimes around 10K-50K entries. That is a lot of prefixes, and if I try to use them all with grep I get:
grep: regular expression is too large
Any ideas how else to solve this using Bash commands?
Update
For example, let's say I have the following:
caller_number=34234234324, clear_number=982545345435, direction=1
caller_number=83479234234, clear_number=348347384533, direction=2
caller_number=2342334324, clear_number=5555345435, direction=1
caller_number=034082394234324, clear_number=33335345435, direction=1
caller_number=83479234234, clear_number=348347384533, direction=2
caller_number=83479234234, clear_number=444447384533, direction=2
caller_number=83479234234, clear_number=64237384533, direction=2
and my list.txt
contains:
642
3333
534234235
then it should return only the line
caller_number=83479234234, clear_number=64237384533, direction=2
since its clear_number starts with 642 and direction=2. In my actual case it will go over a 10 GB text file and return at least 100K results.
Another update
Sorry, I wasn't clear about one more thing: I get the lines from a pipe, so I need to run | awk ... on the output I receive from the previous commands.
With your shown samples, please try the following. Since the OP has changed the samples, the code below is written for the new ones.
awk '
FNR==NR{
  arr[$0]
  next
}
match($0,/clear_number=[^,]*/){
  val=substr($0,RSTART+13,RLENGTH-13)
  for(i in arr){
    if(index(val,i)==1 && $NF=="direction=2"){
      print
      next
    }
  }
}
' list.txt Input_file
Explanation: a detailed explanation of the above.
awk '                                     ##Starting awk program from here.
FNR==NR{                                  ##Condition FNR==NR is TRUE while list.txt is being read.
  arr[$0]                                 ##Creating arr array with the current line as index.
  next                                    ##next skips all further statements from here.
}
match($0,/clear_number=[^,]*/){           ##Using match to match clear_number= up to the 1st comma.
  val=substr($0,RSTART+13,RLENGTH-13)     ##Creating val, the value part of the matched string.
  for(i in arr){                          ##Traversing through arr here.
    if(index(val,i)==1 && $NF=="direction=2"){  ##If a prefix matches AND the last field is direction=2, then do the following.
      print                               ##Printing the current line.
      next                                ##next skips all further statements from here.
    }
  }
}
' list.txt Input_file                     ##Mentioning Input_file names here.
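Here is a self-contained sketch of this approach on the question's updated samples, under the assumption that direction=2 is the last field with no trailing comma (as in those samples); the file names are only for the demo:

```shell
# Demo files matching the question's updated samples.
cat > list.txt <<'EOF'
642
3333
534234235
EOF

cat > Input_file <<'EOF'
caller_number=34234234324, clear_number=982545345435, direction=1
caller_number=83479234234, clear_number=64237384533, direction=2
EOF

awk '
FNR==NR{ arr[$0]; next }                  # first file: collect prefixes
match($0,/clear_number=[^,]*/){           # second file: extract clear_number value
  val=substr($0,RSTART+13,RLENGTH-13)     # digits after "clear_number="
  for(i in arr){
    if(index(val,i)==1 && $NF=="direction=2"){
      print
      next
    }
  }
}' list.txt Input_file
# prints: caller_number=83479234234, clear_number=64237384533, direction=2
```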
You may try this awk also:
your_command |
awk '
FNR == NR {
rexp["=" $1]
next
}
$3 == "direction=2" {
for (s in rexp)
if (index($2, s)) {
print
next
}
}' list.txt -
caller_number=83479234234, clear_number=64237384533, direction=2
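To make this reproducible end to end, here is a sketch that feeds the question's sample lines through the same program via a pipe (the printf stands in for your_command, and the file name is a demo assumption):

```shell
# Demo prefix list.
cat > list.txt <<'EOF'
642
3333
534234235
EOF

# Pipe the question's sample lines into the awk program.
printf '%s\n' \
  'caller_number=34234234324, clear_number=982545345435, direction=1' \
  'caller_number=83479234234, clear_number=348347384533, direction=2' \
  'caller_number=83479234234, clear_number=64237384533, direction=2' |
awk '
FNR == NR {
  rexp["=" $1]            # store "=<prefix>" so index() only matches right after the =
  next
}
$3 == "direction=2" {
  for (s in rexp)
    if (index($2, s)) {   # "=642" can only occur at the clear_number= boundary
      print
      next
    }
}' list.txt -
# prints: caller_number=83479234234, clear_number=64237384533, direction=2
```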
You can use awk
to read in the prefixes and filter out lines using
... | awk -F'[,=[:space:]]+' 'FNR==NR {hash[$0]; next} $6 == 2 {for (key in hash) { if (index($4, key) == 1) { print; next } }}' list.txt - > outputfile
The [,=[:space:]]+
is the field delimiter regex that matches one or more commas, equal signs and whitespace chars.
The FNR==NR {hash[$0]; next}
part reads in the contents of list.txt
, with the prefixes each on a separate line.
The $6 == 2
requires Field 6 (direction) to be equal to 2
.
Then, {for (key in hash) { if (index($4, key) == 1) { print; next } }}
tries to find a hash
key that is a prefix of the current Field 4; if one is found, it prints the line and proceeds to the next line.
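To see the splitting concretely, this quick check (my own sketch, not part of the answer) prints every field of one sample line:

```shell
# Show how -F'[,=[:space:]]+' splits one sample line into numbered fields.
echo 'caller_number=83479234234, clear_number=64237384533, direction=2' |
awk -F'[,=[:space:]]+' '{ for (i = 1; i <= NF; i++) print i": "$i }'
# prints:
# 1: caller_number
# 2: 83479234234
# 3: clear_number
# 4: 64237384533
# 5: direction
# 6: 2
```

Field 4 is the clear_number value and Field 6 is the direction, matching the $4 and $6 references above.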
Closer to what you were originally doing -
(To be clear, this approach is probably not the best for such large datasets, but someone with smaller files might benefit.)
edit your list.txt
to be patterns instead of just prefix strings.
If I use
clear_number=123.*direction=2
clear_number=03408.*direction=2
clear_number=4567890.*direction=2
and
caller_number=34234234321, clear_number=982545345435, direction=1
caller_number=83479234232, clear_number=123347384533, direction=2
caller_number=2342334323, clear_number=5555345435, direction=1
caller_number=834792394234324, clear_number=03408345435, direction=1
caller_number=56779234235, clear_number=348347384533, direction=2
caller_number=83479234236, clear_number=456789084533, direction=2
caller_number=83479234237, clear_number=64237384533, direction=2
Then I get this:
$: grep -f list.txt x
caller_number=83479234232, clear_number=123347384533, direction=2
caller_number=83479234236, clear_number=456789084533, direction=2
So reversing the match -
$: grep -vf list.txt x
caller_number=34234234321, clear_number=982545345435, direction=1
caller_number=2342334323, clear_number=5555345435, direction=1
caller_number=834792394234324, clear_number=03408345435, direction=1
caller_number=56779234235, clear_number=348347384533, direction=2
caller_number=83479234237, clear_number=64237384533, direction=2
Converting list.txt
from
642
3333
534234235
to
clear_number=642.*direction=2
clear_number=3333.*direction=2
clear_number=534234235.*direction=2
only takes
$: sed -i.bak 's/^/clear_number=/; s/$/.*direction=2/;' list.txt
which will make a backup, too.
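If you'd rather preview the rewrite before editing list.txt in place, the same substitution can be run as a dry run on stdout (a sketch using the prefixes from the question):

```shell
# Preview the prefix-to-pattern rewrite without touching list.txt.
printf '%s\n' 642 3333 534234235 |
sed 's/^/clear_number=/; s/$/.*direction=2/'
# prints:
# clear_number=642.*direction=2
# clear_number=3333.*direction=2
# clear_number=534234235.*direction=2
```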
Here's a much faster solution by changing how the inner loop works. This also uses code from RavinderSingh13 and Wiktor Stribiżew answers.
FNR==NR{ arr[$0]; next }
$3=="direction=2,"{
val=substr($2,14)
for(i=1; i<length(val); i++)
if(substr(val,1,i) in arr){
print
next
}
}
What changed, compared to the answers above:

- The inner loop goes over prefixes of the clear_number value and checks whether each prefix is a key in arr, instead of looping over every key in arr. So, instead of looping 10K-50K times per line, you loop only up to the length of the digits, which is about 12 max as per the given samples.
- i<length(val) is used instead of i<=length(val) since the last character will be ,.
- $3=="direction=2," is compared first (this saves all the looping if it doesn't match).
- match($0,/clear_number=[^,]*/) isn't needed because $2 already has this string.

Save the above code as script.awk
and use it as:
... | mawk -f script.awk list.txt
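Here is a runnable end-to-end check of that invocation (plain awk in place of mawk in case the latter isn't installed; the demo data follows the question's original format, where direction=2, is the third field and more columns follow, since the script compares $3 against "direction=2," with a trailing comma):

```shell
# The prefix-loop script from above, saved as script.awk.
cat > script.awk <<'EOF'
FNR==NR{ arr[$0]; next }
$3=="direction=2,"{
  val=substr($2,14)
  for(i=1; i<length(val); i++)
    if(substr(val,1,i) in arr){
      print
      next
    }
}
EOF

cat > list.txt <<'EOF'
642
3333
EOF

# Demo lines in the original format (direction=2, followed by more columns).
printf '%s\n' \
  'caller_number=1, clear_number=982545345435, direction=1, other=0' \
  'caller_number=2, clear_number=64237384533, direction=2, other=0' |
awk -f script.awk list.txt -
# prints: caller_number=2, clear_number=64237384533, direction=2, other=0
```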
Note that I've used mawk
in the above command. This version of awk
has fewer features than, say, GNU awk
, but gives better performance. I checked the results with version 1.3.4
and it gave the same results as GNU awk
.
If you don't have mawk
, then you can use LC_ALL=C awk
instead of mawk
in the above command. See What does LC_ALL=C do? for details.
Here's a sample timing result (using mawk
):
$ wc data.txt
500000 1500000 36000000 data.txt
$ wc list.txt
12000 12000 73382 list.txt
0m57.477s --> anubhava's solution, but with index($2, s) instead of $2 ~ s
0m59.975s --> RavinderSingh13's solution, but with $NF=="direction=2," compared first
1m1.578s  --> Wiktor Stribiżew's solution
0m0.271s  --> this solution