I'm trying to filter lines of a big text file (around 10 GB) based on the prefix of the called number, but only when the direction column equals 2.
This is the format of the file that I'm getting from a pipe (from a different script):
caller_number=34234234324, clear_number=982545345435, direction=1, ...
caller_number=83479234234, clear_number=348347384533, direction=2, ...
Of course this is just example data; the actual file contains many other columns, but since I only want to filter the clear_number column based on direction, this is enough.
I want to remove lines whose number does not start with one of a list of prefixes. For example, here I would do that with grep as follows:
grep -vP 'clear_number=(?!(2207891|22034418|22074450|220201677|220240574|220272183|220722988|220723276|220751152|220774457|220794227|220799141|2202000425|2202000939|2202000967)).*direction=2'
This works beautifully. The only problem is that the list of prefixes I get is sometimes around 10K-50K entries. That is a lot of prefixes, and if I try to use them all with grep I get:
grep: regular expression is too large
Any ideas how else to solve this using Bash commands?
Update
For example, let's say I have the following:
caller_number=34234234324, clear_number=982545345435, direction=1
caller_number=83479234234, clear_number=348347384533, direction=2
caller_number=2342334324, clear_number=5555345435, direction=1
caller_number=034082394234324, clear_number=33335345435, direction=1
caller_number=83479234234, clear_number=348347384533, direction=2
caller_number=83479234234, clear_number=444447384533, direction=2
caller_number=83479234234, clear_number=64237384533, direction=2
and my list.txt
contains:
642
3333
534234235
then it should return only the line
caller_number=83479234234, clear_number=64237384533, direction=2
since its clear_number starts with 642 and direction=2. In my actual case it will go over a 10 GB text file and return at least 100K results.
Another update
Sorry, I wasn't clear about one more thing: I get the lines from a pipe, so I need to run | awk ... on the output I receive from the previous commands.
With your shown samples, please try the following. Since the OP has changed the samples, the code below is written for the new ones.
awk '
FNR==NR{
  arr[$0]
  next
}
match($0,/clear_number=[^,]*/){
  val=substr($0,RSTART+13,RLENGTH-13)
  for(i in arr){
    if(index(val,i)==1 && $NF=="direction=2"){
      print
      next
    }
  }
}
' list.txt Input_file
Explanation: a detailed explanation of the above.
awk '                                     ##Starting awk program from here.
FNR==NR{                                  ##Condition FNR==NR is TRUE while list.txt is being read.
  arr[$0]                                 ##Creating arr array with the current line as index.
  next                                    ##next skips all further statements from here.
}
match($0,/clear_number=[^,]*/){           ##Using match to match clear_number= up to the 1st comma.
  val=substr($0,RSTART+13,RLENGTH-13)     ##Creating val, the value part of the matched string.
  for(i in arr){                          ##Traversing through arr here.
    if(index(val,i)==1 && $NF=="direction=2"){  ##If a prefix matches AND the last field is direction=2, then do the following.
      print                               ##Printing the current line.
      next                                ##next skips all further statements from here.
    }
  }
}
' list.txt Input_file                     ##Mentioning Input_file names here.
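Here is a self-contained sketch of this approach on the question's updated samples, under the assumption that direction=2 is the last field with no trailing comma (as in those samples); the file names are only for the demo:

```shell
# Demo files matching the question's updated samples.
cat > list.txt <<'EOF'
642
3333
534234235
EOF

cat > Input_file <<'EOF'
caller_number=34234234324, clear_number=982545345435, direction=1
caller_number=83479234234, clear_number=64237384533, direction=2
EOF

awk '
FNR==NR{ arr[$0]; next }                  # first file: collect prefixes
match($0,/clear_number=[^,]*/){           # second file: extract clear_number value
  val=substr($0,RSTART+13,RLENGTH-13)     # digits after "clear_number="
  for(i in arr){
    if(index(val,i)==1 && $NF=="direction=2"){
      print
      next
    }
  }
}' list.txt Input_file
# prints: caller_number=83479234234, clear_number=64237384533, direction=2
```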
You may try this awk also:
your_command |
awk '
FNR == NR {
rexp["=" $1]
next
}
$3 == "direction=2" {
for (s in rexp)
if (index($2, s)) {
print
next
}
}' list.txt -
caller_number=83479234234, clear_number=64237384533, direction=2
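To make this reproducible end to end, here is a sketch that feeds the question's sample lines through the same program via a pipe (the printf stands in for your_command, and the file name is a demo assumption):

```shell
# Demo prefix list.
cat > list.txt <<'EOF'
642
3333
534234235
EOF

# Pipe the question's sample lines into the awk program.
printf '%s\n' \
  'caller_number=34234234324, clear_number=982545345435, direction=1' \
  'caller_number=83479234234, clear_number=348347384533, direction=2' \
  'caller_number=83479234234, clear_number=64237384533, direction=2' |
awk '
FNR == NR {
  rexp["=" $1]            # store "=<prefix>" so index() only matches right after the =
  next
}
$3 == "direction=2" {
  for (s in rexp)
    if (index($2, s)) {   # "=642" can only occur at the clear_number= boundary
      print
      next
    }
}' list.txt -
# prints: caller_number=83479234234, clear_number=64237384533, direction=2
```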
You can use awk
to read in the prefixes and filter out lines using
... | awk -F'[,=[:space:]]+' 'FNR==NR {hash[$0]; next} $6 == 2 {for (key in hash) { if (index($4, key) == 1) { print; next } }}' list.txt - > outputfile
The [,=[:space:]]+
is the field delimiter regex that matches one or more commas, equal signs and whitespace chars.
The FNR==NR {hash[$0]; next}
part reads in the contents of list.txt
, with the prefixes each on a separate line.
The $6 == 2
requires Field 6 (direction) to be equal to 2
.
Then, {for (key in hash) { if (index($4, key) == 1) { print; next } }}
tries to find a hash
key that is a prefix of the current Field 4; if one is found, it prints the line and proceeds to the next line.
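To see the splitting concretely, this quick check (my own sketch, not part of the answer) prints every field of one sample line:

```shell
# Show how -F'[,=[:space:]]+' splits one sample line into numbered fields.
echo 'caller_number=83479234234, clear_number=64237384533, direction=2' |
awk -F'[,=[:space:]]+' '{ for (i = 1; i <= NF; i++) print i": "$i }'
# prints:
# 1: caller_number
# 2: 83479234234
# 3: clear_number
# 4: 64237384533
# 5: direction
# 6: 2
```

Field 4 is the clear_number value and Field 6 is the direction, matching the $4 and $6 references above.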
Closer to what you were originally doing -
(To be clear, this approach is probably not the best for such large datasets, but someone with smaller files might benefit.)
edit your list.txt
to be patterns instead of just prefix strings.
If I use
clear_number=123.*direction=2
clear_number=03408.*direction=2
clear_number=4567890.*direction=2
and
caller_number=34234234321, clear_number=982545345435, direction=1
caller_number=83479234232, clear_number=123347384533, direction=2
caller_number=2342334323, clear_number=5555345435, direction=1
caller_number=834792394234324, clear_number=03408345435, direction=1
caller_number=56779234235, clear_number=348347384533, direction=2
caller_number=83479234236, clear_number=456789084533, direction=2
caller_number=83479234237, clear_number=64237384533, direction=2
Then I get this:
$: grep -f list.txt x
caller_number=83479234232, clear_number=123347384533, direction=2
caller_number=83479234236, clear_number=456789084533, direction=2
So reversing the match -
$: grep -vf list.txt x
caller_number=34234234321, clear_number=982545345435, direction=1
caller_number=2342334323, clear_number=5555345435, direction=1
caller_number=834792394234324, clear_number=03408345435, direction=1
caller_number=56779234235, clear_number=348347384533, direction=2
caller_number=83479234237, clear_number=64237384533, direction=2
Converting list.txt
from
642
3333
534234235
to
clear_number=642.*direction=2
clear_number=3333.*direction=2
clear_number=534234235.*direction=2
only takes
$: sed -i.bak 's/^/clear_number=/; s/$/.*direction=2/;' list.txt
which will make a backup, too.
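If you'd rather preview the rewrite before editing list.txt in place, the same substitution can be run as a dry run on stdout (a sketch using the prefixes from the question):

```shell
# Preview the prefix-to-pattern rewrite without touching list.txt.
printf '%s\n' 642 3333 534234235 |
sed 's/^/clear_number=/; s/$/.*direction=2/'
# prints:
# clear_number=642.*direction=2
# clear_number=3333.*direction=2
# clear_number=534234235.*direction=2
```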
Here's a much faster solution by changing how the inner loop works. This also uses code from RavinderSingh13 and Wiktor Stribiżew answers.
FNR==NR{ arr[$0]; next }
$3=="direction=2,"{
val=substr($2,14)
for(i=1; i<length(val); i++)
if(substr(val,1,i) in arr){
print
next
}
}
What changed, compared to the answers above:

- The inner loop goes over prefixes of the clear_number value and checks whether each prefix is a key in arr, instead of looping over every key in arr. So, instead of looping 10K-50K times per line, you loop only up to the length of the digits, which is about 12 max as per the given samples.
- i<length(val) is used instead of i<=length(val) since the last character will be ,.
- $3=="direction=2," is compared first (this saves all the looping if it doesn't match).
- match($0,/clear_number=[^,]*/) isn't needed because $2 already has this string.

Save the above code as script.awk
and use it as:
... | mawk -f script.awk list.txt
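Here is a runnable end-to-end check of that invocation (plain awk in place of mawk in case the latter isn't installed; the demo data follows the question's original format, where direction=2, is the third field and more columns follow, since the script compares $3 against "direction=2," with a trailing comma):

```shell
# The prefix-loop script from above, saved as script.awk.
cat > script.awk <<'EOF'
FNR==NR{ arr[$0]; next }
$3=="direction=2,"{
  val=substr($2,14)
  for(i=1; i<length(val); i++)
    if(substr(val,1,i) in arr){
      print
      next
    }
}
EOF

cat > list.txt <<'EOF'
642
3333
EOF

# Demo lines in the original format (direction=2, followed by more columns).
printf '%s\n' \
  'caller_number=1, clear_number=982545345435, direction=1, other=0' \
  'caller_number=2, clear_number=64237384533, direction=2, other=0' |
awk -f script.awk list.txt -
# prints: caller_number=2, clear_number=64237384533, direction=2, other=0
```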
Note that I've used mawk
in the above command. This version of awk
has fewer features than, say, GNU awk
, but gives better performance. I checked the results with version 1.3.4
and it gave the same results as GNU awk
.
If you don't have mawk
, then you can use LC_ALL=C awk
instead of mawk
in the above command. See What does LC_ALL=C do? for details.
Here's a sample timing result (using mawk
):
$ wc data.txt
500000 1500000 36000000 data.txt
$ wc list.txt
12000 12000 73382 list.txt
0m57.477s --> anubhava's solution, but with index($2, s) instead of $2 ~ s
0m59.975s --> RavinderSingh13's solution, but with $NF=="direction=2," compared first
1m1.578s  --> Wiktor Stribiżew's solution
0m0.271s  --> this solution