Filter unique parameters from file

Tags:

grep

awk

I have a file that contains URLs with parameters, like the following:

https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
https://example.com/endpoint/endpoint2/?param1=123
https://example.com/endpoint/endpoint2/
https://example.com/endpoint/endpoint5/"//i.example.com/00/s/Nzk5WDEwMjQ=/z/47IAAOSwBu5hXIKF

I need to keep only the URLs with unique parameters. The desired output is:

https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123

I managed to keep only the URLs that have parameters with grep:

grep -E '(\?[a-zA-Z0-9]{1,9}\=)'

But I also need to filter on the parameters at the same time, so I tried awk with the same regex, and it gives an error:

awk '{sub(\?[a-zA-Z0-9]{1,9}\=)} !seen[$0]++'

Update

I am sorry for editing the desired output, but when I tried the scripts I found that there is a lot of garbage in my file that needs to be filtered out too. I tried @James Brown's answer with some editing, and it looks good except for the last line, which unfortunately it does not filter out:

awk -F '?|&' '$2&&!a[$2]++'

To make it clearer why that output is what I want: the 1st line is chosen because it has at least param1, the 2nd line because it has at least param3, and the 3rd line because it has at least param2. The comparison method is to pick each unique parameter once, whether or not it is joined to other parameters with the & character.

asked Oct 24 '21 14:10

Emad

3 Answers

Edited version, after the requirements changed somewhat:

$ awk -F? '{                   # ? as field delimiter
    split($2,b,/&/)            # split at & to get what is between ? and the first &
    if(b[1]!=""&&!a[b[1]]++)   # no ? means no $2
        print
}' file

Output as expected. Original answer was:

~~A short one:~~

$ awk -F? '$2&&!a[$2]++' file

Explained: Split records at ? (-F?). If there is a second field ($2) and (&&) it has not been seen so far, which is tracked by counting the occurrences of each parameter string in the array a (!a[$2]++), output the record.
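As a quick check, the edited version can be wrapped in a small shell function and run against the sample lines from the question. This is only a sketch: the function name `first_param_unique` and the variable `sample` are made up for illustration, and `-F'[?]'` is used as a bracketed spelling of `-F?` on the assumption that it behaves the same way.

```shell
# first_param_unique: hypothetical helper name, not part of the original answer.
# Prints each line whose first query parameter (name=value) has not been seen yet.
first_param_unique() {
    awk -F'[?]' '{
        split($2, b, /&/)                     # b[1] is the text between ? and the first &
        if (b[1] != "" && !a[b[1]]++) print   # no ? means empty $2; keep first occurrence only
    }' "$@"
}

# Sample input copied from the question (the last line is the "garbage" record).
sample='https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
https://example.com/endpoint/endpoint2/?param1=123
https://example.com/endpoint/endpoint2/
https://example.com/endpoint/endpoint5/"//i.example.com/00/s/Nzk5WDEwMjQ=/z/47IAAOSwBu5hXIKF'

printf '%s\n' "$sample" | first_param_unique
```

With this input the function should print the three URLs from the desired output: the fourth line is dropped because param1=123 was already seen, and the last two lines contain no ? at all.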

answered Nov 09 '22 16:11

James Brown

EDIT: The following solution may help when the query string contains both ? and & and we want to consider both parts when removing duplicates.

awk '
/\?/{
  match($0,/\?[^&]*/)
  val=substr($0,RSTART,RLENGTH)
  match($0,/&.*/)
  if(!seen[val]++ && !seen[substr($0,RSTART,RLENGTH)]++){
    print
  }
}' Input_file

2nd solution (this may help when the query strings contain no & parameters): With your shown samples, please try the following awk program.

awk 'match($0,/\?.*$/) && !seen[substr($0,RSTART,RLENGTH)]++' Input_file

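Or the above can be shortened as follows (as per Ed's suggestion). The parentheses are needed so that s is assigned before && is evaluated, because assignment has lower precedence than && in awk:

awk '(s=index($0,"?")) && !seen[substr($0,s)]++' Input_file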

Explanation: The match function matches everything from the first ? to the end of the line. The second condition, joined with &&, is true only the first time a given matched value is seen, so only lines with a unique query string are printed.
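To illustrate what "unique" means for this second solution, here is a minimal sketch that keys on the whole query string. The function name `query_unique` and the short example URLs are hypothetical and chosen only for the demonstration.

```shell
# query_unique: hypothetical helper name for illustration.
# Prints a line only the first time its full query string (from ? to end of line) appears.
query_unique() {
    # The parentheses force s to be assigned before && is evaluated.
    awk '(s = index($0, "?")) && !seen[substr($0, s)]++' "$@"
}

printf '%s\n' \
    'https://example.com/a/?x=1' \
    'https://example.com/b/?x=1' \
    'https://example.com/c/?x=1&y=2' \
    'https://example.com/d/' |
    query_unique
```

Here the second line is dropped because its query string ?x=1 repeats, while the third line is kept because ?x=1&y=2 is a different string. This is why the approach fits only when & parameters are absent or when the entire query string should be compared.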

answered Nov 09 '22 18:11

RavinderSingh13

With GNU awk, you could also match the URL up to the first question mark, and then capture what follows using your initial pattern for the first parameter, ([a-zA-Z0-9]{1,9}=[^&]+): the parameter name, an equals sign, and then any characters except &.

Then you can use the !seen[$0]++ idiom with the value of capture group 1 in place of $0.

awk '
match($0, /https?:\/\/[^?]+\?([a-zA-Z0-9]{1,9}=[^&]+)/, arr) && !seen[arr[1]]++
' file

Output

https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
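The three-argument form of match used above is a GNU awk extension. Where only a POSIX awk is available, a rough equivalent can be sketched with index, substr and sub. This is an approximation under stated assumptions, not the answer's own code: the function name `first_param_portable` is hypothetical, the pattern is anchored at the start of the line, and `+` replaces the `{1,9}` interval because some older awk builds do not support intervals.

```shell
# first_param_portable: hypothetical helper name, a sketch of a portable variant.
first_param_portable() {
    awk '
    # Require a protocol, a path, a ?, and at least one name=value pair.
    /^https?:\/\/[^?]+\?[a-zA-Z0-9]+=[^&]+/ {
        q = substr($0, index($0, "?") + 1)   # everything after the first ?
        sub(/&.*/, "", q)                    # keep only the first name=value pair
        if (!seen[q]++) print                # print the first line seen for each pair
    }' "$@"
}

printf '%s\n' \
    'https://example.com/endpoint/?param1=123&param2=1212' \
    'https://example.com/endpoint/?param3=123&param1=98989' \
    'https://example.com/endpoint/endpoint3/?param2=123' \
    'https://example.com/endpoint/endpoint2/?param1=123' \
    'https://example.com/endpoint/endpoint2/' |
    first_param_portable
```

On these lines the sketch should print the same three URLs as the GNU awk version, since the fourth line repeats param1=123 and the fifth has no query string.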

Using awk, you can check that the string starts with the protocol and contains a question mark.

Then, to get the first parameter only, you can split on ? and & and use the second element of the split as the key for seen:

awk '
/^https?:\/\/[^?]*\?/ && split($0, arr, /[?&]/) > 1 && !seen[arr[2]]++
' file

answered Nov 09 '22 16:11

The fourth bird
