I have a file containing URLs with query parameters, like the following:
https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
https://example.com/endpoint/endpoint2/?param1=123
https://example.com/endpoint/endpoint2/
https://example.com/endpoint/endpoint5/"//i.example.com/00/s/Nzk5WDEwMjQ=/z/47IAAOSwBu5hXIKF
I need to keep only the URLs with unique parameters. The desired output is:
https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
I managed to keep only the URLs that have parameters using grep:
grep -E '(\?[a-zA-Z0-9]{1,9}\=)'
But I also need to deduplicate by parameter at the same time, so I tried awk with the same regex, but it gives an error:
awk '{sub(\?[a-zA-Z0-9]{1,9}\=)} !seen[$0]++'
I am sorry for editing the desired output, but when I tried the scripts I found that there is a lot of garbage in my file that needs to be filtered out too. I tried @James Brown's answer with some editing, and it looks good except for the last line, which unfortunately does not get filtered out:
awk -F '?|&' '$2&&!a[$2]++'
To be clearer about why that output is what I want:
It chose the 1st line because it has at least param1,
the 2nd line because it has at least param3,
the 3rd line because it has at least param2.
The comparison rule is: keep a line only if it has a parameter not seen before, whether or not that parameter is joined to others with the & character.
Edited version after the requirements changed somewhat:
$ awk -F? '{ # ? as field delimiter
split($2,b,/&/) # split at & to get whats between ? and &
if(b[1]!=""&&!a[b[1]]++) # no ? means no $2
print
}' file
Output as expected. Original answer was:
A short one:
$ awk -F? '$2&&!a[$2]++' file
Explained: Split records at ? (-F?) and, if there is a second field ($2) and (&&) it is unique so far, judged by counting the instances of the parameters in the array a (!a[$2]++), output it.
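To see which key each line is deduplicated on, here is a minimal sketch that prints the second field next to the line. The function name show_keys is made up for illustration and is not part of the answer above.

```shell
# Hypothetical helper: print the dedup key (everything after the first "?")
# followed by a tab and the original line; lines without "?" are skipped.
show_keys() {
  awk -F'?' '$2 { print $2 "\t" $0 }' "$@"
}

printf '%s\n' \
  'https://example.com/endpoint/?param1=123&param2=1212' \
  'https://example.com/endpoint/endpoint2/' | show_keys
```

Note that the key here is the whole query string, so two lines are duplicates only if everything after ? is identical, which is why the edited version splits on & as well.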
EDIT: The following solution may help when the query string contains & as well as ? and we want to consider both of them when removing duplicates.
awk '
/\?/{
match($0,/\?[^&]*/)
val=substr($0,RSTART,RLENGTH)
match($0,/&.*/)
if(!seen[val]++ && !seen[substr($0,RSTART,RLENGTH)]++){
print
}
}' Input_file
2nd solution (may help when there are no & parameters in the query string): with your shown samples, please try the following awk program.
awk 'match($0,/\?.*$/) && !seen[substr($0,RSTART,RLENGTH)]++' Input_file
OR the above could be shortened as follows (per Ed's suggestion); note the parentheses around the assignment, since = binds more loosely than &&:
awk '(s=index($0,"?")) && !seen[substr($0,s)]++' Input_file
Explanation: this uses awk's match function to match everything from ? to the end of the line, then adds an AND condition so that a line is printed only if its matched value has not been seen before.
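For anyone unsure what match leaves behind, a small sketch that prints RSTART, RLENGTH and the matched substring may help. The function name show_match and the short sample strings are only for illustration.

```shell
# Hypothetical helper: for each line containing "?", print where the match
# starts (RSTART), how long it is (RLENGTH), and the matched text itself.
show_match() {
  awk 'match($0, /\?.*$/) { print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }' "$@"
}

printf '%s\n' 'a/?x=1' 'no-query-here/' | show_match
```

When there is no match, match returns 0, so the line without ? produces no output.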
With GNU awk, you could also match the URL up to the first occurrence of the question mark, and then capture what follows using your initial pattern for the first parameter ([a-zA-Z0-9]{1,9}=[^&]+), i.e. the parameter name followed by any characters except &. Then you can use the !seen[$0]++ idiom with the value of capture group 1 instead of $0.
awk '
match($0, /https?:\/\/[^?]+\?([a-zA-Z0-9]{1,9}=[^&]+)/, arr) && !seen[arr[1]]++
' file
Output
https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
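If GNU awk is not available, roughly the same idea can be sketched with the two-argument match plus substr, which POSIX awk provides. This is a sketch under assumptions: the function name first_param_unique is made up, the protocol check is dropped, and + replaces {1,9} because some older awks do not support interval expressions.

```shell
# Hypothetical helper: keep a line only if its first name=value pair
# (the text between "?" and the first "&") has not been seen before.
first_param_unique() {
  awk '
    match($0, /\?[a-zA-Z0-9]+=[^&]*/) {
      key = substr($0, RSTART + 1, RLENGTH - 1)  # drop the leading "?"
      if (!seen[key]++) print
    }
  ' "$@"
}

first_param_unique file
```

Lines with no ? (including the garbage line) never match, so they are dropped without any extra test.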
Using awk, you can check that the string starts with the protocol and contains a question mark. Then, to get the first parameter only, you can split on ? and & and use the second part of the split as the key for seen:
awk '
/^https?:\/\/[^?]*\?/ && split($0, arr, /[?&]/) > 1 && !seen[arr[2]]++
' file