Remove duplicate lines based on partial text

I have a long list of URLs stored in a text file which I will go through and download. But before doing this I want to remove the duplicate URLs from the list. One thing to note is that some of the URLs look different but in fact lead to the same page. The unique elements in the URL (aside from the domain and path) are the first two parameters in the query string. So for example, my text file would look like this:

https://www.example.com/page1.html?id=12345&key=dnks93jd&user=399494&group=23
https://www.example.com/page1.html?id=15645&key=fkldf032&user=250643&group=12
https://www.example.com/page1.html?id=26327&key=xkd9c03n&user=399494&group=15
https://www.example.com/page1.html?id=12345&key=dnks93jd&user=454665&group=12

If a unique URL is defined up to the second query string parameter (key), then lines 1 and 4 are duplicates. I would like to completely remove the duplicate lines, not even keeping one of them. In the example above, lines 2 and 3 would remain and lines 1 and 4 would be deleted.

How can I achieve this using basic command line tools?

asked by Matt9Atkins

2 Answers

To shorten the code from the other answer:

awk -F\& 'FNR == NR { url[$1,$2]++; next } url[$1,$2] == 1' urls.txt urls.txt
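
This works because, with & as the only field separator, $1 holds everything up to and including the id parameter and $2 holds the key parameter, so indexing the array on ($1, $2) matches the question's definition of uniqueness. A quick sketch showing how the split falls out on the example file:

$ awk -F\& '{ print $1, "|", $2 }' urls.txt
https://www.example.com/page1.html?id=12345 | key=dnks93jd
https://www.example.com/page1.html?id=15645 | key=fkldf032
https://www.example.com/page1.html?id=26327 | key=xkd9c03n
https://www.example.com/page1.html?id=12345 | key=dnks93jd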
answered by Romeo Ninov


Using awk:

$ awk -F'[?&]' 'FNR == NR { url[$1,$2,$3]++; next } url[$1,$2,$3] == 1' urls.txt urls.txt
https://www.example.com/page1.html?id=15645&key=fkldf032&user=250643&group=12
https://www.example.com/page1.html?id=26327&key=xkd9c03n&user=399494&group=15

Reads the file twice: the first time to count how many times the bits you're interested in occur, the second time to print only the lines whose combination showed up exactly once.
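
If reading the input twice isn't an option (for example, the list arrives on a pipe), the same counting idea works in a single pass by buffering the lines in memory. A minimal sketch, assuming the whole file fits in memory:

$ awk -F'[?&]' '
    { seen[$1,$2,$3]++                    # count each (page, id, key) combination
      line[NR] = $0                       # buffer the original line
      key[NR] = $1 SUBSEP $2 SUBSEP $3 }  # remember its combination
    END {
      for (i = 1; i <= NR; i++)           # replay in order, keeping singletons
        if (seen[key[i]] == 1) print line[i]
    }' urls.txt

This preserves the original order of the surviving lines, at the cost of holding the entire file in an array.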

answered by Shawn