Remove duplicate lines based on partial text

I have a long list of URLs stored in a text file which I will go through and download. But before doing this I want to remove the duplicate URLs from the list. One thing to note is that some of the URLs look different but in fact lead to the same page. The unique elements in the URL (aside from the domain and path) are the first two parameters in the query string. So for example, my text file would look like this:

https://www.example.com/page1.html?id=12345&key=dnks93jd&user=399494&group=23
https://www.example.com/page1.html?id=15645&key=fkldf032&user=250643&group=12
https://www.example.com/page1.html?id=26327&key=xkd9c03n&user=399494&group=15
https://www.example.com/page1.html?id=12345&key=dnks93jd&user=454665&group=12

If a unique URL is defined up to the second query string parameter (key), then lines 1 and 4 are duplicates. I would like to completely remove the duplicate lines, not even keeping one of them. In the example above, lines 2 and 3 would remain and lines 1 and 4 would be deleted.

How can I achieve this using basic command line tools?

asked by Matt9Atkins

2 Answers

To shorten the code from the other answer:

awk -F\& 'FNR == NR { url[$1,$2]++; next } url[$1,$2] == 1' urls.txt urls.txt
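
This works because, with & as the only field separator, $1 holds everything up to and including the id parameter and $2 holds the key parameter, so indexing the array on ($1, $2) matches the question's definition of uniqueness. A quick sketch showing how the split falls out on the example file:

$ awk -F\& '{ print $1, "|", $2 }' urls.txt
https://www.example.com/page1.html?id=12345 | key=dnks93jd
https://www.example.com/page1.html?id=15645 | key=fkldf032
https://www.example.com/page1.html?id=26327 | key=xkd9c03n
https://www.example.com/page1.html?id=12345 | key=dnks93jd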
answered by Romeo Ninov


Using awk:

$ awk -F'[?&]' 'FNR == NR { url[$1,$2,$3]++; next } url[$1,$2,$3] == 1' urls.txt urls.txt
https://www.example.com/page1.html?id=15645&key=fkldf032&user=250643&group=12
https://www.example.com/page1.html?id=26327&key=xkd9c03n&user=399494&group=15

Reads the file twice: the first time to count how many times the bits you're interested in occur, the second time to print only the lines whose combination showed up exactly once.
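
If reading the input twice isn't an option (for example, the list arrives on a pipe), the same counting idea works in a single pass by buffering the lines in memory. A minimal sketch, assuming the whole file fits in memory:

$ awk -F'[?&]' '
    { seen[$1,$2,$3]++                    # count each (page, id, key) combination
      line[NR] = $0                       # buffer the original line
      key[NR] = $1 SUBSEP $2 SUBSEP $3 }  # remember its combination
    END {
      for (i = 1; i <= NR; i++)           # replay in order, keeping singletons
        if (seen[key[i]] == 1) print line[i]
    }' urls.txt

This preserves the original order of the surviving lines, at the cost of holding the entire file in an array.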

answered by Shawn