
Sort and remove duplicates based on column

I have a text file:

$ cat text
542,8,1,418,1
542,9,1,418,1
301,34,1,689070,1
542,9,1,418,1
199,7,1,419,10

I'd like to sort the file based on the first column and remove duplicates using sort, but things are not going as expected.

Approach 1

$ sort -t, -u -b -k1n text
542,8,1,418,1
542,9,1,418,1
199,7,1,419,10
301,34,1,689070,1

It is not sorting based on the first column.

Approach 2

$ sort -t, -u -b -k1n,1n text
199,7,1,419,10
301,34,1,689070,1
542,8,1,418,1

It removes the 542,9,1,418,1 line but I'd like to keep one copy.

It seems that the first approach removes duplicates but does not sort correctly, whereas the second one sorts correctly but removes more than I want. How can I get the correct result?

Yang asked Jul 25 '13 02:07



1 Answer

The problem is that when you give sort a key, -u looks for unique occurrences of that key only, not of the whole line. In Approach 2 the key is the first field, so once the line 542,8,1,418,1 has been output, sort treats the next two lines starting with 542 as duplicates and filters them out.
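
To see the effect in isolation, here is a minimal sketch that recreates the sample file from the question and runs sort -u once with the key restricted to the first field and once with no key at all. The temporary file and the variable names by_first and whole_line are only for illustration. LC_ALL=C is an assumption added here so that the result does not depend on the locale.

```shell
# Recreate the sample file from the question in a temporary location.
tmp=$(mktemp)
printf '%s\n' \
  '542,8,1,418,1' \
  '542,9,1,418,1' \
  '301,34,1,689070,1' \
  '542,9,1,418,1' \
  '199,7,1,419,10' > "$tmp"

# Key restricted to field 1: -u treats all lines whose first field is 542
# as equal, so only one of them survives (three lines of output in total).
by_first=$(LC_ALL=C sort -t, -u -k1,1n "$tmp")
printf 'key on field 1:\n%s\n' "$by_first"

# No key: -u compares whole lines, so only the exact duplicate is dropped
# (four lines of output in total).
whole_line=$(LC_ALL=C sort -u "$tmp")
printf 'whole line:\n%s\n' "$whole_line"

rm -f "$tmp"
```

As for Approach 1, a likely explanation (not verified against the asker's system) is that -k1n makes the key run from field 1 to the end of the line, and in a locale where the comma is the thousands separator, GNU sort -n may read the whole line as one large number. That would be consistent with the order shown in the question.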

Your best bet would be to either sort all columns:

sort -t, -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 -u text

or use awk to filter out duplicate lines and pipe the result to sort:

awk '!_[$0]++' text | sort -t, -nk1,1
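
As a rough check, the two suggestions can be run side by side on the sample data from the question. This is only a sketch: the temporary file, the variable names, and the LC_ALL=C setting are assumptions added for the demonstration, and the awk array is named seen instead of _ purely for readability.

```shell
# Recreate the sample file from the question in a temporary location.
tmp=$(mktemp)
printf '%s\n' \
  '542,8,1,418,1' \
  '542,9,1,418,1' \
  '301,34,1,689070,1' \
  '542,9,1,418,1' \
  '199,7,1,419,10' > "$tmp"

# Option 1: every column is a numeric key, so -u only drops a line
# when all five fields compare equal to those of another line.
all_keys=$(LC_ALL=C sort -t, -k1,1n -k2,2n -k3,3n -k4,4n -k5,5n -u "$tmp")

# Option 2: seen[$0]++ is 0 the first time a line appears, so
# !seen[$0]++ is true only for the first occurrence of each line.
# sort then orders the surviving lines numerically by column 1.
awk_then_sort=$(awk '!seen[$0]++' "$tmp" | LC_ALL=C sort -t, -k1,1n)

printf '%s\n' "$all_keys"
# 199,7,1,419,10
# 301,34,1,689070,1
# 542,8,1,418,1
# 542,9,1,418,1

rm -f "$tmp"
```

On this input both variables hold the same four lines. The options are not identical in general: the first compares fields numerically, so lines such as 01,2 and 1,2 would count as duplicates, while awk compares the text of each line and would keep both.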
jaypal singh answered Sep 30 '22 03:09
