Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Selecting unique lines based on two columns

Tags:

unix

awk

I've been trying to figure out a way to select lines that are unique based on the values of two columns. For example, here is a sample/example of my file:

chr1    10    12
chr1    10    12
chr1    10    11
chr1    9    12
chr2    15    20

And this is what I want my output to look like:

chr1    10    12
chr1    10    11
chr1    9    12
chr2    15    20

I've tried to use this code because I like how it doesn't mess with the order of my file:

awk -F"\t" '!_[$2]++' SNP_positions.txt > SNP_positions_uniq.txt

However, it only captures unique features of one column. Is there a way to edit this to make it work for two columns?

like image 390
cosmictypist Avatar asked Jun 17 '15 15:06

cosmictypist


People also ask

How do I get unique values from two columns?

Find unique/duplicate values between two columns with formula. The following formula can also help you to find the unique values, please do as this: In a blank cell B2, enter this formula =IF(ISNA(VLOOKUP(A2,$C$2:$C$13,1,FALSE)),"Yes",""), and then drag this cell's AutoFill Handle to the cell B15.

How do I highlight unique values between two columns in Excel?

Conditional FormattingNavigate to the "Home" option and select duplicate values in the toolbar. Next, navigate to Conditional Formatting in Excel Option. A new window will appear on the screen with options to select "Duplicate" and "Unique" values. You can compare the two columns with matching values or unique values.

Can I use distinct with multiple columns?

We can use the DISTINCT clause on more than columns in MySQL. In this case, the uniqueness of rows in the result set would depend on the combination of all columns.

How do I find unique two column combinations in SQL?

To select distinct combinations from two columns, you can use CASE statement. Let us create a table with some columns.


2 Answers

You can perfectly use an index that uses more than one field for the array elements:

awk -F"\t" '!seen[$2, $3]++' file

In this case we use $2, $3 as index. This way, we will get all different elements of the tuples ($2, $3).

like image 151
fedorqui 'SO stop harming' Avatar answered Oct 01 '22 00:10

fedorqui 'SO stop harming'


The awk solution provided is great and fast, but I came here myself looking for the sort solution, which is probably slower in almost all cases:

cat file | sort -u -k1,1 -k2,2 -k3,3 -s

I found the answer here: https://stackoverflow.com/a/12546627/778533

like image 34
tommy.carstensen Avatar answered Sep 30 '22 22:09

tommy.carstensen