
How do I remove duplicated SNPs using PLINK?

I am working with PLINK to analyse genome-wide data.

Does anyone know how to remove duplicated SNPs?

user1236418 asked Mar 25 '12 19:03


3 Answers

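In PLINK 1.9, use --list-duplicate-vars suppress-first. It writes the duplicates to a report (plink.dupvar by default) but leaves the first variant of each group out of the list, so excluding the listed variants removes the extra copies and leaves one intact. I've known this to slip up, though.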
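A minimal sketch of that two-step workflow, assuming PLINK 1.9 is installed as plink and the data are in binary (.bed/.bim/.fam) format. The function name remove_duplicate_vars and both prefixes are hypothetical, chosen only for illustration. Note that --list-duplicate-vars groups variants by position and alleles, not by ID.

```shell
# Sketch: drop all but the first variant in each duplicate group (PLINK 1.9).
# remove_duplicate_vars and the prefixes are hypothetical names.
remove_duplicate_vars() {
  in_prefix=$1    # e.g. yourfile (expects yourfile.bed/.bim/.fam)
  out_prefix=$2   # e.g. yourfile.nodups

  # Step 1: write the duplicate report to <out_prefix>.dupvar.
  # ids-only: IDs only, so the report can be fed to --exclude.
  # suppress-first: leave the first member of each group out of the report.
  plink --bfile "$in_prefix" \
        --list-duplicate-vars ids-only suppress-first \
        --out "$out_prefix" || return 1

  # Step 2: exclude the listed variants and write a new binary fileset.
  plink --bfile "$in_prefix" \
        --exclude "${out_prefix}.dupvar" \
        --make-bed --out "$out_prefix"
}
```

Usage would look like: remove_duplicate_vars yourfile yourfile.nodups. It is worth checking the .dupvar report by hand before trusting the result.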

Instead of using --exclude as Davy suggested, you can also use --extract, which keeps a list of SNPs rather than getting rid of one. There's an easy method on any Unix-based system (assuming your data are in PED/MAP format and split by chromosome):

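for i in {1..22}; do
  # MAP columns: chromosome, SNP ID, genetic distance, bp position.
  # Keep the ID of the first SNP seen at each chromosome/position pair.
  awk '!seen[$1, $4]++ { print $2 }' yourfile_chr${i}.map > keepers_chr${i}.txt
done

This creates a keepers_chr${i}.txt file for each chromosome, containing the IDs of SNPs at unique locations. Then run PLINK on each original file with --extract and the matching keepers file, along with --make-bed --out unique_file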

Benjamatic answered Nov 14 '22 07:11


There is no command that does it automatically that I am aware of. The way I have done it in the past is to get a list of the SNPs that are duplicated, change the duplicates to rs1001.dup, for example, and run --update-alleles --update-name. Then create a list of the duplicates, in which every entry ends in .dup, and run --exclude duplicateSNPs.txt --make-bed --out yourfilename.dups.removed

Getting a list of SNPs that are duplicated shouldn't be too hard if you are familiar with R. Sorry to give you a "well just learn X!!!" answer.
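If R is not to hand, the list can also be sketched with awk. This assumes a standard 4-column MAP file (chromosome, SNP ID, genetic distance, bp position) and treats two SNPs as duplicates when they share a chromosome and position. The helper name list_duplicate_snps is hypothetical.

```shell
# Sketch: print the ID of every SNP whose chromosome/position pair
# has already appeared earlier in the file (the first copy is not printed).
# Assumes a 4-column MAP file; list_duplicate_snps is a hypothetical name.
list_duplicate_snps() {
  awk 'seen[$1 FS $4]++ { print $2 }' "$1"
}
```

For example, list_duplicate_snps yourfile.map > duplicateSNPs.txt. If the duplicated copies share the same ID, filtering on that ID will most likely affect every copy, which is why the renaming step above is still needed in that case.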

Davy Kavanagh answered Nov 14 '22 07:11


A couple of other ideas that might be of help/interest:

  1. You can also remove VCF duplicates using bcftools, with the command bcftools norm -D (long form --remove-duplicates). Check the documentation for your version, since newer releases also provide -d/--rm-dup for this. The bcftools documentation can be found at https://samtools.github.io/bcftools/bcftools.html

  2. In the spirit of also just using Unix to remove duplicates, I've previously used the following (input is a compressed VCF file):

     gunzip -c input.vcf.gz | grep -v "^#" | cut -f3 | sort | uniq -d > plink.dupvar

     This writes every variant ID that occurs more than once to plink.dupvar. As far as I know, plink.dupvar is the default name PLINK gives its own --list-duplicate-vars report, so the file can be used in the duplicate removal step in the same way.

Yuri Plotkin answered Nov 14 '22 09:11