I downloaded 1000 genomes data (chromosome 1 -22), which is in VCF format. How I can combine all chromosomes in a single files? Should I first convert all chromosomes into plink binary files and then do the --bmerge mmerge-list
? Or is there any other way to combine them? Any suggestion please?
You could use PLINK as you suggest. You can also use BCFtools:
https://samtools.github.io/bcftools/bcftools.html
Specifically, the concat
command:
bcftools concat ALL.chr{1..22}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -Oz -o ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
If you use PLINK, you will likely encounter issue with 1000 Genomes as it contains multi-allelic SNPs, which is not compatible with PLINK. Also, there are SNPs that have the same RS identifier, which is also not compatible with PLINK.
You will need to resolve these issues to get PLINK to work by splitting multi-allelic SNPs into multiple records and remove records with duplicate RS identifiers (or make a new unique identifier).
Moreover, PLINK binary PED does not support genotype probabilities. I do not recall if 1000 Genomes includes this type of information. If it does and you want to retain it, you cannot convert it to binary PED as the genotype probabilities will be hard-called, see:
https://www.cog-genomics.org/plink2/input
Specifically:
Since the PLINK 1 binary format cannot represent genotype probabilities, calls with uncertainty greater than 0.1 are normally treated as missing, and the rest are treated as hard calls.
So, if you plan to retain VCF format for the output, I recommend against using PLINK.
EDIT
Here is method to convert VCF to PLINK:
To build PLINK compatible files from the VCF files, duplicate positions and SNP id need to be merged or removed. Here I opt to remove all duplicate entries. Catalogue duplicate SNP id:
grep -v '^#' <(zcat ALL.chr${chrom}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz) | cut -f 3 | sort | uniq -d > ${chrom}.dups
Using BCFTools, split multi-allelic SNPs, and using plink remove duplicate SNPs id found in previous step:
bcftools norm -d both -m +any -Ob ALL.chr${chrom}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | plink --bcf /dev/stdin --make-bed --out ${chrom} --allow-extra-chr 0 --memory 6000 --exclude ${chrom}.dups
Importantly, this is not the only way to resolve issue in converting VCF to PLINK. For instance, you can uniquely assign identifiers to duplicate RS id.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With