Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to combine all chromosomes in a single file

I downloaded 1000 genomes data (chromosome 1 -22), which is in VCF format. How I can combine all chromosomes in a single files? Should I first convert all chromosomes into plink binary files and then do the --bmerge mmerge-list? Or is there any other way to combine them? Any suggestion please?

like image 896
bha Avatar asked Mar 06 '23 12:03

bha


1 Answers

You could use PLINK as you suggest. You can also use BCFtools:

https://samtools.github.io/bcftools/bcftools.html

Specifically, the concat command:

bcftools concat ALL.chr{1..22}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -Oz -o  ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

If you use PLINK, you will likely encounter issue with 1000 Genomes as it contains multi-allelic SNPs, which is not compatible with PLINK. Also, there are SNPs that have the same RS identifier, which is also not compatible with PLINK.

You will need to resolve these issues to get PLINK to work by splitting multi-allelic SNPs into multiple records and remove records with duplicate RS identifiers (or make a new unique identifier).

Moreover, PLINK binary PED does not support genotype probabilities. I do not recall if 1000 Genomes includes this type of information. If it does and you want to retain it, you cannot convert it to binary PED as the genotype probabilities will be hard-called, see:

https://www.cog-genomics.org/plink2/input

Specifically:

Since the PLINK 1 binary format cannot represent genotype probabilities, calls with uncertainty greater than 0.1 are normally treated as missing, and the rest are treated as hard calls.

So, if you plan to retain VCF format for the output, I recommend against using PLINK.

EDIT

Here is method to convert VCF to PLINK:

To build PLINK compatible files from the VCF files, duplicate positions and SNP id need to be merged or removed. Here I opt to remove all duplicate entries. Catalogue duplicate SNP id:

grep -v '^#' <(zcat ALL.chr${chrom}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz) | cut -f 3 | sort | uniq -d > ${chrom}.dups

Using BCFTools, split multi-allelic SNPs, and using plink remove duplicate SNPs id found in previous step:

bcftools norm -d both -m +any -Ob ALL.chr${chrom}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | plink --bcf /dev/stdin --make-bed --out ${chrom} --allow-extra-chr 0 --memory 6000 --exclude ${chrom}.dups

Importantly, this is not the only way to resolve issue in converting VCF to PLINK. For instance, you can uniquely assign identifiers to duplicate RS id.

like image 179
Vince Avatar answered May 09 '23 14:05

Vince