Page tree
Skip to end of metadata
Go to start of metadata

1. Overview

The queries below work with the genotype data VCFs from aggV2, present in:

/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data

bcftools version

Please use bcftools version 1.10.2-foss-2018b via: module load bio/BCFtools/1.10.2-foss-2018b

Snippets

Within each snippet shown below, most lines end with the '\' character. This is not part of the command but a shorthand notation meaning "keep reading the next line as part of a single command." We do this to split each snippet over multiple lines so it is easier to read. 

1.1. Extracting genotypes for a single variant

Question: "I want to see the genotypes of all samples in aggV2 for a particular variant I know. For example the chr7: 48866984 - G/A SNP

Script: Use bcftools query. Enter the region (chromosome and position) using -r. The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file. 

#!/bin/bash

module load bio/BCFtools/1.10.2-foss-2018b

bcftools query -r chr7:48866984 \
-f '[%SAMPLE\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/OLD_MULTIALLELIC\t%INFO/OLD_CLUMPED\t%FILTER\t%GT\n]' \
gel_mainProgramme_aggV2_chr7_48864936_51531027.vcf.gz > chr7_48866984_g_a_genotypes.tsv

Output: The output is a tab-delimited file in long-format - where each sample is on a separate line for the same variant. The columns are in the same order as stated in the -f command above. 

Note that if there are multiple variants at the same position (i.e. from a decomposed variant), all variants at the same position will be returned. 

...
LP3000204-DNA_B11       chr7    48866984        G       A       .       .       PASS    0/0
LP3000203-DNA_C11       chr7    48866984        G       A       .       .       PASS    0/1
LP3000202-DNA_F01       chr7    48866984        G       A       .       .       PASS    1/1
LP3000204-DNA_F01       chr7    48866984        G       A       .       .       PASS    0/0
LP3000186-DNA_G01       chr7    48866984        G       A       .       .       PASS    0/1
LP3000192-DNA_D01       chr7    48866984        G       A       .       .       PASS    0/0
...

* data have been randomised and subset

One can also extract additional per-sample FORMAT tags by wrapping the tags in the -f '[ ]' brackets of the bcftools query command. For example if one wanted to see the depth (DP), genotype quality (GQ), and allelic depths (AD) for the variant above per sample, the -f command could be expanded to: 

-f '[%SAMPLE\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/OLD_MULTIALLELIC\t%INFO/OLD_CLUMPED\t%FILTER\t%GT\t%DP\t%GQ\t%AD\n]'

Output: 

...
LP3000204-DNA_B11       chr7    48866984        G       A       .       .       PASS    0/0     21      59      21,0
LP3000203-DNA_C11       chr7    48866984        G       A       .       .       PASS    0/1     28      81      28,15
LP3000202-DNA_F01       chr7    48866984        G       A       .       .       PASS    1/1     17      43      0,15
LP3000204-DNA_F01       chr7    48866984        G       A       .       .       PASS    0/0     36      79      36,0
LP3000186-DNA_G01       chr7    48866984        G       A       .       .       PASS    0/1     36      92      36,36
LP3000192-DNA_D01       chr7    48866984        G       A       .       .       PASS    0/0     26      56      26,0
...

List of regions

A list of regions can also be queried. One must use the -R flag and point to a tab delimited file of CHROM, POS, END.

1.2. Extracting genotypes for a single sample

Question: "I want to see the genotypes of a particular sample in aggV2 for a region of interest. For example all variants for sample LP3000204-DNA_B10 in chr7: 48866084-48866984

Script: Use bcftools query. Enter the sample using -s and the region (chromosome and position) using -r. The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file. 

#!/bin/bash

module load bio/BCFtools/1.10.2-foss-2018b

bcftools query -s LP3000204-DNA_B10 \
-r chr7:48866084-48866984 \
-f '[%SAMPLE\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/OLD_MULTIALLELIC\t%INFO/OLD_CLUMPED\t%FILTER\t%GT\n]' \
gel_mainProgramme_aggV2_chr7_48864936_51531027.vcf.gz > chr7_48866984_48866984_LP3000204-DNA_B10_genotypes.tsv

Output: The output is a tab-delimited file in long-format - where each sample is on a separate line for the same variant. The columns are in the same order as stated in the -f command above. 

...
LP3000204-DNA_B10       chr7    48866319        G       T       .       .       ABratio 0/1
LP3000204-DNA_B10       chr7    48866343        G       A       .       .       PASS    0/1
LP3000204-DNA_B10       chr7    48866346        A       G       .       .       PASS    1/1
LP3000204-DNA_B10       chr7    48866347        G       A       .       .       PASS    1/1
LP3000204-DNA_B10       chr7    48866357        A       C       .       .       PASS    0/0
LP3000204-DNA_B10       chr7    48866370        C       A       chr7:48866370:C/A/T     .       PASS    0/1
...

* data have been randomised and subset

List of samples

A list of samples can also be queried. One must use the -S flag and point to a single column text file of sample IDs. 

1.3. Extracting allele frequencies and site QC metics for a range of variants

Question: "I want to see the allele frequencies of all variants in a region of interest. For example all variants in chr7: 48866084-48866984

Script: Use bcftools query. Enter the region (chromosome and position) using -r. The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file. 

#!/bin/bash

module load bio/BCFtools/1.10.2-foss-2018b

bcftools query -r chr7:48866084-48866984 \
-f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/OLD_MULTIALLELIC\t%INFO/OLD_CLUMPED\t%FILTER\t%INFO/AN\t%INFO/AC\t%INFO/AC_Hom\t%INFO/AC_Het\t%INFO/AC_Hemi\n' \
gel_mainProgramme_aggV2_chr7_48864936_51531027.vcf.gz > chr7_48866984_48866984_LP3000204-DNA_B10_genotypes.tsv

Output: The output is a tab-delimited file in wide-format - where each variant is on a separate line across all samples. The columns are in the same order as stated in the -f command above. 

...
chr7    48866111        G       C       .       .       PASS    156390  222     1       222     0
chr7    48866116        A       C       .       .       PASS    156390  8       4       4       0
chr7    48866117        A       T       .       .       PASS    156390  12       2       2       0
chr7    48866118        T       C       .       .       PASS    156390  24       2       4       0
chr7    48866123        C       T       .       .       PASS    156390  110      4       10      0
chr7    48866128        G       A       .       .       PASS    156390  18       4       8       0
...

* data have been randomised and subset

Allele frequency data for aggV2 are housed in the INFO field of the genotype VCFs as shown below:

##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AC_Hom,Number=A,Type=Integer,Description="Allele counts in homozygous genotypes">
##INFO=<ID=AC_Het,Number=A,Type=Integer,Description="Allele counts in heterozygous genotypes">
##INFO=<ID=AC_Hemi,Number=A,Type=Integer,Description="Allele counts in hemizygous genotypes">

One can also extract per-variant summary statistics and site QC metics using bcftools query and including the INFO tag in the query. For example if one wanted to extract the per-variant median depth, median GQ, ABRatio, and HWE p-value in Europeans, the -f command could be expanded to:

-f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/OLD_MULTIALLELIC\t%INFO/OLD_CLUMPED\t%FILTER\t%INFO/medianDepthAll\t%INFO/medianGQ\t%INFO/ABratio\t%INFO/phwe_eur\n'

Output: 

...
chr7    48866111        G       C       .       .       PASS    19      48      0.942   0.533
chr7    48866116        A       C       .       .       PASS    19      48      1       0.5
chr7    48866117        A       T       .       .       PASS    19      48      1       0.5
chr7    48866118        T       C       .       .       PASS    19      48      1       0.5
chr7    48866123        C       T       .       .       PASS    19      48      1       0.5
chr7    48866128        G       A       .       .       PASS    19      48      1       0.5
...

1.4. Filtering for PASS sites only

Question: "I only want to use PASS sites in my analysis". 

Script: Use bcftools view. Use the -i option to set the FILTER to only include PASS sites. The -Oz option writes the output to a gzipped VCF and the -o option specifies the output VCF file name.

#!/bin/bash

module load bio/BCFtools/1.10.2-foss-2018b

bcftools view -i 'FILTER="PASS"' \
-r chr7:48866084-48866984 \
gel_mainProgramme_aggV2_chr7_48864936_51531027.vcf.gz \
-Oz -o chr7_48866984_48866984_PASS_sites.vcf.gz

Output: Gzipped VCF of PASS sites only for the region of interest. 

One can also include (using -i) or exclude (using -e) variants based on the other values in the FILTER field. For example, if one wanted to extract variants that PASS but also include variants that do not fail HWE for Europeans, the query could be amended to:  

-i 'FILTER="PASS" | FILTER="phwe_eur"' 

One could also exclude any sites that fail GQ but PASS all other site QC metrics by using -e as shown: 

-e 'FILTER~"GQ"'