1. Overview
The queries below work with the genotype data VCFs from aggV2, present in:
/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data
bcftools version
Please use bcftools version 1.10.2-foss-2018b via: module load bio
/BCFtools/1
.10.2-foss-2018b
Snippets
Within each snippet shown below, most lines end with the '\' character. This is not part of the command but a shorthand notation meaning "keep reading the next line as part of a single command." We do this to split each snippet over multiple lines so it is easier to read.
1.1. Extracting genotypes for a single variant
Question: "I want to see the genotypes of all samples in aggV2 for a particular variant I know. For example the chr7: 48866984 - G/A SNP
Script: Use bcftools query. Enter the region (chromosome and position) using -r. The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file.
#!/bin/bash module load bio/BCFtools/1.10.2-foss-2018b bcftools query -r chr7:48866984 \ -f '[%SAMPLE\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/OLD_MULTIALLELIC\t%INFO/OLD_CLUMPED\t%FILTER\t%GT\n]' \ gel_mainProgramme_aggV2_chr7_48864936_51531027.vcf.gz > chr7_48866984_g_a_genotypes.tsv
Output: The output is a tab-delimited file in long-format - where each sample is on a separate line for the same variant. The columns are in the same order as stated in the -f command above.
Note that if there are multiple variants at the same position (i.e. from a decomposed variant), all variants at the same position will be returned.
... LP3000204-DNA_B11 chr7 48866984 G A . . PASS 0/0 LP3000203-DNA_C11 chr7 48866984 G A . . PASS 0/1 LP3000202-DNA_F01 chr7 48866984 G A . . PASS 1/1 LP3000204-DNA_F01 chr7 48866984 G A . . PASS 0/0 LP3000186-DNA_G01 chr7 48866984 G A . . PASS 0/1 LP3000192-DNA_D01 chr7 48866984 G A . . PASS 0/0 ...
* data have been randomised and subset
One can also extract additional per-sample FORMAT tags by wrapping the tags in the -f '[ ]' brackets of the bcftools query command. For example if one wanted to see the depth (DP), genotype quality (GQ), and allelic depths (AD) for the variant above per sample, the -f command could be expanded to:
-f '[%SAMPLE\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/OLD_MULTIALLELIC\t%INFO/OLD_CLUMPED\t%FILTER\t%GT\t%DP\t%GQ\t%AD\n]'
Output:
... LP3000204-DNA_B11 chr7 48866984 G A . . PASS 0/0 21 59 21,0 LP3000203-DNA_C11 chr7 48866984 G A . . PASS 0/1 28 81 28,15 LP3000202-DNA_F01 chr7 48866984 G A . . PASS 1/1 17 43 0,15 LP3000204-DNA_F01 chr7 48866984 G A . . PASS 0/0 36 79 36,0 LP3000186-DNA_G01 chr7 48866984 G A . . PASS 0/1 36 92 36,36 LP3000192-DNA_D01 chr7 48866984 G A . . PASS 0/0 26 56 26,0 ...
List of regions
A list of regions can also be queried. One must use the -R flag and point to a tab delimited file of CHROM, POS, END.
1.2. Extracting genotypes for a single sample
Question: "I want to see the genotypes of a particular sample in aggV2 for a region of interest. For example all variants for sample LP3000204-DNA_B10 in chr7: 48866084-48866984
Script: Use bcftools query. Enter the sample using -s and the region (chromosome and position) using -r. The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file.
#!/bin/bash module load bio/BCFtools/1.10.2-foss-2018b bcftools query -s LP3000204-DNA_B10 \ -r chr7:48866084-48866984 \ -f '[%SAMPLE\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/OLD_MULTIALLELIC\t%INFO/OLD_CLUMPED\t%FILTER\t%GT\n]' \ gel_mainProgramme_aggV2_chr7_48864936_51531027.vcf.gz > chr7_48866984_48866984_LP3000204-DNA_B10_genotypes.tsv
Output: The output is a tab-delimited file in long-format - where each sample is on a separate line for the same variant. The columns are in the same order as stated in the -f command above.
... LP3000204-DNA_B10 chr7 48866319 G T . . ABratio 0/1 LP3000204-DNA_B10 chr7 48866343 G A . . PASS 0/1 LP3000204-DNA_B10 chr7 48866346 A G . . PASS 1/1 LP3000204-DNA_B10 chr7 48866347 G A . . PASS 1/1 LP3000204-DNA_B10 chr7 48866357 A C . . PASS 0/0 LP3000204-DNA_B10 chr7 48866370 C A chr7:48866370:C/A/T . PASS 0/1 ...
* data have been randomised and subset
List of samples
A list of samples can also be queried. One must use the -S flag and point to a single column text file of sample IDs.
1.3. Extracting allele frequencies and site QC metics for a range of variants
Question: "I want to see the allele frequencies of all variants in a region of interest. For example all variants in chr7: 48866084-48866984
Script: Use bcftools query. Enter the region (chromosome and position) using -r. The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file.
#!/bin/bash module load bio/BCFtools/1.10.2-foss-2018b bcftools query -r chr7:48866084-48866984 \ -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/OLD_MULTIALLELIC\t%INFO/OLD_CLUMPED\t%FILTER\t%INFO/AN\t%INFO/AC\t%INFO/AC_Hom\t%INFO/AC_Het\t%INFO/AC_Hemi\n' \ gel_mainProgramme_aggV2_chr7_48864936_51531027.vcf.gz > chr7_48866984_48866984_LP3000204-DNA_B10_genotypes.tsv
Output: The output is a tab-delimited file in wide-format - where each variant is on a separate line across all samples. The columns are in the same order as stated in the -f command above.
... chr7 48866111 G C . . PASS 156390 222 1 222 0 chr7 48866116 A C . . PASS 156390 8 4 4 0 chr7 48866117 A T . . PASS 156390 12 2 2 0 chr7 48866118 T C . . PASS 156390 24 2 4 0 chr7 48866123 C T . . PASS 156390 110 4 10 0 chr7 48866128 G A . . PASS 156390 18 4 8 0 ...
* data have been randomised and subset
Allele frequency data for aggV2 are housed in the INFO field of the genotype VCFs as shown below:
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes"> ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes"> ##INFO=<ID=AC_Hom,Number=A,Type=Integer,Description="Allele counts in homozygous genotypes"> ##INFO=<ID=AC_Het,Number=A,Type=Integer,Description="Allele counts in heterozygous genotypes"> ##INFO=<ID=AC_Hemi,Number=A,Type=Integer,Description="Allele counts in hemizygous genotypes">
One can also extract per-variant summary statistics and site QC metics using bcftools query and including the INFO tag in the query. For example if one wanted to extract the per-variant median depth, median GQ, ABRatio, and HWE p-value in Europeans, the -f command could be expanded to:
-f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/OLD_MULTIALLELIC\t%INFO/OLD_CLUMPED\t%FILTER\t%INFO/medianDepthAll\t%INFO/medianGQ\t%INFO/ABratio\t%INFO/phwe_eur\n'
Output:
... chr7 48866111 G C . . PASS 19 48 0.942 0.533 chr7 48866116 A C . . PASS 19 48 1 0.5 chr7 48866117 A T . . PASS 19 48 1 0.5 chr7 48866118 T C . . PASS 19 48 1 0.5 chr7 48866123 C T . . PASS 19 48 1 0.5 chr7 48866128 G A . . PASS 19 48 1 0.5 ...
1.4. Filtering for PASS sites only
Question: "I only want to use PASS sites in my analysis".
Script: Use bcftools view. Use the -i option to set the FILTER to only include PASS sites. The -Oz option writes the output to a gzipped VCF and the -o option specifies the output VCF file name.
#!/bin/bash module load bio/BCFtools/1.10.2-foss-2018b bcftools view -i 'FILTER="PASS"' \ -r chr7:48866084-48866984 \ gel_mainProgramme_aggV2_chr7_48864936_51531027.vcf.gz \ -Oz -o chr7_48866984_48866984_PASS_sites.vcf.gz
Output: Gzipped VCF of PASS sites only for the region of interest.
One can also include (using -i) or exclude (using -e) variants based on the other values in the FILTER field. For example, if one wanted to extract variants that PASS but also include variants that do not fail HWE for Europeans, the query could be amended to:
-i 'FILTER="PASS" | FILTER="phwe_eur"'
One could also exclude any sites that fail GQ but PASS all other site QC metrics by using -e as shown:
-e 'FILTER~"GQ"'