1. Overview
There is a very useful plugin for bcftools called split-vep (available with bcftools version 1.10.2-foss-2018b) which we will make use of here to query and parse the functional annotation from VEP. Please read the documentation for more information.
bcftools version
Please use bcftools version 1.10.2-foss-2018b via: module load bio
/BCFtools/1
.10.2-foss-2018b
Snippets
Within each snippet shown below, most lines end with the '\' character. This is not part of the command but a shorthand notation meaning "keep reading the next line as part of a single command." We do this to split each snippet over multiple lines so it is easier to read.
The queries below work with the functional annotation data files present in:
/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP
One can view all if the available VEP annotations using the command below - all annotations can be extracted and / or filtered for.
#!/bin/bash module load bio/BCFtools/1.10.2-foss-2018b bcftools +split-vep -l gel_mainProgramme_aggV2_chr7_48864936_51531027_VEPannot.vcf.gz
Output:
... 1 Consequence 2 IMPACT 3 SYMBOL 4 Gene 5 Feature_type 6 Feature 7 BIOTYPE 8 EXON 9 INTRON 10 HGVSc ...
1.1. Extract variants for a gene of interest
Question: "I want to extract all variants in the gene IKZF1 and view some basic annotation".
Script: Use bcftools split-vep. This example will output variants annotated against all transcripts for the gene of interest - IKZF1 - using the -i flag. There will be one annotation per line (for each transcript - using the -d flag). The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file.
#!/bin/bash module load bio/BCFtools/1.10.2-foss-2018b bcftools +split-vep gel_mainProgramme_aggV2_chr7_48864936_51531027_VEPannot.vcf.gz \ -i 'SYMBOL="IKZF1"' -d \ -f '%CHROM\t%POS\t%REF\t%ALT\t%SYMBOL\t%Feature\t%Consequence\t%Existing_variation\n' > IKZF1_variants.tsv
Output: The output is a tab-delimited file in long-format - where each annotation is on a separate line across all variants. The columns are in the same order as stated in the -f command above.
... chr7 50319076 G A IKZF1 ENST00000641948 non_coding_transcript_exon_variant rs374267123 chr7 50319076 G A IKZF1 ENST00000642219 synonymous_variant rs374267123 chr7 50319076 G A IKZF1 ENST00000645066 synonymous_variant rs374267123 chr7 50319076 G A IKZF1 ENST00000646110 synonymous_variant rs374267123 chr7 50319076 G C IKZF1 ENST00000331340 missense_variant rs374267123 chr7 50319076 G C IKZF1 ENST00000343574 missense_variant rs374267123 ...
* data have been randomised and subset
1.2. Extract variants for a gene of interest with a damaging prediction
Question: "I want to view the variants that that are missense or worse and rare in gnomAD in the gene IKZF1".
Script: Use bcftools split-vep. This example will output variants annotated against all transcripts for the gene of interest - IKZF1 - that are missense or worse (-s and -S flag) and rare in Europeans (use the -c flag for numeric conversion). There will be one annotation per line (for each transcript - using the -d flag). The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file.
#!/bin/bash module load bio/BCFtools/1.10.2-foss-2018b bcftools +split-vep gel_mainProgramme_aggV2_chr7_48864936_51531027_VEPannot.vcf.gz \ -i 'SYMBOL="IKZF1" & EUR_AF<0.05' -d \ -f '%CHROM\t%POS\t%REF\t%ALT\t%SYMBOL\t%Feature\t%Consequence\t%Existing_variation\t%EUR_AF\n' \ -c SYMBOL,Feature,Consequence,EUR_AF:Float,Existing_variation \ -s worst:missense+ -S ../../additional_data/vep_severity_scale/VEP_severity_scale_2020.txt > IKZF1_rare_missense.tsv
Output: The output is a tab-delimited file in long-format - where each annotation is on a separate line across all variants. The columns are in the same order as stated in the -f command above.
... chr7 50368156 C T IKZF1 ENST00000413698 missense_variant rs558055360 0 chr7 50368201 A G IKZF1 ENST00000413698 missense_variant rs117111762 0.002 chr7 50368251 A G IKZF1 ENST00000413698 missense_variant rs573829014 0 chr7 50368281 G A IKZF1 ENST00000413698 missense_variant rs562525663 0 chr7 50368296 G A IKZF1 ENST00000413698 missense_variant rs544990441 0 chr7 50376578 G A IKZF1 ENST00000331340 missense_variant rs144637662&COSM6972304 0.001 chr7 50376689 G A IKZF1 ENST00000331340 missense_variant rs549930725 0.001 chr7 50400076 G A IKZF1 ENST00000331340 missense_variant rs148169768 0 chr7 50400355 G C IKZF1 ENST00000331340 missense_variant rs529231990 0 ...
* data have been randomised and subset
Please note that when specifying severity constraints using the -s flag, one should currently specify the severity scale explicitly using the -S flag. This is due to a known bug in bcftools 1.10.x (outdated default severity scale), which will be resolved in bcftools 1.11. The severity scale can be found in the location below:
/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/vep_severity_scale/VEP_severity_scale_2020.txt
1.3. Additional Annotation Queries
Below are some additional queries using split-vep that extract useful annotation.
module load bio/BCFtools/1.10.2-foss-2018b #Example infile infile='gel_mainProgramme_aggV2_chr10_10064150_12359131_VEPannot.vcf.gz' #View all the types of VEP data annotated to file bcftools +split-vep ${infile} -l #If we want to view just all gnomAD annotations we can pipe to grep bcftools +split-vep ${infile} -l | grep -i gnomADg #Check CADD scores bcftools +split-vep ${infile} -f '%CHROM:%POS-%REF/%ALT %CADD_RAW %CADD_PHRED\n' -d #Check LOFTEE bcftools +split-vep ${infile} -f '%CHROM:%POS-%REF/%ALT %LoF %LoF_filter \n' -d #Check gnomAD custom annotations bcftools +split-vep ${infile} -f '%CHROM:%POS-%REF/%ALT %gnomADg_AF %gnomADg_AF_eas %gnomADg_AF_sas\n' -d #Check TOPMed custom annotations bcftools +split-vep ${infile} -f '%CHROM:%POS-%REF/%ALT %topmedg %topmedg_AF topmedg_SVM \n' -d #Check ClinVar custom annotations bcftools +split-vep ${infile} -f '%CHROM:%POS-%REF/%ALT %ClinVar_CLNDN %ClinVar_CLNREVSTAT \n' -d