

1. Overview

This page describes the functional annotation data made available with the aggV2 data release. In this document we provide information on how the functional annotation was conducted, the location of the data, file formats, and some example queries. 

The intention of providing functional annotation data alongside the genomic data is to facilitate analysis using the data. As such, we don't view this functional dataset as 'final', and warmly encourage feedback and suggestions for what could be added to enrich our data further. Such suggestions can be made via service desk tickets (see section "Help & Support" at the end of this page).

2. The functional annotation files for aggV2 

To maintain parity with the genomic data, we have split the functional annotation data into 1371 chunks. Similarly, the functional annotation files are output as compressed VCFs (vcf.gz). The CHROM, POS, REF, ALT, FILTER and INFO fields from the genomic data are preserved for the functional annotation, but genotypes are dropped. All annotations currently released are derived from VEP, as described in the section below. Note that we provide annotations for each variant in aggV2 across all transcripts in which the variant is found, so each variant in aggV2 can be included in multiple rows of the functional annotation dataset.
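Because a variant can carry one annotation per transcript, it is often convenient to flatten the data to one line per variant-transcript pair. The function below is a minimal sketch of that step, not part of the release pipeline. It assumes the standard VEP VCF layout, in which the per-transcript entries of a variant are comma-separated within the CSQ key of the INFO column; if a variant is instead repeated on several rows, the function still emits one line per entry. The function name csq_per_transcript is our own.

```shell
# Read uncompressed VCF text on stdin and print one tab-separated line per
# CSQ entry: CHROM, POS, REF, ALT, then the raw '|'-delimited entry.
# Assumption: per-transcript entries are comma-separated inside INFO/CSQ.
csq_per_transcript() {
    awk -F '\t' -v OFS='\t' '
        /^#/ { next }                        # skip header lines
        {
            n = split($8, info, ";")         # INFO is column 8
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^CSQ=/) {
                    # drop the leading "CSQ=" and split into entries
                    m = split(substr(info[i], 5), entry, ",")
                    for (j = 1; j <= m; j++)
                        print $1, $2, $4, $5, entry[j]
                }
            }
        }
    '
}
```

A possible use is to pipe the records of one chunk file into it, for example with "bcftools view -H <file> | csq_per_transcript".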

2.1. VEP annotation

Annotations are written to the INFO/CSQ field, with '|' as the field separator.
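The order of the '|'-separated subfields is recorded in the VCF header. As a minimal sketch, the functions below list the subfield names and look up the position of one of them. They assume the header contains VEP's usual CSQ description line ending in 'Format: Allele|Consequence|...">'; the function names csq_fields and csq_field_index are our own and are not part of VEP or bcftools.

```shell
# Print the CSQ subfield names, one per line, from VCF header text on stdin.
# Assumes VEP's usual header line:
#   ##INFO=<ID=CSQ,...,Description="... Format: Allele|Consequence|...">
csq_fields() {
    grep -m1 '^##INFO=<ID=CSQ,' \
        | sed -e 's/.*Format: //' -e 's/">$//' \
        | tr '|' '\n'
}

# Print the 1-based position of the subfield named $1 within a CSQ entry.
# Prints nothing if the name is not present in the header.
csq_field_index() {
    csq_fields | grep -F -n -x -- "$1" | cut -d: -f1
}
```

For example, "bcftools view -h <file> | csq_field_index SYMBOL" would print the position of the SYMBOL subfield, which can then be used as a column number when splitting an entry on '|'.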

The specific versions of the VEP components that were run are:

Versions:
  ensembl              : 98.e98e194
  ensembl-funcgen      : 98.36eef94
  ensembl-io           : 98.052d23b
  ensembl-variation    : 98.7b96c96
  ensembl-vep          : 98.2


For the full list of annotations provided, please see the 'VEP annotation' code block below. More information on each of these can be found on the VEP documentation page. File paths to the specific versions of the data used are also given in that command.

2.1.1. Plug-ins and custom annotations

The plug-ins and custom annotations used are described in the table below. Please see the full VEP command for the full file paths and run options.

| Annotation | Type | Version/file | Variants annotated |
|---|---|---|---|
| gnomAD | Custom annotation | V3/gnomad.genomes.r3.0.sites.vcf.bgz | All variants in both aggV2 and gnomAD v3 |
| TOPMed | Custom annotation | Freeze 5/bravo-dbsnp-all.vcf.gz | All variants in both aggV2 and TOPMed Freeze 5 |
| ClinVar | Custom annotation | clinvar_20190219.vcf.gz | All variants in both aggV2 and ClinVar (as of 2019-02-19) |
| PhyloP | Custom annotation | hg38.phyloP100way.bw | All variants |
| GERP | Custom annotation | gerp_conservation_scores.homo_sapiens.GRCh38.bw | All variants |
| LOFTEE | Plugin | grch38 branch, commit: 8c111b75a4642479f24e154fde75320c3b1b369e | Annotates stop-gained, splice site disrupting and frameshift variants |
| CADD | Plugin | v1.5 | See plugin for details |
| SpliceRegion | Plugin | | See plugin for details |
| SpliceAI | Plugin | v1.3 | Variants in the precomputed SpliceAI set (see the note below) |

Note on SpliceAI: this plugin is described in the VEP plugin documentation (look for section "SpliceAI") and, as explained in that section, simply annotates the input using pre-calculated data. Variants that are not within the precomputed set are therefore not annotated. The SpliceAI GitHub page states: "The annotations for all possible substitutions, 1 base insertions, and 1-4 base deletions within genes are available here for download."
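Since some annotations are only present for a subset of variants, a common first step is to keep only the entries in which a given subfield is populated. The function below is a minimal sketch of this. It assumes that its input is one '|'-delimited CSQ entry per line and that a missing annotation appears as an empty subfield, which is how VEP writes missing values in VCF output. The function name csq_nonempty and the column number passed to it are illustrative; the real column number has to be looked up in the CSQ format string in the VCF header.

```shell
# Read '|'-delimited CSQ entries on stdin, one per line, and print only
# those whose subfield number $1 (1-based) is not empty.
# Assumption: a missing annotation is written as an empty subfield.
csq_nonempty() {
    awk -F '|' -v col="$1" '$col != ""'
}
```

For example, if the subfield of interest were the third one, "csq_nonempty 3" would drop every entry of the form "G|intron_variant|" and keep entries such as "G|intron_variant|0.91".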

2.1.1.1. VEP command used

The full VEP command used is:

VEP annotation
#Setup variables and load modules
    mkdir -p ${out}/VEP_annotation2
    module load bcftools/1.10.2
    module load vep/98
    export PERL5LIB=$PERL5LIB:${LOFTEE38}
    export infile=`sed -n "${LSB_JOBINDEX}p" $bedfile`
    export i=`echo $infile | awk -F "_" 'sub(/.bcf/,"",$7) {print $5"_"$6"_"$7}'`
    export outfile=${out}/VEP_annotation2/VEP_annotation_${i}.vcf.gz

#Read bcf input files, strip genotypes and annotate with VEP
    bcftools view ${input}${infile} -G |  \
     bcftools annotate -x ^INFO/OLD_MULTIALLELIC,INFO/OLD_CLUMPED -Ov | \
    vep --cache \
    --offline \
    --format vcf \
    --vcf \
    --assembly GRCh38 \
    --dir_cache /tools/apps/vep/98/ensembl-vep/.vep \
    --cache_version 98 \
    --verbose \
    --species homo_sapiens \
    --no_stats \
    --fasta /public_data_resources/reference/GRCh38/GRCh38Decoy_no_alt.fa \
    --sift b \
    --polyphen b \
    --ccds \
    --uniprot \
    --hgvs \
    --symbol \
    --numbers \
    --domains \
    --regulatory \
    --canonical \
    --protein \
    --biotype \
    --uniprot \
    --tsl \
    --appris \
    --gene_phenotype \
    --af \
    --af_1kg \
    --af_esp \
    --max_af \
    --pubmed \
    --variant_class \
    --mane \
    --overlaps \
    --custom /public_data_resources/gnomad/v3/gnomad.genomes.r3.0.sites.vcf.bgz,gnomADg,vcf,exact,0,AF,AF_afr,AF_amr,AF_asj,AF_eas,AF_sas,AF_fin,AF_nfe,AF_oth,AF_ami,AF_male,AF_female \
    --custom /public_data_resources/TOPMed/allele_frequencies/bravo-dbsnp-all.vcf.gz,topmedg,vcf,exact,0,AF,SVM \
    --custom /public_data_resources/phylop100way/hg38.phyloP100way.bw,PhyloP,bigwig \
    --custom /public_data_resources/vep_resources/Build-38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,GERP,bigwig \
    --custom /public_data_resources/clinvar/20190219/clinvar/vcf_GRCh38/clinvar_20190219.vcf.gz,ClinVar,vcf,exact,0,CLNDN,CLNDNINCL,CLNDISDB,CLNDISDBINCL,CLNHGVS,CLNREVSTAT,CLNSIG,CLNSIGCONF,CLNSIGINCL,CLNVC,CLNVCSO,CLNVI \
    --plugin LoF,loftee_path:${LOFTEE38},human_ancestor_fa:${LOFTEE38HA},gerp_bigwig:${LOFTEE38GERP},conservation_file:${LOFTEE38SQL} \
    --plugin CADD,/public_data_resources/CADD/v1.5/GRCh38/whole_genome_SNVs.tsv.gz \
    --plugin SpliceRegion \
    --plugin SpliceAI,snv=/public_data_resources/SpliceAI/Predicting_splicing_from_primary_sequence-66029966/genome_scores_v1.3/spliceai_scores.raw.snv.hg38.vcf.gz,indel=/public_data_resources/SpliceAI/Predicting_splicing_from_primary_sequence-66029966/genome_scores_v1.3/spliceai_scores.raw.indel.hg38.vcf.gz \
    --compress_output bgzip \
    --force_overwrite \
    --fork 4 \
    --output_file ${outfile}
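The chunk label in each output file name is derived from the input file name by the awk expression in the setup step of the command above: the name is split on "_", the ".bcf" suffix is removed from the seventh field, and fields five to seven are joined. The sketch below wraps the same expression in a function so that it can be tried on a single name. The function name chunk_label and the file name in the example are hypothetical and only illustrate the field positions.

```shell
# Derive the chunk label from an input file name, using the same awk
# expression as the annotation command. The name is split on "_"; the
# ".bcf" suffix is stripped from field 7; fields 5-7 are printed.
# Note: in /.bcf/ the dot matches any character; /\.bcf$/ would be stricter.
chunk_label() {
    echo "$1" | awk -F "_" 'sub(/.bcf/,"",$7) {print $5"_"$6"_"$7}'
}
```

With the hypothetical name "f1_f2_f3_f4_chr1_1_100.bcf", the function prints "chr1_1_100". Names whose seventh field does not contain the suffix produce no output, because the sub() call then returns 0 and the print action is skipped.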



3. Where are the data?

All functional annotation data for aggV2 are stored at:

/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP/

To maintain continuity, we have separated the functional data into the same chunks (with the same coordinates) as the genomic data. The files are named using the convention:

gel_mainProgramme_aggV2_<chunk>_VEPannot.vcf.gz
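As a small sketch, the directory and naming convention above can be combined to build the path of the annotation file for a given chunk. The variable name AGGV2_VEP_DIR, the function name vep_annot_path and the chunk label used in the example are our own illustrations; the real chunk labels should be taken from the genomic data file names.

```shell
# Directory holding the aggV2 functional annotation files (from this page).
AGGV2_VEP_DIR=/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP

# Print the path of the VEP annotation file for the chunk label given as $1.
vep_annot_path() {
    printf '%s/gel_mainProgramme_aggV2_%s_VEPannot.vcf.gz\n' "$AGGV2_VEP_DIR" "$1"
}
```

For a hypothetical chunk label "chrN_start_end", "vep_annot_path chrN_start_end" prints the directory followed by "gel_mainProgramme_aggV2_chrN_start_end_VEPannot.vcf.gz".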

4. Help & Support

Help with aggV2

Please reach out via the Genomics England Service Desk for any issues related to the aggV2 aggregation or companion datasets, including "aggV2" in the title / description of your inquiry.