This documentation is about release v1 of the de novo variant (DNV) research dataset for the 100,000 Genomes Project in the research environment (RE), built upon the Main Programme V9 Data Release (released ). The dataset comprises genome-wide DNV annotation for 13,949 trios from 12,609 families from the rare disease programme.
Genome-wide DNV annotation was performed for all trios that have been successfully run through the Genomics England Rare Disease Interpretation Pipeline as of Data Release v9. The DNV annotation pipeline flags likely DNVs for each trio based on an array of filters that interrogate the multi-sample VCF outputs of the Platypus variant caller. The filters are grouped into two broad categories, base and stringent, and each variant is flagged if it fails any particular filter. We recommend using DNVs that pass the stringent filter in general, as these are more likely to be true DNVs.
It is important to note that we adopt an inclusive approach by flagging likely DNVs in the VCF files without filtering any variants out. We provide all Mendelian inconsistencies in the VCF files for each family, together with a set of associated filters at the variant level, so that researchers can apply their own custom strategies to assess DNVs as required. At the end of this document, we supply code snippets and additional information to help users interrogate the DNV dataset.
The outputs of the DNV research dataset are:
|denovo_cohort_information||A LabKey table with cohort information for all participants included in the DNV dataset. Attributes within this table include: participant ID, sex, affection status, family ID, pedigree ID, and the path to each family's multi-sample VCF with flagged DNVs.|
A LabKey table of all variants that pass base_filter for all trios within the DNV dataset. The table does not include variants that fail the base_filter due to size restrictions, but these can be found in the annotated multi-sample VCFs. This table includes all flags from the DNV annotation pipeline for each variant.
|annotated multi-sample VCFs (family level)|
All multi-sample VCFs per family with DNVs flagged within the FORMAT field. These VCFs are functionally annotated with VEP and accessible within the filesystem. File paths per participant are included in the denovo_cohort_information LabKey table. The data can be found in directory: /gel_data_resources/main_programme/denovo_variant_dataset/
|If you have any queries about the DNV dataset not covered in the FAQ at the end of this page, please contact us via the Genomics England Service Desk.|
The overall workflow of how DNVs are flagged is shown in the diagram below. It includes the general steps of the Genomics England Rare Disease Interpretation Pipeline which are then fed into the de novo variant annotation pipeline. You can read more about the Genomics England Rare Disease Interpretation Pipeline in the Rare Disease Results Guide.
Step 1: Only families containing at least one trio that have successfully run through the Genomics England Rare Disease Interpretation Pipeline are considered in the DNV analysis.
Step 2: The BAMs for each member of the family are then fetched on the filesystem. The alignment is performed by the iSSAC aligner from Illumina.
Step 3: Small variants (SNPs and insertions/deletions up to 1,500bp) are jointly-called (and locally realigned) by family using the Platypus variant caller. This creates a multi-sample VCF for each family and contains the variants for every member of that family.
Note that some families contain additional members to the core trio such as siblings to the proband. We say that these families contain 'nested trios' as the DNV annotation pipeline will be run for each nested trio within a family.
Step 4: For each family, the multi-sample VCF is fed into the Platypus bayesiandenovofilter.py python script. This script accompanies the Platypus variant caller. The pedigree and sex of each sample are also used as input. The script was run with default parameters - without changing the priors. More information can be found in the Additional Information section below. The output of this script is a multi-sample VCF with the same format as the input but only containing Mendelian inconsistencies for each offspring within the family.
In multi-sample VCFs containing nested-trios, Mendelian inconsistencies are reported per trio. Therefore, in a single variant line, one offspring may have a Mendelian inconsistency whilst the other may not.
Step 5: The custom DNV annotation script (process_denovo.R) is run separately by trio (i.e. nested trios within a family are run separately) using the multi-sample VCF as input. This script flags putative DNVs based on a set of filters shown below. The output of this script is a multi-sample VCF by family containing Mendelian inconsistencies for each offspring. The FORMAT column of the VCF is populated with the flags used to show evidence of the variant being a true DNV based on the filters shown below. Each VCF is then functionally annotated using Ensembl's Variant Effect Predictor (VEP).
Step 6: The VCF output for each family from Step 5 is then converted into a single data-frame and made accessible via LabKey. VEP annotation is not included in the LabKey table due to size restrictions (a single variant may have multiple genomic annotations due to the number of transcripts/isoforms).
3. The de novo variant annotation strategy
The DNV annotation strategy was built upon a combination of what is currently known in the literature and by advice and collaboration with members of Matt Hurles's group at the Wellcome Sanger Institute. We would like to thank Joanna Kaplanis and Patrick Short in particular (but any errors or omissions are our own). Initial checks suggest that the currently applied filters yield high specificity but may be filtering out some true de novo single nucleotide variants. We welcome all constructive feedback and any suggestions of how these filters could be modified in subsequent releases; please provide any such input via the Genomics England Service Desk.
The strategy is implemented in the custom R script (process_denovo.R) which is made available within the Research Environment. It comprises a series of four general filters that are grouped in to: Global, Base, Stringent, and Additional. These are used to flag each variant within a trio from a multi-sample VCF of Mendelian inconsistencies (Step 4). The flags for each variant within a trio are presented in the LabKey tables and within the annotated VCFs as described in the next section.
Two global filters are firstly implemented. Beginning with the multi-sample VCF joint-called by family using Platypus, only Mendelian inconsistencies with the FILTER attribute set to PASS are considered. Additionally, only variants on the autosomes (1-22), the X-chromosome (X), and the mitochondrial chromosome (M) are considered (variants on scaffolds for example are excluded).
Four filters are included in the base_filter category and aim to flag variants that are far more likely to be Mendelian inconsistencies than true DNVs.
Within a trio, if any of these four filters are marked as fail, then the base_filter is also set to fail.
Filters marked with an asterisk (*) are amended to account for variants on the X-chromosome as shown below.
The genotype must be heterozygous (1/0 or 0/1) for the offspring and reference homozygous (0/0) in the mother and the father.
|We only consider DNVs arising from a reference background.|
Minimum depth of 20X in the offspring and each of the parents.
|This is taken from the NR FORMAT field in the VCF and ensures that DNVs have good coverage across all members of the trio.|
Maximum depth of 98X in the offspring.
|Following findings from Rahbari et. al, 2016|
|base_filter||All four individual base filters pass.||-|
The two base filters below are adjusted to account for offspring sex and the X-chromosome. Please see the Additional Information section at the bottom of the page to understand how variants are coded on the X-chromosome by Platypus.
|zygosity_filter||We only consider DNVs arising from a reference background.|
|mindepth_filter||This is taken from the NR FORMAT field in the VCF and ensures that DNVs have good coverage across all members of the trio.|
An additional six filters are included in the stringent_filter category and aim to discriminate probable DNVs (that pass the base_filter) from high confidence DNVs. Variants have to pass the base_filter in order to be considered for the stringent_filter.
Within a trio, if any of these six filters are marked as fail, then the stringent_filter is also set to fail.
No more than one read supporting the alternate allele in either the mother or the father.
|This is taken from the NV FORMAT field in the VCF and ensures that there is minimal evidence of residual reads supporting the alternate allele in either parent.|
The AB ratio in the offspring is between 0.3 and 0.7.
|The AB ratio is calculated by dividing the number of reads supporting the variant (NV) to the total number of reads at the variant site (NR). It ensures that the variant is not in allelic imbalance.|
|proximity_filter||DNV is not located within 20bp of another DNV within the same trio.||This only applies for variants that have already passed the base_filter. Therefore if two variants that pass the base_filter are within 20bp of one another within a trio, they are flagged as fail for the proximity_filter.|
No overlap with segmental duplications.
|Taken from UCSC Table Browser for each genome assembly (Segmental Dups). Segmental duplications play an important role in both genomic disease and gene evolution. This track displays an analysis of the global organisation of these long-range segments of identity in genomic sequence.|
No overlap with simple repeat regions.
|Taken from UCSC Table Browser for each genome assembly (Simple Repeats). This track displays simple tandem repeats (possibly imperfect repeats) located by Tandem Repeats Finder (TRF) which is specialised for this purpose.|
No overlap with patch regions.
|Taken from UCSC Table Browser for each genome assembly (Fix Patches). When errors are corrected in the reference genome assembly, the Genome Reference Consortium (GRC) adds fix patch sequences containing the corrected regions.|
|stringent_filter||All base and stringent filters pass.||-|
These additional flags are to aid downstream analysis. They are not used to inform the variant filters.
Flags marked as an asterisk are only available in the LabKey table denovo_flagged_variants and not in the annotated multi-sample VCFs.
|snp_filter*||Flags whether the variant is a Single Nucleotide Polymorphism (SNP).|
|in_par*||Flags whether the variant is in the pseudo-autosomal region (PAR).|
0: Not in PAR;
1: In PAR
|gene_filter*||Flags whether the variant is in a coding exon (taken from GENCODE v32).|
0: In coding exon;
1: Not in coding exon
|multidenovo_filter||No more than one identical Mendelian Inconsistency (by combining the chromosome, position, reference allele, and alternate allele) per trio within the same family. Only applicable for families containing nested trios. For families with nested trios (containing an additional sibling for example), if the same variant is determined to be a Mendelian Inconsistency in both siblings, this flag this is set to 0 (multidenovo).|
1: not multidenovo
Additional filters are not used as filters within the DNV annotation pipeline but can be used as a manual filter for downstream analysis.
As mentioned, the DNV dataset are presented in two formats: Two LabKey tables and annotated multi-samples VCFs per family on the file-system.
We recommend to use DNVs that pass the stringent_filter for general analysis as these are more likely to represent true DNVs.
There are two LabKey tables containing the DNV dataset: denovo_cohort_information and denovo_flagged_variants. These can be accessed from the LabKey Desktop application and found within the main-programme_v9_2020-04-02 project folder. The full column schema and column definitions can be found in the Main Programme V9 Data Dictionary.
This table contains the cohort meta-data for every participant within the DNV dataset. Each row comprises a unique participant per genome assembly (there are a very small number of families that exist on both the GRCh37 and GRCh38 cohorts). The columns detail the information used to perform Rare Disease Interpretation by Genomics England such as the: trio_id, family_id, plate_key, participant_id, relationship_to_offspring, pedigree_id, sex, affection_status (relative to the recruited disease of the proband), and assembly.
The trio_id column is a unique identifier for each trio within a the DNV dataset. It is composed of three parts and follows the format: familyid_interpretationid.trionumber. For example: 080690_10002.2.
The additional columns in the denovo_cohort_information LabKey table are defined as below:
|Flags whether the family contains nested trios. 0: the family is a simple trio; 1: the family contains nested trios (more than one trio).|
The path to the annotated multi-sample VCF by family on the file-system. This VCF contains the flagged DNVs.
|base_filter_total||The total number of distinct variants that pass the base_filter (for offspring only - is blank for mother and father).|
The total number of distinct variants that pass the stringent_filter (for offspring only - is blank for mother and father).
The file-paths to the annotated multi-sample VCFs (in the column vcf_path_flagged_denovo) represent the path on the HPC environment. If you need to access these files on the Desktop environment, please add the '~' suffix, such as '~/gel_data_resources/...'
This table comprises all putative DNVs that pass the base_filter per trio. Each row comprises a unique variant per trio per assembly. The variant-level information is found in the columns: chrom, position, reference, alternate. The trio_id, family_id, and assembly columns are included so that it is possible to join this table onto the denovo_cohort_information table (which contains the cohort meta-data).
Due to size restrictions, only variants that pass the base_filter per trio are included in the LabKey table denovo_flagged_variants. For all Mendelian inconsistencies per trio with flagged DNVs, please use the annotated multi-sample VCFs.
All columns ending with the suffix '_filter' contain the flags (coded as 0 or 1) indicating whether or not the variant within the trio has failed (0) or passed (1) the respective filter. Please see the above section for the list of filters as their pass criteria.
The genotypes for all members of the trio are included in columns offspring_gt, mother_gt, father_gt. The remaining columns contain the VCF FORMAT attributes (such as depth, genotype quality, genotype likelihood) from each sample within the trio. Again, the full column schema and column definitions can be found in the Main Programme V9 Data Dictionary.
The denovo_flagged_variants LabKey table contains the column bayes_factor. This is the output from the Platypus bayesiandenovofilter.py python script. We decided not to include this metric as a filter criteria in the DNV annotation pipeline in this release. Users can use the Bayes Factor, although it is only available for non-duplicated variants from single (non multiplex) trio families. For such variants, it is not possible to attribute the Bayes Factor (coded in the INFO field) to the correct variant ~ sample combination (as explained in the Further Information Section).
The two LabKey tables can be queried for families and DNVs of interest using either the LabKey graphical interface (desktop application) or the LabKey APIs. The Code Book section below shows examples of how to query the DNV dataset in LabKey. Note that the variants in the denovo_flagged_variants table does not include genomic annotation by Ensembl VEP as this causes large amount of row duplication (due to each variant having annotations per transcript).
For all filter columns within the LabKey table, denovo_flagged_variants, a value of 0 indicates that the filter criteria has not been met (FAIL) and a value of 1 means that the filter criteria has been met (PASS).
The output of the DNV annotation pipeline is an annotated multi-sample VCF per family containing all Mendelian inconsistencies with putative DNVs flagged in the FORMAT column and genomic annotation in the INFO column (see Step 5 above in the DNV pipeline).
The annotated multisample VCFs can be found in the filesystem separated by genome assembly under the folders:
The files themselves are named by family_id and interpretation_id with the suffix .denovo.reheader.vcf.gz (for example: 00004-RTD_10001.denovo.reheader.vcf.gz).
Remember these paths correspond to the file-paths on the HPC environment. If you need to access these files on the Desktop environment, please add the '~' suffix, such as '~/gel_data_resources/...'.
Please use the LabKey table denovo_cohort_information to associate the annotated multi-sample VCF file-path (under the column vcf_path_flagged_denovo) with a particular trio_id.
The annotated multi-sample VCFs by family (containing all Mendelian inconsistencies) have the FORMAT column populated with the results of the DNV annotation pipeline under the DE_NOVO_FLAG attribute as shown below.
##FORMAT=<ID=DE_NOVO_FLAG,Number=.,Type=String,Description="Flag from the Genomics England De Novo Flagging Pipeline">
There is a certain order to how this field is populated:
Note that only the offspring will have the flags populated in their DE_NOVO_FLAG FORMAT fields. The mother and father are set to missing "." for this attribute.
Please use the Code Book at the bottom of this page for example scripts on how to query the annotated multi-sample VCFs. We recommend using the command line tool, bcftools, to interrogate the annotated multi-sample VCFs.
The flags in the FORMAT column for the annotated multi-sample VCFs are as follows:
|Flag in FORMAT DE_NOVO_FLAG column||Description|
|base_fail||The variant fails a base_filter and/or a global_filter (PASS variants on chromosomes: 1-22, X, M). The exact base_filter that fails can be found in the LabKey table: denovo_flagged_variants or queried from the VCF FORMAT attributes (we wanted to limit the number of flags in the FORMAT column).|
These flags indicate variants that pass the base_filter but fail one or more of the stringent_filters. Note that variants have to pass the base_filter in order to be considered for the stringent_filter. The particular stringent_filter(s) that fail are listed using a semi-colon separator.
For example if a variant fails just the altreadparent filter, the FORMAT format will be marked as altreadparent.
If a variant fails the altreadparent and the abratio filter, the FORMAT format will be marked as altreadparent;abratio.
|The variant passes the base_filter and stringent_filter.|
To create the annotated multi-sample VCFs (Step 5), all original FILTER field values from the Platypus joint-calling step (Step 3) are stripped and replaced with the flags from the DNV annotation pipeline.
As mentioned, we adopt an inclusive approach for researchers to analyse DNVs by flagging likely DNVs and generally not filtering any variants out. If necessary, one can make use of the additional attributes within the annotated multi-sample VCF to perform custom filtering. Please see the Code Book below for example on how to do this. Below are a list of important attributes one can make use of:
Used in filter
|FILTER||FILTER||The FILTER column of the annotated multi-sample VCF has not been modified from the original Platypus VCF. As only PASS variants are included in the annotated multi-sample VCF from the Platypus bayesiandenovofilter.py script, the values in the FILTER field will always be set to 'PASS'.||Not used.|
|FORMAT||NR||Number of reads covering variant location in this sample.||mindepth_filter, maxdepth_filter, abratio_filter.|
|FORMAT||NV||Number of reads containing variant in this sample.||altreadparent_filter, abratio_filter.|
|FORMAT||GL||Genotype log10-likelihoods for AA,AB and BB genotypes, where A = ref and B = variant.||Used to calculate Bayes Factor.|
Genotype quality as phred score.
|Not used in filters.|
Goodness of fit value.
|Not used in filters.|
|INFO||bayesFactor||Bayes Factor for the de novo model calculated by the Platypus bayesiandenovofilter.py python script.||Not used in filters.|
|INFO||multidenovo_filter||Flag of the the multidenovo filter. This is only applicable for multiplex families containing nested trios as described above.|
Not used in filters.
1: not multidenovo
The table below shows the number of families, trios (including nested trios), families with nested trios, and participants within the DNV dataset for the Main Programme V9 Data Release for the GRCh37 and GRCh38 cohorts, as well as the combined cohort. As shown, the minority of families contain nested trios.
|Number of families (total)||1,762||10,847||12,609|
|Number of trios||1,921||12,028||13,949|
|Number of families (from row 1) with nested trios||151||1,134||1,285|
|Number of participants (total, across all families)||5,447||33,732||39,179|
Please note there are there 11 families that exist on both the GRCh37 and GRCh38 cohorts. Please take this into account when filtering the LabKey tables (use the column: assembly).
The table below shows the distribution per trio of Mendelian inconsistencies, base_filter pass, and stringent_filter pass variants for the GRCh37, GRCh38, and combined cohorts. As is shown below, there is a small handful of trios that lie outside of the expected distribution (high rates of stringent_filter pass DNVs for example). It was found that 12 families did not have any base_filter or stringent_filter pass variants on chromosomes (1:22, X, M) in the GRCh38 cohort.
|Mendelian inconsistencies*||base_filter pass DNVs||stringent_filter_pass DNVs|
|Metric / Cohort||GRCh37||GRCh38||Combined||GRCh37||GRCh38||Combined||GRCh37||GRCh38||Combined|
* Note that trios derived from families containing nested trios are not counted in the Mendelian inconsistencies.
The below plot shows the distribution of total Mendelian inconsistencies per trio (trios derived from families containing nested trios are excluded) per cohort.
Values outside four standard deviations from the mean are not included in the plot.
The below plot shows the distribution of total base_filter pass variants per trio (all trios included). A bimodal distribution was observed (left panel) which upon further inspection was shown to be driven by the distribution of base_filter pass variants for males on the X-chromosome (middle-panel; showing the combined cohort). Additionally, it was observed that on average, more variants pass the base_filter in the GRCh37 cohort than the GRCh38 cohort (right panel; showing the combined cohort).
Values outside four standard deviations from the mean are excluded in the plot.
The below plot shows the distribution of total stringent_filter pass variants per trio (all trios included). After applying the stringent_filter (which includes flagging of problematic genomic regions; such as simple repeats and segmental duplications), a normal distribution of DNVs per trio was observed - centered around a mean of 72 and median of 71 stringent_filter pass DNVs per trio across both cohorts.
Values outside four standard deviations from the mean are not included in the plot.
The plot below shows the percentage of base_filter pass variants that fail each of individual stringent filters.
The plot below shows the distribution of stringent_filter pass DNVs across chromosomes.
This code book is to help users interrogate the DNV dataset with ease. We aim to update this with popular requests from user feedback. Please make sure to submit all scripts to the High-Performance Compute cluster and to not run any scripts directly on the head-node (hpc-prod-grid-login-gecip-01). Most jobs should be fine for submission to the short queue as they will take less than 4 hours to complete - or as an interactive job.
The below code snippets assume that you are working in the HPC environment. Use the LabKey table denovo_cohort_information to fetch the file paths to the annotated multi-sample VCFs for your cohort of interest. Examples of how to do this are in the LabKey section below.
Remember that you will need to change certain file paths to the inputs and outputs depending on your working location on the HPC. If you have any questions about the DNV dataset, please contact the Service Desk.
Location of the annotated multi-sample VCFs
The annotated multi-sample VCFs are found in the directory below under flagged_vcf and split by the GRCh37 and GRCh38 cohorts. In the cohort folders, are the full file paths to all annotated multi-sample VCFs per cohort.
/gel_data_resources/main_programme/denovo_variant_dataset | ├── GRCh37 │ └── 20200326 │ ├── cohort │ └── flagged_vcf └── GRCh38 └── 20200326 ├── cohort └── flagged_vcf
Basic filtering of the annotated multi-sample VCFs
The multi-sample VCFs can be queried using bcftools to include (-i) or exclude (-e) flagged variants that meet certain criteria. The flags are populated in the FORMAT column called DE_NOVO_FLAG as described in this document.
1) Include only stringent_filter pass DNVs in a single family, write results as compressed VCF:
#!/bin/bash module load bcftools/1.10.2 bcftools view -i 'DE_NOVO_FLAG="DENOVO"' 00004-RTD_10001.denovo.reheader.vcf.gz -O z -o 00004-RTD_10001_only_denovo.vcf.gz
2) Include only stringent_filter pass DNVs in a single family, write results as a tab-delimited text file of variants, sample IDs, and genotypes:
#!/bin/bash module load bcftools/1.10.2 bcftools query -f '[%CHROM\t%POS\t%REF\t%ALT\t%SAMPLE\t%GT\n]' -i 'DE_NOVO_FLAG="DENOVO"' 00004-RTD_10001.denovo.reheader.vcf.gz > 00004-RTD_10001_only_denovo_sample_genotypes.tsv
1 76390973 A T LP2000950-DNA_D07 1/0 1 11623896 G A LP2000950-DNA_D07 1/0 1 18103688 C T LP2000950-DNA_D07 0/1 1 23653370 A G LP2000950-DNA_D07 0/1 2 15659761 T C LP2000950-DNA_D07 1/0 2 20627434 C T LP2000950-DNA_D07 0/1 ...
* data have been randomised
3) Include only base_filter pass DNVs in a single family that fail a single stringent_filter (eg simplerepeat), write results as compressed VCF:
#!/bin/bash module load bcftools/1.10.2 bcftools view -i 'DE_NOVO_FLAG="simplerepeat"' 00004-RTD_10001.denovo.reheader.vcf.gz -O z -o 00004-RTD_10001_simplerepeat_fail.vcf.gz
4) Include only base_filter pass DNVs in a single family that fail more than one stringent_filter (eg simplerepeat & abratio), write results as compressed VCF:
#!/bin/bash module load bcftools/1.10.2 bcftools view -i 'DE_NOVO_FLAG="simplerepeat" | DE_NOVO_FLAG="abratio"' 00004-RTD_10001.denovo.reheader.vcf.gz -O z -o 00004-RTD_10001_simplerepeat_abratio_fail.vcf.gz
5) Exclude base_filter fail DNVs and exclude DNVs that fail a stringent filter (eg patch) in a single family, write results as compressed VCF:
#!/bin/bash module load bcftools/1.10.2 bcftools view -e 'DE_NOVO_FLAG="base_fail" | DE_NOVO_FLAG="patch"' 00004-RTD_10001.denovo.reheader.vcf.gz -O z -o 00004-RTD_10001_base_and_patch_fail.vcf.gz
6) For families with nested trios (e.g. multiple offspring) and simple trios, one can print out the sample ID and also the de novo flag value for each sample. Note that the DE_NOVO_FLAG FORMAT attribute for the mother and father samples is always set to "." missing. One can filter these missing samples out by including the flag as shown below (-i 'DE_NOVO_FLAG!="."'):
#!/bin/bash module load bcftools/1.10.2 bcftools query -i 'DE_NOVO_FLAG!="."' -f '[%CHROM\t%POS\t%REF\t%ALT\t%SAMPLE\t%GT\t%DE_NOVO_FLAG\n]' 00058-RTD_10127.denovo.reheader.vcf.gz > 00058-RTD_10127_sample_genotypes.tsv
2 128586362 T G LP3000128-DNA_E05 1/0 DENOVO 2 128586362 T G LP3000135-DNA_E01 0/0 base_fail 2 130215368 C A LP3000128-DNA_E05 0/0 base_fail 2 130215368 C A LP3000135-DNA_E01 1/0 altreadparent;abratio ...
* data have been randomised
7) Example 6 shows the output in long-format (where the variant is duplicated by the number of samples). It is also possible to show the same data in wide-format - with each column being maintaining the same sample id and order in the VCF. Now one variant per line is shown:
#!/bin/bash module load bcftools/1.10.2 bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t[%GT\t%DE_NOVO_FLAG\t]\n' 00058-RTD_10127.denovo.reheader.vcf.gz > 00058-RTD_10127_sample_genotypes.tsv
2 2255217 A G 0/1 DENOVO 0/1 base_fail 0/0 . 0/0 . 2 2270459 AT A 1/0 base_fail 1/0 base_fail 0/0 . 0/0 . 2 2318407 C T 0/1 altreadparent;abratio 0/1 altreadparent;abratio 0/0 . 0/0 . ...
* data have been randomised
More advanced filtering of the annotated multi-sample VCFs
It is also possible to perform more advanced filtering by querying the attributes within the INFO field; such as the genomic annotation from VEP. Also, multiple filtering steps can be chained together in the same command to build complex queries.
There is a very useful plugin for bcftools called split-vep which we will make use of here to query and parse the genomic annotation from VEP (available for recent versions of bcftools). Please read their documentation for more information.
1) Include only variants in a single family for a gene of interest (eg ACTL6B), write results as a tab-delimited text file of variants with specified genomic annotation. This example will output variants annotated against all transcripts for gene of interest, there will be one annotation per line (for each transcript - using the -d flag). The annotation columns to be printed out can be manually selected (gnomAD_NFE, Consequence etc). It is also possible to query many genes at a time (such as: -i 'SYMBOL="ACTL6B" | SYMBOL="IL9R"'), or by applying a filter and using grep on the output:
#!/bin/bash module load bcftools/1.10.2 bcftools +split-vep 00004-RTD_10001.denovo.reheader.vcf.gz -d -f '%CHROM\t%POS\t%REF\t%ALT\t%SYMBOL\t%Feature\t%Consequence\t%gnomAD_NFE_AF\t%Existing_variation\n' -i 'SYMBOL="ACTL6B"' > 00004-RTD_10001_gene_only.tsv
7 100244267 C T ACTL6B ENST00000160382 missense_variant . rs1131695228 7 100244267 C T ACTL6B ENST00000461605 downstream_gene_variant . rs1131695228 7 100244267 C T ACTL6B ENST00000485601 downstream_gene_variant . rs1131695228 7 100244267 C T ACTL6B ENST00000487125 non_coding_transcript_exon_variant . rs1131695228 7 100244267 C T ACTL6B ENST00000487225 downstream_gene_variant . rs1131695228 7 100244267 C T ACTL6B ENST00000489904 downstream_gene_variant . rs1131695228
* data have been randomised
2) It is possible to firstly extract stringent_filter pass variants and pipe this into +split-vep then write results as a tab-delimited text file of variants with specified genomic annotation:
#!/bin/bash module load bcftools/1.10.2 bcftools view -i 'DE_NOVO_FLAG="DENOVO"' 00004-RTD_10001.denovo.reheader.vcf.gz |\ bcftools +split-vep -d -f '%CHROM\t%POS\t%REF\t%ALT\t%SYMBOL\t%Feature\t%Consequence\t%gnomAD_NFE_AF\t%Existing_variation\n' -i 'SYMBOL="ACTL6B"' > 00004-RTD_10001_denovo_gene_only.tsv
3) One can also perform more complex filtering such as scanning for all stringent_filter pass variants, that are missense or worse (-s and -S flags) and rare in gnomAD (use the -c flag for numeric conversion). Write results as a tab-delimited text file of variants with specified genomic annotation. The sample id is also printed here so that it can be joined to other tables:
#!/bin/bash module load bcftools/1.10.2 bcftools view -i 'DE_NOVO_FLAG="DENOVO"' 00004-RTD_10001.denovo.reheader.vcf.gz |\ bcftools +split-vep -d -f '[%SAMPLE\t%CHROM\t%POS\t%REF\t%ALT\t%DE_NOVO_FLAG\t%SYMBOL\t%Feature\t%Consequence\t%gnomAD_NFE_AF\t%Existing_variation\n]' \ -c SYMBOL,Feature,Consequence,gnomAD_NFE_AF:Float,Existing_variation -i 'gnomAD_NFE_AF<0.001' -s worst:missense+ -S VEP_severity_scale_2020.txt > 00004-RTD_10001_denovo_rare_missense.tsv
Please note that when specifying severity constraints using the -s flag, one should currently specify the severity scale explicitly using the -S flag. This is due to a known bug in bcftools 1.10.x (outdated default severity scale), which will be resolved in bcftools 1.11 - "VEP_severity_scale_2020.txt" is an example file for the current severity scale, which will be used in the relevant examples on this page:
# Default consequence substrings ordered in ascending order by severity. # Consequences with the same severity can be put on the same line in arbitrary order. # See also https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html intergenic feature_truncation feature_elongation regulatory TF_binding_site TFBS downstream upstream non_coding_transcript intron NMD_transcript non_coding_transcript_exon 5_prime_utr 3_prime_utr coding_sequence mature_miRNA stop_retained start_retained synonymous incomplete_terminal_codon splice_region missense inframe protein_altering transcript_amplification exon_loss disruptive start_lost stop_lost stop_gained frameshift splice_acceptor splice_donor transcript_ablation
4) The Bayes Factor for each variant from the Platypus bayesiandenovofilter.py script can also be easily extracted from the INFO field if required, write results as a tab-delimited text file of variants:
#!/bin/bash module load bcftools/1.10.2 bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO/bayesFactor\n' 00004-RTD_10001.denovo.reheader.vcf.gz > 00004-RTD_10001_bayes_factor.tsv
5) Likewise the multidenovo_filter can also be extracted from the INFO field
#!/bin/bash module load bcftools/1.10.2 bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO/multidenovo_filter\n' 00004-RTD_10001.denovo.reheader.vcf.gz > 00004-RTD_10001_multidenovo_filter.tsv
Iterating through the cohort of annotated multi-sample VCFs
There are a few ways to perform queries across the entire cohort to extract DNVs of interest using in a single command. This can be done using an array job; however due to the small size of the annotated multi-sample VCFs, it is also feasible to run a query in a single while loop.
We have deposited a text file for each cohort (GRCh37 & GRCh38) under the folders:
These file contains the full file paths to every family's annotated multi-sample VCF in the cohort. They can be used to iterate through all annotated multi-sample VCF and extract variants of interest. If you want to do this for only a subset of the cohort, then you can use the vcf_path_flagged_denovo column in the denovo_cohort_information LabKey table to fetch the file paths for your custom cohort.
1) In this example, all annotated multi-sample VCFs from the GRCh37 cohort will be interrogated. For each VCF in the list, cohorts_grch37_vcfs.tsv, it will first filter for stringent_filter pass DNVs only (-i 'DE_NOVO_FLAG="DENOVO"'), it will then print out specified genomic annotation that match the criteria: DNV is missense or worse and is in gene IL6. The sample id will also be printed out to join back onto other tables:
#!/bin/bash module load bcftools/1.10.2 while read -r vcf ; do bcftools view -i 'DE_NOVO_FLAG="DENOVO"' $vcf |\ bcftools +split-vep -d -f '[%SAMPLE\t%GT\t%CHROM\t%POS\t%REF\t%ALT\t%SYMBOL\t%Feature\t%Consequence\t%gnomAD_NFE_AF\t%Existing_variation\n]' \ -i 'SYMBOL="IL6"' -s worst:missense+ -S VEP_severity_scale_2020.txt ; done < /gel_data_resources/main_programme/denovo_variant_dataset/GRCh37/20200326/cohort/grch37_cohort_1762.tsv >> output.tsv
Performing custom filtering
One may want to perform DNV flagging based on custom filters, starting with all Mendelian inconsistencies for all families. This is possible by extracting the necessary fields from the VCF such as coverage, number of reads supporting the variant, and genotype. The below example shows how to extract these values from the VCF for each sample which can then be later processed in R/Python etc to develop the custom filters. The LabKey table denovo_cohort_information can then be used to associate the VCF with the correct trio. Please see the VCF header for a description of each attribute within the VCFs.
#!/bin/bash module load bcftools/1.10.2 bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t[%GT\t%NR\t%NV\t]\n' 00058-RTD_10127.denovo.reheader.vcf.gz > 00058-RTD_10127_attributes.tsv
The two LabKey tables denovo_cohort_information and denovo_flagged_variants can be queried though either the LabKey desktop application or through the LabKey APIs. Please see the documentation for the LabKey Desktop application and the APIs.
The desktop application allows users to browse the data and perform basic filtering on an individual table. Complete tables or filtered tables can then be exported as text or .xlsx files for downstream analysis. Alternatively, users can interrogate the tables using the LabKey APIs (written in multiple languages including R and Python). We recommend using the LabKey APIs as one can perform more complex and easily reproducible queries on the data.
Examples of how to query the denovo_cohort_information and denovo_flagged_variants table using the R LabKey API (Rlabkey package) are listed here. We make use of the SQL like syntax to query the tables. You can use RStudio or R in the terminal to run these examples.
Fetching entire tables
1) These examples show how to fetch an entire table and save as an object in your R session. Tables can then be post-filtered in R. Both tables are fetched in this example.
library(Rlabkey) labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/") main_programme <- "/main-programme/main-programme_v9_2020-04-02" denovo_cohort_information <- labkey.executeSql( schemaName ="lists", colNameOpt ="rname", maxRows = 100000000, folderPath = main_programme, sql = "SELECT * FROM denovo_cohort_information" ) denovo_flagged_variants <- labkey.executeSql( schemaName ="lists", colNameOpt ="rname", maxRows = 100000000, folderPath = main_programme, sql = "SELECT * FROM denovo_flagged_variants" )
Filtering tables and selecting columns
1) Tables can be filtered for certain attributes and specified columns selected in Rlabkey using SQL. In this example, we can filter the denovo_flagged_variants table for variants that pass the stringent_filter in a particular family. We then filter for a certain variant of interest, by chromosome and position. The trio_id, chrom, position, reference, alternate columns are then returned.
library(Rlabkey) labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/") main_programme <- "/main-programme/main-programme_v9_2020-04-02" denovo_cohort_information <- labkey.executeSql( schemaName ="lists", colNameOpt ="rname", maxRows = 100000000, folderPath = main_programme, sql = "SELECT trio_id, chrom, position, reference, alternate FROM denovo_flagged_variants WHERE trio_id = '00004-RTD_10001.1' AND stringent_filter = '1' AND chrom = '1' AND position = '101888210'" )
2) One can also perform a query across the entire cohort to find DNVs in a particular gene of interest. In this example, the denovo_flagged_variants table if firstly filtered for the GRCh38 cohort and for a chromosomal range representing the coordinates of the gene. The variants are filtered for stringent_filter pass and the the trio_id, chrom, position, reference, alternate columns are returned.
library(Rlabkey) labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/") main_programme <- "/main-programme/main-programme_v9_2020-04-02" denovo_cohort_information <- labkey.executeSql( schemaName ="lists", colNameOpt ="rname", maxRows = 100000000, folderPath = main_programme, sql = "SELECT trio_id, chrom, position, reference, alternate FROM denovo_flagged_variants WHERE assembly = 'GRCh38' AND stringent_filter = '1' AND chrom = '7' AND position >= '50303465' AND position <= '50405101'" )
Joining multiple LabKey tables together
1) The usefulness of the APIs comes when joining multiple tables together. In this example, we perform the query above to find stringent_filter pass DNVs in a gene of interest from the denovo_flagged_variants table. We then bind these results together with useful columns of the denovo_cohort_information table to fetch the offspring's: sex, affection_status, and file path to the annotated multi-sample VCF.
library(Rlabkey) labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/") main_programme <- "/main-programme/main-programme_v9_2020-04-02" denovo_cohort_information <- labkey.executeSql( schemaName ="lists", colNameOpt ="rname", maxRows = 100000000, folderPath = main_programme, sql = "SELECT ci.participant_id, ci.trio_id, ci.family_id, ci.sex, ci.affection_status, ci.vcf_path_flagged_denovo, fv.chrom, fv.position, fv.reference, fv.alternate FROM denovo_flagged_variants AS fv LEFT JOIN denovo_cohort_information AS ci ON fv.trio_id = ci.trio_id WHERE fv.assembly = 'GRCh38' AND fv.stringent_filter = '1' AND fv.chrom = '7' AND fv.position >= '50303465' AND fv.position <= '50405101' AND ci.member = 'Offspring'" )
2) The remaining phenotype tables can then be joined onto the filtered DNV dataset created above. We now bind the participant's recruited disease using the normalised_specific_disease column from the rare_diseases_participant_disease table.
library(Rlabkey) labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/") main_programme <- "/main-programme/main-programme_v9_2020-04-02" denovo_cohort_information <- labkey.executeSql( schemaName ="lists", colNameOpt ="rname", maxRows = 100000000, folderPath = main_programme, sql = "SELECT ci.participant_id, ci.trio_id, ci.family_id, ci.sex, ci.affection_status, ci.vcf_path_flagged_denovo, rd.normalised_specific_disease, fv.chrom, fv.position, fv.reference, fv.alternate FROM denovo_flagged_variants AS fv LEFT JOIN denovo_cohort_information AS ci ON fv.trio_id = ci.trio_id LEFT JOIN rare_diseases_participant_disease AS rd ON ci.participant_id = rd.participant_id WHERE fv.assembly = 'GRCh38' AND fv.stringent_filter = '1' AND fv.chrom = '7' AND fv.position >= '50303465' AND fv.position <= '50405101' AND ci.member = 'Offspring'" )
This section attempts to describe additional information that may be required for certain analyses as well as a general FAQ. We will be updating this section with feedback form internal and external users.
More questions will be answered here.
The commit version used for the Platypus bayesiandenovofilter.py python script: 414ca566f8b2269d0caae726b21f87e03783ca1a (latest at time of analysis).
There are slight differences in how DNVs are tiered in the Interpretation Pipeline (see the Rare Disease Results Guide), which are summarised here:
This may lead to discrepancies in the tiered variants observed in the tiering_data LabKey table. Note that the gene panels applied to the family will determine the tier.
All variants in the annotated multi-sample VCFs were genomically annotated using Ensembl Variant Effect Predictor (VEP) version 98. A FASTA file per genome assembly was provided to allow for HGVS annotations. Variants were annotated using the --everything flag. An additional custom VCF from Clinvar (version 20191105) was used for detailed Clinvar annotation.
There is no genomic annotation in the denovo_flagged_variants LabKey table because variant annotation leads to large amount of row duplication in tables. This is because each variant is annotated against every transcript of the effect gene. Therefore a single variant will be represented on many rows (perhaps more than 10). As the denovo_flagged_variants table is already large, it would not have been feasible to do. The full, comprehensive annotation can be found in the annotated multi-sample VCFs as described as queried by using the example scrips in the Code Book.
This is likely to be because at the time of generating the dataset, the family had not yet successfully run through the Rare Disease Interpretation Pipeline. Only families that have run through the pipeline are considered for DNV analysis as these families have passed the relevant genomic vs reported checks (including Mendelian inconsistencies, identity by descent, reported vs genetic sex checks). The DNV dataset will be updated to include these families when they have run through the pipeline.
It is key to remember that the variants in this dataset are called by the Platypus Variant caller and not the Starling small variant caller by Illumina. The genomes in the folder /genomes/by_date/... are called by the Illumina Starling variant caller which may explain why certain variants are not identical between callers. We plan to make available all of the multi-sample Platypus VCFs joint-called by family in the Research Environment as soon as possible.
Please see the description below explaining why the Bayes Factor is nullified for a subset of variants found in families containing nested trios. These variants are flagged using the multidenovo_filter.
You can ignore the warning below, it has no effect on the query. It is a carry over from the original Platypus variant calls.
[W::bcf_hdr_check_sanity] GL should be declared as Number=G
The samples in the multi-sample VCF are not always in the same order (i.e. not standardised in the order of Father, Mother, Offspring). The two files shown below show the order of the samples within the VCFs:
These tell you the VCF filename, the sample IDs in each VCF, the pedigree (offspring, mother, father), and the position of the sample in the VCF (1, 2, 3, n).
You can use these files alongside bcftools to reorder the VCFs before you do any processing, such as as example for a single VCF below:
bcftools view -s $(grep 00003-RTD_10028.denovo.reheader.vcf.gz ../vcf_order/sample_order_in_vcf_grch38.tsv | sort -k3 | cut -f 2 | paste -sd ",") 00003-RTD_10028.denovo.reheader.vcf.gz
This will rearrange the VCF so that the samples are always ordered by Father, Mother, Offspring. Remember that for families with nested trios this is slightly more complicated – as they have multiple offspring.
More explanations will be populated here.
For families containing nested trios, the Platypus bayesiandenovofilter.py writes a separate line for each Mendelian inconsistency per trio. Therefore, duplicate variant lines exist in the original output VCF from the bayesiandenovofilter.py script if the same Mendelian inconsistency is identified in more than one trio. This causes issues for the Bayes Factor as this is written to the INFO filed of the VCF which is associated with the variant and not the sample. Due to this, it is not possible to attribute the correct Bayes Factor value in the INFO field to the correct sample in the FORMAT field.
For these duplicate lines - caused by an identical Mendelian inconsistency identified within more than one trio of the same family - we nullify the Bayes Factor value within the INFO field ("."). This is so that the Bayes Factor is not associated with the incorrect sample. These variants are also flagged with the multidenovo_filter in the INFO field (0: multidenovo; 1: not multidenovo). In the annotated multi-sample VCFs, we also remove redundant duplicate variant lines that are written by the bayesiandenovofilter.py script.
Note that simple trio families are not affected by this complication.
The Bayes Factor for the families with nested trios can be manually calculated easily using the log in the Platypus bayesiandenovofilter.py script (lines 474-497) which uses the GL field from the FORMAT attribute (Genotype log10-likelihoods for AA,AB and BB genotypes, where A = reference allele and B = alternative allele). As stated, the Bayes Factor is not used as a filter in the DNV dataset.
In the original Platypus multi-sample VCF called by family (Step 3), variants on the X-chromosome are called as diploid (0/0, 0/1, 1,1) across the entire chromosome - regardless of if the variant is in the PAR or non-PAR. The bayesianDeNovoFilter.py python script then recodes these genotypes as haploid (0, 1) if:
This logic can can be seen in lines 187-213. If these criteria are met, then the variant is translated from diploid (0/1, 1,1) to haploid (1). If these criteria are not met, the variant is translated in from diploid (0/1, 1,1) to haploid (0). Note that the PARs are not differentiated in the bayesianDeNovoFilter.py python script (PARs for males on the X-chromosome should be diploid). Genotypes are only recoded for males on the X-chromosome. This explains the logic used in the zygosity_filter for males on the X-chromosome in the PAR and non-PAR.
The Platypus bayesiandenovofilter.py script is not fully configured to discriminate variants in the PAR and non-PAR. In the script, the ploidy is set in lines 47-69. As is shown, for males on the X-chromosome, all variants are treated as haploid. Please consider this when attempting to analyse variants on the X-chromosome PARs for males.