


1. Overview

We provide a set of PASS variants that meet defined quality thresholds (see Site QC metrics). Site quality control (QC) is applied as a flag to the aggV2 data across all autosomes and chromosome X, with passing variants marked 'PASS' in the FILTER field of the VCF. All variants are included in the aggregate files provided, regardless of FILTER status. A variant failing any of our metrics carries the name(s) of the failed filter(s) in the FILTER field, separated by colons. The actual values for each metric, together with some additional ones, are stored in the INFO field of the respective variant.

It is important to note that multi-allelic variants are decomposed using vt, as described here. This means that multi-allelic SNPs will have a biallelic representation, and that the corresponding site metrics are allele (or row) specific. All numbers presented below therefore describe biallelic representations of variants.

Across the autosomes, our dataset comprises 722,342,407 variants, of which 540,098,760 (74.8%) pass our site QC metrics.

2. Site QC of the autosomes

The flags are presented within the FILTER column of the multi-sample VCF files and the annotation files as follows:

| FILTER TAG | Description |
|---|---|
| PASS | All filters passed |
| missingness | Missingness (fully missing genotypes with DP=0) ≤ 5% |
| depth | Median Depth ≥ 10 |
| GQ | Median GQ ≥ 15 |
| ABratio | Percentage of het calls not showing significant allele imbalance for reads supporting the ref and alt alleles ≥ 25% |
| completeGTRatio | Percentage of complete sites (sites with no missing data) ≥ 50% |
| phwe_eur | mid p-value for deviations from HWE in unrelated samples of inferred European ancestry ≥ 1e-5 |

Sites failing any of the above criteria will have the failing criterion/criteria listed in the FILTER field in place of a 'PASS' flag.
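
As a rough illustration of how the criteria above combine into a FILTER value, the Python sketch below applies each threshold to a variant's metric values and joins the failed tags with colons. The pairing of FILTER tags with INFO tags (for example depth with medianDepthAll) is inferred from the "Used for FILTER field calculation" column of the INFO table and is an assumption, as are the function name and the example values. It is not the pipeline's actual code.

```python
# Each rule: (FILTER tag, INFO tag assumed to feed it, pass condition).
# Thresholds are copied from the FILTER table; ratios are expressed as fractions.
RULES = [
    ("missingness", "missingness", lambda v: v <= 0.05),
    ("depth", "medianDepthAll", lambda v: v >= 10),
    ("GQ", "medianGQ", lambda v: v >= 15),
    ("ABratio", "ABratio", lambda v: v >= 0.25),
    ("completeGTRatio", "completeSites", lambda v: v >= 0.5),
    ("phwe_eur", "phwe_eur", lambda v: v >= 1e-5),
]


def filter_field(info, sep=":"):
    """Return 'PASS', or the failed FILTER tags joined by `sep`."""
    failed = [tag for tag, key, passes in RULES if not passes(info[key])]
    return sep.join(failed) if failed else "PASS"


# Hypothetical metric values for two variants.
good = {"missingness": 0.01, "medianDepthAll": 30, "medianGQ": 40,
        "ABratio": 0.9, "completeSites": 0.98, "phwe_eur": 0.4}
bad = dict(good, medianDepthAll=8, phwe_eur=1e-9)

print(filter_field(good))  # PASS
print(filter_field(bad))   # depth:phwe_eur
```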

For more detail on how the metrics were calculated, please see the INFO field information below.

2.1. Schematic of the site QC pipeline

Below is a schematic of the site QC as part of the aggregation process. This serves as an overview of the different processes in site QC. All QC metrics and INFO field metrics are enumerated, and a broad overview of the process used to infer relatedness and ancestry is also shown.

2.2. Site QC statistics

The overall number of variants and PASS variants across autosomes is shown below. 

The data are presented in 5 different splits defined as:

  1. All - all SNPs and INDELs (decomposed into their biallelic representation)
  2. Biallelic indels - Insertions or deletions where there is only one ALT allele in our dataset
  3. All indels - Insertions or deletions with one or more ALT alleles (decomposed into their biallelic representation)
  4. Biallelic SNPs - SNPs with only one ALT allele in our dataset
  5. All SNPs - SNPs with one or more ALT alleles (decomposed into their biallelic representation)
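
| Split | Data | of which PASS | % (PASS) |
|---|---|---|---|
| All | 722,342,407 | 540,098,760 | 74.8 |
| Biallelic Indels | 37,845,656 | 33,101,705 | 87.5 |
| All Indels | 91,374,494 | 62,951,477 | 68.9 |
| Biallelic SNPs | 400,287,362 | 382,769,040 | 95.6 |
| All SNPs | 630,967,910 | 477,965,057 | 75.8 |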

The reduced rates of PASS variants across all SNPs and Indels are, to an extent, an outcome of the decomposition of multi-allelic variants into their biallelic representations. Full descriptions of site representations can be found here.
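
The percentages in the table above can be reproduced from the counts with a few lines of Python. This is only an arithmetic check. The dictionary layout and function name are illustrative.

```python
# (total variants, PASS variants) per split, copied from the table above.
counts = {
    "All": (722_342_407, 540_098_760),
    "Biallelic Indels": (37_845_656, 33_101_705),
    "All Indels": (91_374_494, 62_951_477),
    "Biallelic SNPs": (400_287_362, 382_769_040),
    "All SNPs": (630_967_910, 477_965_057),
}


def pass_rate(total, n_pass):
    """Percentage of PASS variants, rounded to one decimal place."""
    return round(100 * n_pass / total, 1)


rates = {split: pass_rate(*pair) for split, pair in counts.items()}
for split, rate in rates.items():
    print(f"{split}: {rate}%")
```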

2.3. Chromosome specific pass rates


Below is a plot showing the percentage of PASS variants per chromosome (autosomes only). Variants are split as described here.




3. INFO field

Per-variant quality metrics were calculated and populated in the INFO field of the multi-sample VCF files and the annotation files. The INFO tags and their descriptions are enumerated in the table below.

| Metric TAG | INFO description | Used for FILTER field calculation | Further Description |
|---|---|---|---|
| medianDepthAll | Median depth (taken from the DP FORMAT field) of all samples | Y | |
| medianDepthNonMiss | Median depth (taken from the DP FORMAT field) from samples with non-missing genotypes | N | The median depth was calculated from GTs in which partially or fully missing genotypes were filtered out. This is included as an INFO field metric only. Values are capped at 99. Command: bcftools query $infile  -e 'GT~"\."' [...] |
| medianGQ | Median genotype quality (taken from the GQ FORMAT field) from samples with non-missing genotypes | Y | The median GQ was calculated from GTs in which partially or fully missing genotypes were filtered out. Values are capped at 99. Command: bcftools query $infile  -e 'GT~"\."' [...] |
| missingness | Ratio of fully missing genotypes (GT = './.' and DP = 0) | Y | See here for more information |
| completeSites | The ratio of complete GTs/total number of samples | Y | As we decompose multi-allelic variants into biallelic representations, minor allele genotypes may be largely composed of half-missing genotypes ('./1', './0', '1/.', '0/.'). The complete GT ratio indicates how many of the genotypes for the allele are not half-missing. We use a cut-off of 0.5, meaning that at least half of the genotypes must be complete. |
| ABratio | For each het call, a binomial test is conducted for reads supporting the ref and alt alleles. AB ratio is the number of hets not showing significant imbalance (p > 0.01) divided by the total number of hets. | Y | The AB ratio is a measure of the evidence supporting whether a heterozygous call is correct or not. This is achieved by testing the distribution of reads supporting ref and alt alleles for each genotype. We apply a binomial test and treat a het call as passing when its p-value is > 0.01. The ratio is the number of heterozygous calls passing this test divided by the total number of heterozygous calls for a variant. |
| MendelSite | Number of Mendelian errors at this site from confirmed trios only | N | Site-wide Mendelian errors are given as an INFO field metric. These are calculated using confirmed trios, with trios harbouring excess family-wise Mendelian errors filtered out. Full information on how Mendelian errors are calculated can be found here. |
| phwe_afr | HWE mid p-value in inferred unrelated inferred afr superpop | N | Hardy-Weinberg equilibrium (HWE) mid p-values are calculated separately for each inferred super-population (based on the 1000 Genomes Project), using only samples inferred to be unrelated, with a threshold of >1e-5. Due to the large number of inferred Europeans within our dataset, we use phwe_eur for the FILTER field. Anyone studying other super-populations may wish to use the p-value of the relevant super-population. |
| phwe_amr | HWE mid p-value in inferred unrelated inferred amr superpop | N | As for phwe_afr |
| phwe_eas | HWE mid p-value in inferred unrelated inferred eas superpop | N | As for phwe_afr |
| phwe_eur | HWE mid p-value in inferred unrelated inferred eur superpop | Y | As for phwe_afr |
| phwe_sas | HWE mid p-value in inferred unrelated inferred sas superpop | N | As for phwe_afr |
| AN | Total number of alleles in called genotypes | N | These values are all calculated using the BCFtools plugin fill-tags |
| AC | Allele count in genotypes | N | As for AN |
| AC_Hom | Allele counts in homozygous genotypes | N | As for AN |
| AC_Het | Allele counts in heterozygous genotypes | N | As for AN |
| AC_Hemi | Allele counts in hemizygous genotypes | N | As for AN |
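
To make two of the statistical metrics in the table more concrete, the Python sketch below computes an AB ratio and an HWE mid p-value from toy data. Several details are assumptions rather than statements about the pipeline: the binomial test is taken to be two-sided with a null ref/alt proportion of 0.5, the AB ratio is taken to be the fraction of het calls that are not significantly imbalanced (p > 0.01), and the mid p-value is taken to be half the probability of the observed sample plus the probabilities of all less likely samples. Function names and example counts are illustrative.

```python
from math import comb, exp, lgamma, log


def binom_two_sided_p(ref_reads, alt_reads, p=0.5):
    """Exact two-sided binomial p-value for the observed ref/alt read split."""
    n = ref_reads + alt_reads
    probs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = probs[alt_reads]
    # Sum outcomes no more likely than the observed one (small tolerance for ties).
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


def ab_ratio(het_read_counts, alpha=0.01):
    """Fraction of het calls whose read split is not significantly imbalanced."""
    balanced = sum(binom_two_sided_p(r, a) > alpha for r, a in het_read_counts)
    return balanced / len(het_read_counts)


def hwe_mid_p(n_aa, n_ab, n_bb):
    """Exact HWE test with a mid p-value, conditioning on the allele counts."""
    n = n_aa + n_ab + n_bb
    n_a = 2 * n_aa + n_ab
    n_b = 2 * n - n_a

    def log_prob(h):  # log P(h heterozygotes | n, n_a)
        aa, bb = (n_a - h) // 2, (n_b - h) // 2
        return (lgamma(n + 1) - lgamma(aa + 1) - lgamma(h + 1) - lgamma(bb + 1)
                + h * log(2) + lgamma(n_a + 1) + lgamma(n_b + 1)
                - lgamma(2 * n + 1))

    # The het count must share the parity of n_a and cannot exceed the rarer allele.
    probs = {h: exp(log_prob(h)) for h in range(n_a % 2, min(n_a, n_b) + 1, 2)}
    observed = probs[n_ab]
    less_likely = sum(q for q in probs.values() if q < observed)
    return min(1.0, less_likely + 0.5 * observed)


# (ref reads, alt reads) for four hypothetical het calls; only (29, 1) is imbalanced.
hets = [(15, 15), (20, 10), (29, 1), (12, 18)]
print(ab_ratio(hets))                  # 0.75
print(hwe_mid_p(25, 50, 25) >= 1e-5)   # True: genotype counts match HWE
print(hwe_mid_p(50, 0, 50) >= 1e-5)    # False: no heterozygotes at all
```

With the thresholds from the FILTER table, the first example would pass ABratio (0.75 ≥ 0.25), and only the second set of genotype counts would fail the HWE criterion.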

3.1. Missingness and completeness

Due to the decomposition of multi-allelic variants with vt and the resulting variant representation, some samples will have partially or completely missing genotype data (e.g. ".", "0/." or "./.") without data being truly missing for that sample at that locus. For instance, a sample with genotype TT for the multi-allelic variant A/C/T will have a missing ("./.") genotype for the A/C variant in its biallelic representation, but will have 1/1 for the A/T biallelic representation. Both of these representations will be present in the final aggV2 file, in separate rows.

To distinguish this from a truly missing genotype, we have introduced the concept of completeness. Missingness counts truly missing genotypes (with a depth of 0), while completeness indicates the percentage of samples with complete (0/0, 0/1, 1/1) genotype data for that variant. The TT sample in the example above will not be counted as missing in the A/C row, but it will not be counted as complete either. Low completeness combined with low missingness will hence often indicate a variant whose alt allele is rare within a decomposed multi-allelic site.
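
The distinction between missingness and completeness can be sketched in Python as follows. The category name "partial", the function names, and the toy genotypes are illustrative assumptions. The rules themselves follow the description above: a genotype is missing only when it is fully missing with DP = 0, and complete only when no allele is missing.

```python
def classify(gt, dp):
    """Classify one genotype call as 'complete', 'missing' or 'partial'."""
    alleles = gt.replace("|", "/").split("/")
    if "." not in alleles:
        return "complete"      # e.g. 0/0, 0/1, 1/1
    if all(a == "." for a in alleles) and dp == 0:
        return "missing"       # truly missing: no reads at the locus
    return "partial"           # e.g. 0/., or ./. with reads present


def site_ratios(calls):
    """calls: list of (GT, DP) tuples, one per sample, for one VCF row."""
    kinds = [classify(gt, dp) for gt, dp in calls]
    n = len(kinds)
    return {
        "missingness": kinds.count("missing") / n,
        "completeSites": kinds.count("complete") / n,
    }


# A/C row of a decomposed A/C/T site: the third sample is the TT individual
# (reads present but './.' in this row), the fourth has no coverage at all.
calls = [("0/0", 30), ("0/1", 28), ("./.", 31), ("./.", 0)]
print(site_ratios(calls))  # {'missingness': 0.25, 'completeSites': 0.5}
```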

4. Site QC of the X chromosome


Sex chromosome QC was handled in a similar manner to autosomal QC; however, input files were split into male-specific and female-specific subsets, which were analysed separately. This means that a PASS variant on chromosome X passes the same thresholds as any autosomal variant. Sex was determined on the basis of the Illumina ploidy data, with samples carrying non-ambiguous XX and XY calls used to create the female and male subsets respectively. The files containing these data are available at:

/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/sample_sex/


Site QC was run using data on:

  • 40,653 XX females
  • 35,822 XY males

Samples of ambiguous ploidy were not included for the site QC calculations (although their data are still available within the aggregate).

The pass metrics were the same as for the autosomes for both male and female subsets in pseudo-autosomal regions (PAR). Non-PAR and PAR region site metric cut-offs are displayed in the table below:


| Metric | Non-PAR Females | Non-PAR Males | PAR Females | PAR Males |
|---|---|---|---|---|
| Median depth | >10 | >5 | >10 | >10 |
| Median GQ | >15 | >15 | >15 | >15 |
| Percent missing | <5% | <5% | <5% | <5% |
| AB Ratio | >0.25 | NA | >0.25 | NA |
| Mendel Errors | Same as autosome (INFO) | Same as autosome (INFO) | Same as autosome (INFO) | Same as autosome (INFO) |
| Complete sites | >50% | >50% | >50% | >50% |
| pHWE | Same as autosome | NA | Same as autosome | NA |


Chrom X PASS variants are assessed according to their values in female samples only.

If you wish to filter for PASS variants in both males and females, use the -i flag in BCFtools to select variants for which INFO/FILTER_m is PASS_m.
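
The same selection can be sketched outside BCFtools. The Python example below assumes that the FILTER column holds the female-based result and that the male-based result is stored as FILTER_m in the INFO column, as described above. The two VCF records and the function names are hypothetical.

```python
def parse_info(info):
    """Turn 'k1=v1;flag;k2=v2' into a dict (flags map to True)."""
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def passes_both_sexes(vcf_line):
    """True when FILTER is PASS (females) and INFO/FILTER_m is PASS_m (males)."""
    fields = vcf_line.rstrip("\n").split("\t")
    filter_f, info = fields[6], parse_info(fields[7])
    return filter_f == "PASS" and info.get("FILTER_m") == "PASS_m"


# Hypothetical chrX records: CHROM POS ID REF ALT QUAL FILTER INFO
rec_ok = "chrX\t100\t.\tA\tG\t.\tPASS\tFILTER_m=PASS_m;medianGQ=40"
rec_male_fail = "chrX\t200\t.\tC\tT\t.\tPASS\tFILTER_m=depth_m;medianGQ=35"

print(passes_both_sexes(rec_ok))         # True
print(passes_both_sexes(rec_male_fail))  # False
```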

INFO field data are presented in the same way as for the autosomes. Metrics with a '_m' suffix refer to male values, and metrics with no suffix refer to female values. The exception to this is the AC, AN, AC_Hom, AC_Het, and AC_Hemi tags, which are presented with a '_f' suffix for XX females, a '_m' suffix for XY males, and no suffix for all data.



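| Variant type | Sex | N total | N pass | % pass |
|---|---|---|---|---|
| All | F | 31,443,093 | 22,854,984 | 72.7 |
| Biallelic INDELs | F | 1,496,167 | 1,327,705 | 88.7 |
| All INDELs | F | 3,596,716 | 2,612,302 | 72.6 |
| Biallelic SNPs | F | 17,471,739 | 16,696,725 | 95.6 |
| All SNPs | F | 27,846,377 | 20,242,682 | 72.7 |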


5. Mendelian Inconsistencies

The availability of family data in the Genomics England dataset allows us to calculate variant-level Mendelian inconsistencies as an additional metric that can be used for QC purposes. We used over 10,000 trios to calculate Mendel inconsistencies as follows:

  1. Trios were defined from extended family structures such that:
    1. Cases of suspected uniparental disomy were filtered out 
    2. Each individual is only present in a single trio
    3. All members of the trio were consented for data release V9 (not withdrawn)
    4. Where multiple trios were present in a family (e.g. Mother, Father, Proband, and Mother, Father, Sibling), the trio containing the proband was kept
  2. Family-wide Mendelian inconsistency rates were calculated across the defined trios
  3. Families falling outside the acceptable range of family-wide Mendel errors (mean ± 4 standard deviations) were filtered out
  4. Site-specific Mendelian inconsistencies were then calculated across all trios not filtered out in the previous step

Site-specific Mendel inconsistency counts are provided in the INFO field (MendelSite).
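
Steps 2 to 4 above can be sketched in Python as follows. This is a simplified illustration, not the production code. It assumes biallelic autosomal sites with genotypes coded as ALT allele counts (0, 1, 2), uses the population standard deviation for the 4 SD cut-off (the source does not say which estimator was used), and does not cover trio definition (step 1). All names and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def is_mendel_error(child, mother, father):
    """True if the child's genotype cannot arise from the parents' genotypes."""
    transmits = {0: {0}, 1: {0, 1}, 2: {1}}  # ALT alleles a parent can pass on
    possible = {m + f for m in transmits[mother] for f in transmits[father]}
    return child not in possible


def site_mendel_counts(trios, n_sd=4):
    """trios: {family_id: [(child, mother, father), ...]}, one tuple per site."""
    errors = {fam: [is_mendel_error(*gt) for gt in sites]
              for fam, sites in trios.items()}
    # Step 2: family-wide inconsistency rate.
    rates = {fam: sum(errs) / len(errs) for fam, errs in errors.items()}
    # Step 3: drop families outside mean +/- n_sd standard deviations.
    mu, sd = mean(list(rates.values())), pstdev(list(rates.values()))
    kept = [fam for fam, rate in rates.items() if abs(rate - mu) <= n_sd * sd]
    # Step 4: per-site error counts across the remaining trios.
    n_sites = len(next(iter(errors.values())))
    return [sum(errors[fam][i] for fam in kept) for i in range(n_sites)]


# Toy data: 28 clean trios, one trio with a single error, one gross outlier.
trios = {f"fam{i}": [(0, 0, 0)] * 4 for i in range(28)}
trios["fam_mild"] = [(0, 0, 0), (0, 0, 0), (2, 0, 1), (0, 0, 0)]
trios["fam_outlier"] = [(1, 0, 0)] * 4

counts = site_mendel_counts(trios)
print(counts)  # [0, 0, 1, 0]
```

In this toy example the outlier family is removed before the site-level counts are taken, so only the single error from the retained family contributes.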


6. Help & Support

Help with aggV2

Please reach out via the Genomics England Service Desk for any issues related to the aggV2 aggregation or companion datasets, including "aggV2" in the title / description of your inquiry.