We flag variants that meet our quality thresholds as PASS (see Site QC metrics). Site quality control (QC) is applied as a flag to the aggV2 data across all autosomes and chromosome X, with passing variants listed as 'PASS' in the FILTER field of the VCF. All variants are listed in the aggregate files provided, regardless of FILTER status. Variants failing any of our metrics have an associated flag in the FILTER field indicating the filter(s) that they failed (separated by colons). The actual values for each metric, together with some additional ones, are stored in the INFO field of the respective variant.
It is important to note that multi-allelic variants are decomposed using vt, as described here. This means that multi-allelic SNPs will have a biallelic representation, and that the corresponding site metrics are allele (or row) specific. All numbers presented here therefore describe biallelic representations of variants.
Across the autosomes, our dataset comprises 722,342,407 variants, of which 540,098,760 (74.8%) pass our site QC metrics.
2. Site QC of the autosomes
The flags are presented within the FILTER column of the multi-sample VCF files and the annotation files as follows:
|PASS||All filters passed|
Missingness (fully missing genotypes with DP=0) ≤ 5%
Median Depth ≥ 10
|GQ||Median GQ ≥ 15|
|ABratio||Percentage of het calls not showing significant allele imbalance for reads supporting the ref and alt alleles ≥ 25%|
|completeGTRatio||Percentage of complete sites (sites with no missing data) ≥ 50%|
|phwe_eur||mid p-value for deviations from HWE in unrelated samples of inferred European ancestry ≥ 1e-5|
Sites failing any of the above criteria will have the failing criterion/criteria listed in the FILTER field in place of a 'PASS' flag.
For more detail on how the metrics were calculated, please see the INFO field information below.
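As an illustration of how the thresholds above combine into a FILTER value, here is a minimal Python sketch. It is not the pipeline's code. The dictionary keys and the flag names `missingness` and `depth` are hypothetical (the table above does not show flag names for those two rows), and percentages are assumed to be expressed as fractions between 0 and 1.

```python
# Minimal sketch: derive a FILTER string from per-site metrics.
# Thresholds come from the table above; all key names, and the flag
# names "missingness" and "depth", are assumptions for illustration.

def site_filter(metrics):
    """Return 'PASS' or the colon-separated list of failed flags."""
    checks = [
        ("missingness", metrics["missingness"] <= 0.05),
        ("depth", metrics["median_depth"] >= 10),
        ("GQ", metrics["median_gq"] >= 15),
        ("ABratio", metrics["ab_ratio"] >= 0.25),
        ("completeGTRatio", metrics["complete_gt_ratio"] >= 0.50),
        ("phwe_eur", metrics["phwe_eur"] >= 1e-5),
    ]
    failed = [flag for flag, passed in checks if not passed]
    # Failing flags are joined with colons, as described for the FILTER field.
    return "PASS" if not failed else ":".join(failed)


good_site = {
    "missingness": 0.01,
    "median_depth": 32,
    "median_gq": 48,
    "ab_ratio": 0.97,
    "complete_gt_ratio": 0.99,
    "phwe_eur": 0.42,
}
# Same site, but with low depth and a strong deviation from HWE.
bad_site = dict(good_site, median_depth=8, phwe_eur=1e-9)

print(site_filter(good_site))
print(site_filter(bad_site))
```

Each criterion is evaluated independently, so a site can carry several flags at once.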
2.1. Schematic of the site QC pipeline
Below is a schematic of the site QC as part of the aggregation process. This serves as an overview of the different processes in site QC. All QC metrics and INFO field metrics are enumerated, and a broad overview of the process used to infer relatedness and ancestry is also shown.
2.2. Site QC statistics
The overall number of variants and PASS variants across autosomes is shown below.
The data are presented in 5 different splits defined as:
- All - all SNPs and INDELs (decomposed into their biallelic representation)
- Biallelic indels - Insertions or deletions where there is only one ALT allele in our dataset
- All indels - Insertions or deletions with one or more ALT alleles (decomposed into their biallelic representation)
- Biallelic SNPs - SNPs with only one ALT allele in our dataset
- All SNPs - SNPs with one or more ALT alleles (decomposed into their biallelic representation)
|Data||N||of which PASS||% (PASS)|
The reduced rates of PASS variants across all SNPs and indels are, to an extent, an outcome of the decomposition of multi-allelic variants into their biallelic representations. Full descriptions of site representations can be found here.
2.3. Chromosome specific pass rates
Below is a plot showing the percentage of PASS variants per chromosome (autosomes only). Variants are split as described here.
3. INFO field
Per variant quality metrics were calculated and populated in the INFO field of the multi-sample VCF files and the annotation files. The INFO tags with descriptions are as enumerated in the table below.
|Metric TAG||INFO description||Used for FILTER field calculation||Further Description|
Median depth (taken from the DP FORMAT field) of all samples
Median depth (taken from the DP FORMAT field) from samples with non-missing genotypes
The median depth was calculated from GTs in which partially or fully missing genotypes were filtered out. This is included as an INFO field metric only. Values are capped at 99.
bcftools query $infile -e 'GT~"\."' [...]
Median genotype quality (taken from the GQ FORMAT field) from samples with non-missing genotypes
The median GQ was calculated from GTs in which partially or fully missing genotypes were filtered out. Values are capped at 99.
bcftools query $infile -e 'GT~"\."' [...]
Ratio of fully missing genotypes (GT = './.' and DP = 0)
|Y||See here for more information|
|completeSites||The ratio of complete GTs/total number of samples||Y||As we decompose multi-allelic variants into biallelic representations, minor allele genotypes may be largely composed of half missing genotypes ('./1|./0|1/.|0/.'). The complete GT ratio informs you of how many of the genotypes for the allele are not half-missing. We use a cut-off of 0.5, indicating that at least half of the genotypes are complete.|
|ABratio||For each het call, a binomial test is conducted for reads supporting the ref and alt alleles. AB ratio is the hets showing imbalance (p<0.01) divided by the total number of hets.||Y||The AB ratio measures the evidence that the heterozygous calls at a site are correct. For each heterozygous genotype, we test the distribution of reads supporting the ref and alt alleles with a binomial test, applying a stringent threshold of p-value >0.01. The ratio is the number of heterozygous calls passing this test divided by the total number of heterozygous calls for the variant.|
|MendelSite||Number of Mendelian errors at this site from confirmed trios only||N||Site wide Mendelian errors are given as an info field metric. These are calculated using confirmed trios, with trios harbouring excess family-wise Mendelian errors filtered out. Full information on how Mendelian errors are calculated can be found here.|
|phwe_afr||HWE mid p-value in inferred unrelated inferred afr superpop||N||Hardy-Weinberg equilibrium (HWE) mid p-values are calculated separately for the inferred unrelated samples within each inferred super-population (based on the 1000 Genomes), with a threshold of >1e-5. Because inferred Europeans make up such a large share of our dataset, we use the European value (phwe_eur) for the FILTER field. Anyone studying other super-populations may wish to use the p-value for the relevant super-population instead.|
|phwe_amr||HWE mid p-value in inferred unrelated inferred amr superpop||N|
|phwe_eas||HWE mid p-value in inferred unrelated inferred eas superpop||N|
|phwe_eur||HWE mid p-value in inferred unrelated inferred eur superpop||Y|
|phwe_sas||HWE mid p-value in inferred unrelated inferred sas superpop||N|
|AN||Total number of alleles in called genotypes||N||These values are all calculated using the BCFtools plugin fill-tags|
|AC||Allele count in genotypes||N|
|AC_Hom||Allele counts in homozygous genotypes||N|
|AC_Het||Allele counts in heterozygous genotypes||N|
|AC_Hemi||Allele counts in hemizygous genotypes||N|
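The ABratio calculation described in the table above can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the source does not say which form of the binomial test was used, so the sketch assumes an exact two-sided test against an expected allele fraction of 0.5, and the function names are hypothetical. Following the "Further Description" column, the ratio counts heterozygous calls that pass the test (p > 0.01).

```python
from math import comb


def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value (assumed test form).

    Sums the probability of every outcome that is no more likely than
    the observed count k. Intended for ordinary read depths (a few
    hundred reads); very large n would overflow float conversion.
    """
    if n == 0:
        return 1.0
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= cutoff))


def ab_ratio(het_read_counts, alpha=0.01):
    """Fraction of het calls with no significant ref/alt read imbalance.

    het_read_counts: list of (ref_reads, alt_reads), one per het call.
    Returns None when the site has no heterozygous calls.
    """
    if not het_read_counts:
        return None
    balanced = sum(
        1
        for ref, alt in het_read_counts
        if binom_two_sided_p(alt, ref + alt) > alpha
    )
    return balanced / len(het_read_counts)


# Two balanced hets (10/10 and 12/8 reads) and two skewed hets (20/0 and 30/2).
example_hets = [(10, 10), (20, 0), (12, 8), (30, 2)]
print(ab_ratio(example_hets))
```

With this definition, a site passes the ABratio criterion when the returned value is at least 0.25.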
3.1. Missingness and completeness
Due to the decomposition of multi-allelic variants with vt and the resulting variant representation, some of the samples will have partially or completely missing genotype data (e.g. ".", "0/." or "./.") without data being truly missing for that sample at that locus. For instance, a sample with genotype TT for the multi-allelic variant A/C/T will have a missing ("./.") genotype for the A/C variant in its biallelic representation, but will have 1/1 for the A/T biallelic representation. Both of these representations will be present in the final aggV2 file, in separate rows. To distinguish this from a truly missing genotype, we have introduced the concept of completeness. Missingness counts truly missing sites (with a depth of 0), while completeness indicates the percentage of samples with complete (0/0, 0/1, 1/1) genotype data for that variant. In the example above, the sample with genotype TT will not be counted as missing, but it will not be counted as complete either. Low completeness combined with low missingness will hence often indicate a variant where the respective alt allele is rare, in a decomposed multi-allelic site.
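The distinction between the two metrics can be sketched in Python as below. This is a minimal illustration rather than the pipeline's code; the function name and input layout are hypothetical. A genotype counts as missing only when every allele is "." and DP is 0, and as complete only when no allele is ".".

```python
import re


def missingness_and_completeness(calls):
    """Return (missingness, completeness) for one decomposed variant row.

    calls: list of (GT string, DP) pairs, one per sample.
    """
    n_samples = len(calls)
    fully_missing = 0
    complete = 0
    for gt, dp in calls:
        alleles = re.split(r"[/|]", gt)  # handles unphased and phased GTs
        if all(a == "." for a in alleles) and dp == 0:
            fully_missing += 1  # truly missing: no call and no reads
        elif "." not in alleles:
            complete += 1  # e.g. 0/0, 0/1, 1/1
        # Anything else (e.g. "./." with reads, "0/.", "./1") is neither:
        # the sample carries an allele represented on another row.
    return fully_missing / n_samples, complete / n_samples


calls = [
    ("0/0", 30), ("0/0", 28), ("0/1", 33), ("0/1", 29), ("1/1", 31), ("0/0", 35),
    ("./.", 0),   # truly missing
    ("./.", 32),  # e.g. the TT sample on the A/C row
    ("0/.", 27),  # half-missing after decomposition
    ("./1", 30),  # half-missing after decomposition
]
print(missingness_and_completeness(calls))
```

In this made-up example, only one of the ten samples is truly missing, while four are not complete, which is the pattern expected for a rare alt allele at a decomposed multi-allelic site.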
4. Site QC of the X chromosome
Sex chromosome QC was handled in a similar manner to autosomal QC; however, input files were split into male- and female-specific subsets, which were analysed separately. This means that a PASS variant on chromosome X passes the same thresholds as any autosomal variant. Sex was determined on the basis of the Illumina ploidy data, with samples with non-ambiguous XX and XY calls used to create the female and male subsets respectively. The files containing these data are available at:
Site QC was run using data on:
- 40,653 XX females
- 35,822 XY males
Samples of ambiguous ploidy were not included for the site QC calculations (although their data are still available within the aggregate).
The pass metrics were the same as for the autosomes for both male and female subsets in pseudo-autosomal regions (PAR). Non-PAR and PAR region site metric cut-offs are displayed in the table below:
|Mendel Errors||Same as autosome (INFO)||Same as autosome (INFO)||Same as autosome (INFO)||Same as autosome (INFO)|
|pHWE||Same as autosome||NA||Same as autosome||NA|
Chrom X PASS variants are assessed according to their values in female samples only.
If you wish to select variants that are PASS in both males and females, use the -i flag in BCFtools to additionally require that INFO/FILTER_m is PASS_m.
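For readers working with plain-text VCF records outside BCFtools, the same selection can be sketched in Python. This is a minimal illustration: the function name and the example records are made up, and any FILTER_m value other than PASS_m is treated as a failure in males.

```python
def passes_in_both_sexes(vcf_line):
    """True if FILTER is PASS (females) and INFO/FILTER_m is PASS_m (males)."""
    fields = vcf_line.rstrip("\n").split("\t")
    filter_col, info_col = fields[6], fields[7]
    # Parse INFO into a dict; flag-style entries without '=' map to True.
    info = dict(
        item.split("=", 1) if "=" in item else (item, True)
        for item in info_col.split(";")
    )
    return filter_col == "PASS" and info.get("FILTER_m") == "PASS_m"


# Hypothetical chromosome X records (tab-separated, fixed columns only).
both_pass = "chrX\t100\t.\tA\tG\t.\tPASS\tFILTER_m=PASS_m;AC=3"
male_fail = "chrX\t200\t.\tC\tT\t.\tPASS\tFILTER_m=example_fail;AC=1"
female_fail = "chrX\t300\t.\tG\tA\t.\tGQ\tFILTER_m=PASS_m;AC=2"

for record in (both_pass, male_fail, female_fail):
    print(record.split("\t")[1], passes_in_both_sexes(record))
```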
INFO field data are presented in the same way as for the autosomes. Metrics with a '_m' suffix refer to males, and metrics with no suffix refer to female values. The exception to this is the AC, AN, AC_Hom, AC_Het, and AC_Hemi tags, which are presented with a '_f' suffix for XX females, a '_m' suffix for XY males, and no suffix for all data.
|Variant type||Sex||N total||N pass||% pass|
5. Mendelian Inconsistencies
The availability of family data in the Genomics England dataset allows us to calculate variant-level Mendelian inconsistencies as an additional metric that can be used for QC purposes. We used over 10,000 trios to calculate Mendel inconsistencies as follows:
- Trios were defined from extended family structures such that:
- Cases of suspected uniparental disomy were filtered out
- Each individual is only present in a single trio
- All members of the trio were consented for data release V9 (not withdrawn)
- Where multiple trios were present in a family (e.g. Mother, Father, Proband, and Mother, Father, Sibling), the trio containing the proband was kept
- Family-wide Mendelian inconsistency rates were calculated across the defined trios
- Families falling outside of the acceptable range of family-wide Mendel errors (mean, +/- 4 standard deviations) were filtered out
- Site specific Mendelian inconsistencies were then calculated across all trios not filtered out in the previous step.
Site specific Mendel inconsistency counts are provided in the INFO field.
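The last three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the method as implemented: it assumes autosomal, diploid, fully called genotypes given as pairs of allele indices, uses the sample standard deviation (the source does not say which estimator was used), and all function and variable names are hypothetical.

```python
from statistics import mean, stdev


def is_mendelian_error(father, mother, child):
    """True if the child's genotype cannot be formed from one allele per parent."""
    a, b = child
    consistent = (a in father and b in mother) or (b in father and a in mother)
    return not consistent


def families_within_range(errors_per_family, n_sd=4):
    """Keep families whose family-wide error count lies within mean +/- n_sd SD."""
    values = list(errors_per_family.values())
    centre, spread = mean(values), stdev(values)
    low, high = centre - n_sd * spread, centre + n_sd * spread
    return {fam for fam, n in errors_per_family.items() if low <= n <= high}


def site_error_count(trio_genotypes, kept_families):
    """Count Mendelian errors at one site across the retained trios.

    trio_genotypes: {family_id: (father_gt, mother_gt, child_gt)}
    """
    return sum(
        1
        for fam, (father, mother, child) in trio_genotypes.items()
        if fam in kept_families and is_mendelian_error(father, mother, child)
    )


# Made-up family-wide error counts: 30 typical families and one outlier.
errors_per_family = {f"fam{i}": 10 for i in range(30)}
errors_per_family["fam_outlier"] = 1000
kept = families_within_range(errors_per_family)

site = {
    "fam0": ((0, 0), (0, 0), (0, 1)),         # error: alt allele from neither parent
    "fam1": ((0, 1), (0, 0), (0, 1)),         # consistent
    "fam_outlier": ((1, 1), (0, 0), (1, 1)),  # error, but family was filtered out
}
print(site_error_count(site, kept))
```

Filtering families before counting keeps trios with pervasive inconsistencies (for example, from sample mix-ups) from inflating the per-site counts.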
6. Help & Support
Help with aggV2