Our documentation pages have moved. Please go to https://research-help.genomicsengland.co.uk/ to navigate to the most up-to-date version of this page.
aggV1 will be deprecated by aggV2. Unless an analysis using aggV1 is ongoing, aggV2 should be used in its place by all researchers. See here Aggregated Variant Calls (aggV2) for documentation.
The aggregate_gvcf_sample_stats LabKey table from V10 onwards (including V10) points to the data in aggV2
We have generated multi-sample gVCF files containing germline variant calls from 59,464 participants from Main Programme Release 5.1 (20/11/2018). The aggregated files contain samples from both the rare disease and the cancer programs. Only genomes on build GRCh38 are included. Due to its large size, the aggregated dataset is split into 944 pieces (which we call chunks) for easier handling.
Documentation detailing the aggregation can is attached here: Aggregated_gVCF_dataset_v1_2019_02_28_README.pdf
The aggregated dataset alongside with a more detailed description is stored here: /gel_data_resources/main_programme/aggregated_illumina_gvcf/GRCH38/20190228/
All included samples have passed this set of basic QC metrics:
- Cross-contamination < 5%
- Mapping rate > 75%
- Mean sample coverage > 20X
- Insert size < 250
These QC metrics are given in the LabKey table aggregate_gvcf_sample_stats in the latest labkey release.
This aggregate dataset contains information on a subset of participants who have since been withdrawn from research. These cannot be used in any new analyses. Thus, it is extremely important to remove these samples from your analyses.
For the main programme version 8 data release, the list of consented samples are detailed in the file main_programme_v8_samples.txt, located in the folder /gel_data_resources/main_programme/aggregated_illumina_gvcf/GRCH38/20190228/docs/
They can also be found in the 'aggregate_gvcf_sample_stats' table in the labkey, for the latest data release.
To filter the aggregate to these samples, all bcftools commands should include the flag -S main_programme_v8_samples.txt
Submit a ticket to the Genomics England Service desk if you are unsure of how to filter the dataset for any other use.
Sample quality control
An overview of the initial QC is given below. This QC was conducted prior to the data aggregation. Individual sample QC data was retrieved from the Genomics England Catalog Database. Most metrics are BAM file statistics provided from Illumina or from the Genomics England bioinformatic pipeline. NB: Genetic vs Reported sex checking is done as part of the Genomics England Bioinformatics Interpretation Pipeline and has been completed in approximately 50% of the aggregated data set so far.
Number of failed samples
Insert size medium (bp)
Chimeric DNA fragments (%)
Genetic vs reported sex checks
Genome-wide coverage mean
Sample swaps (tumour)
Samples failed in 30k merge
Samples failed 60k merge
gVCF aggregation process
Individual gVCF files were firstly broken down into 355 chunks for easier handling due to large file sizes. To cut the genome into smaller regions we used a file made available by the Broad Institute; where region boundaries are defined by a high fraction of missing base-calls (N). This way there is a low risk of losing indels that could span two regions in the aggregation process.
We use Illumina’s agg software to produce multi-sample VCFs (https://github.com/Illumina/agg). agg handles multi-allelic nucleotide polymorphisms (MNPs) in the following manner:
“We decompose MNPs (Multi-allelic Nucleotide polymorphisms) and perform basic left-shifting/trimming of indels (taken from `bcftools norm` implementation). We apply the complex substitution decomposition routine implemented in vt (http://genome.sph.umich.edu/wiki/Vt), but as noted on their webpage, there is no unique solution for this problem. Multi-allelic Variants are written as one allele per line in the VCF file.”
The dataset contains 677,003,512 variants (SNPs and short INDELs) for 59,464 participants.
We calculate the following QC metrics for each variant and them as an INFO field for each site:
- INFO/missing: Missingness per site (using VCFtools version 0.1.15)
- INFO/meanDP: Mean coverage per site (using VCFtools version 0.1.15)
INFO/meanGQ: Mean GQ per site (using BCFtools version 1.9)
INFO/AB_ratio: Allelic imbalance: Fraction of heterozygotes alleles that passed a binomial test for allelic depth; AB-ratio = (the number of sites that passed AB test / the number of heterozygote alleles). The binomial test was calculated in BCFtools version 1.9; Binomial test P-value threshold = 0.01
INFO/inbreed: Inbreeding coefficient: The F-statistic of expected heterozygosity was calculated with the following formula: F=1-(observed frequency(heterozygote)) / expected frequency(heterozygote)). The expected and observed heterozygote frequencies were retrieved with plink1.9.
INFO/MErrRatio: The fraction of genotypes with Mendelian errors. The ratio is calculated as the Mendel Errors/allele count in a filtered subset of trio probands. The number of alleles with Mendel Errors was calculated with plink (PLINK version 1.9; function --mendel summaries-only). The filtered trios are given in the LabKey table aggregate_gvcf_sample_stats. Monomorphic alleles in trio probands were set to missing (assigned a dot), whereas 0 indicates 0 Mendel Errors. We counted the minor alleles for the probands (using VCFtools version 0.1.15).
INFO/MErr: Genotypes with Mendelian errors. This is the same as point 6, but reports the total number of Mendelian errors per site instead of the fraction of Mendelian errors relative to the MAC in a filtered subset of trio probands.
To obtain INFO fields 6, 7 above we only used a subset of the trios from our data, for which the total number of Mendel errors across the family, for a random subset of the genomic chunks, was within 4 standard deviations from the mean number of errors across all trios. The Mendel errors were calculated with PLINK (version 1.9; function --mendel )
Additionally, we set the VCF FILTER field to PASS if all of the following are true:
Missingness < 5%
Coverage >= 10
Fraction of variants passed allelic imbalance test >= 0.25
Example FILTER fields for sites that fail one of the checks are detailed below:
call-rate if a site has missingness > 5%
coverage if coverage is < 10
GQ if GQ<15
ABratio if < 25% of heterozygote genotypes at a specific site fail the allelic imbalance test
callrate:coverage:GQ:ABratio if a site fails all checks We recommend using only variants that have a PASS filter
No variant QC filters were applied in the dataset, but the VCF filter was set to PASS for variants which passed GQ, DP, missingess, allelic imbalance, and Mendel error filters. We recommend only using variants that have PASS in the filter column in your analyses.
The final gVCF aggregated dataset is located here:
Large chunks were broken down further after aggregation, to even out waiting times when working on the dataset. This resulted in 944 aggregated chunks in total.
Files are named in the following format: 60k_GRCH38_germline_mergedgVCF_chr[start-pos]_[end-pos].bcf where chr, start-pos, end-pos are the chromosome, starting and ending positions of the variants contained in the respective chunk.
Specific file names and regions in bed format can be found in this tab delimited file:
To view variants that PASS or pass Missingness and DP but fail GQ and ABratio for a specific region:
This will include all sites where the Filter field is PASS, ABratio,GQ or GQ:ABratio.
To extract some information from the aggregated dataset:
Filter levels are: PASS, callrate, coverage, GQ, ABratio, callrate:coverage, callrate:GQ,coverage:GQ,callrate:ABratio, coverage:ABratio,GQ:ABratio, callrate:coverage:ABratio, callrate:coverage:GQ, callrate:GQ:ABratio, coverage:GQ:ABratio, callrate:coverage:GQ:ABratio.