Page tree
Skip to end of metadata
Go to start of metadata

1. Overview

This page details the design and implementation for sample-level quality control (QC) for aggV2. The aggregation comprises 78,195 germline samples (rare disease and cancer germline), built from gVCFs of the Version 10 Main Programme (MP) Data Release for Research. The sample QC strategy used is based on similar approaches applied on other large scale cohorts (e.g. gnomAD, TOPMed) and on our own observations and investigations on previous aggregations at Genomics England. 

2. Accompanying LabKey Table

A LabKey table 'aggregate_gvcf_sample_stats ', provided as a companion to aggV2, contains quality statistics for all samples in aggV2. Please see the Data Dictionary, which can be found here under the respective release, for a description of each column. 

3. Sample consistency

All 78,195 samples in aggV2 were sequenced with 150bp paired-end reads in a single lane of an Illumina HiSeq X instrument and uniformly processed on the Illumina North Star Version 4 Whole Genome Sequencing Workflow (NSV4, version; which comprises the iSAAC Aligner (version and Starling Small Variant Caller (version 2.4.7). Samples were aligned to the Homo Sapiens NCBI GRCh38 assembly with decoys. The contractual obligations from Illumina describe that germline samples must be sequenced to produce at least 85Gb of sequence data with sequencing quality of at least 30. Alignments must cover at least 95% of the genome at 15X or above with well mapped reads (mapping quality >10) after discarding duplicates. Genomes are not delivered to Genomics England if they do not fulfil these obligations.

Additionally, all samples in aggV2 were prepared in the same way. DNA was extracted and processed based on the Genomics England Sample Handling Guidelines. DNA samples were received in FluidX tubes (Brooks) and accessioned into Laboratory Management Information System (LIMS) at UK Biocentre ((Milton Keynes). DNA samples undergo quality control assessment (concentration, volume, purity and degradation) at UK Biocentre. Upon sample receipt they were visually inspected to ensure the volume was aligned with expectations and a concentration assessment carried out using a dsDNA-specific method. DNA samples in the concentration range of 10 ng/ul to 100 ng/ul (Picogreen) were accepted into the sequencing workflows. Following automated library preparation, libraries were quantified using automated qPCR, clustered and sequenced. Libraries were prepared using the Illumina TruSeq DNA PCR-Free High Throughput Sample Preparation kit or the Illumina TruSeq Nano High Throughput Sample Preparation kit.

Sample Source

Not all samples are from the same tissue source. Please see below for a detailed breakdown by source. 

4. Participant & Genome Inclusion Criteria

  • Only Main Programme V10 participants are included
  • Participant 'Programme Consent Status' for Research must equal 'Consenting’ from the participant table (this removes samples which are not available to all researchers from the aggregation, such as samples for the Tracer-X participants).
  • For known duplicated participants, the participant ID by latest genome delivery by date was used
  • The genome platekey ID must exist in the MP V10 plated_samples table
  • The genome delivery type / study ID must be in both:
    • Bertha Web (Genomics England Internal Bioinformatic Pipeline): rare disease germline | cancer germline
    • OpenCGA catalog (Genomics England Internal Bioinformatic Pipeline): RD38 | CG38
  • The genome assembly must be:
    • Bertha Web (Genomics England Internal Bioinformatic Pipeline): GRCh38 (Illumina NSV4 Pipeline)
  • The genome delivery from Illumina must include at least the BAM and the gVCF file
  • Genomes with no contamination information are removed
  • Genomes with no samtools statistics from Bertha are removed (these have not completed the pipeline fully)
  • Genomes with sample type ‘tumour’ are removed

5. Sample QC Filters

All 78,195 samples included in aggV2 pass the following quality control filters:

Sample AttributeDescription
Sample Contamination (freemix)less than 0.03
Ratio of SNV Het to Hom callsless than 3
Total Number of SNVsbetween 3.2M - 4.7M
Array Concordancegreater than 90%
Median Fragment Sizegreater than 250bp
Excess of Chimeric Readsless than 5%
Percentage of Mapped Readsgreater than 60%
Percentage AT Dropoutless than 10%

6. Sample source & Library Preparation 

The overwhelming majority of aggV2 are blood samples prepared used EDTA and TruSeq PCR-Free High Throughput (~99% of samples). A small fraction however are prepared from different tissue and library types.  

sample sourcesample preparation methodsample library typenumber of samplespercent of samples
BLOODEDTATruSeq PCR-Free High Throughput77,21298.74
SALIVAORAGENETruSeq PCR-Free High Throughput7060.90
TISSUEFFTruSeq PCR-Free High Throughput1220.16
FIBROBLASTFFTruSeq PCR-Free High Throughput720.09
BLOODEDTATruSeq Nano High Throughput500.06
GERMLINEFFTruSeq PCR-Free High Throughput330.04

We notice that saliva samples tend to have lower quality than blood ones for some metrics, as shown below for the percent of AT drop out vs percent of aligned reads. Please make sure you take this into account for your analysis where relevant. Use the LabKey table 'aggregate_gvcf_sample_stats' to determine the sample source, preparation method, and library type. 

7. Coverage Summary

The below graphs and tables describe the coverage summary statistics for the 78,195 samples in aggV2. 

7.1. Percent coverage at 15X

Min.1st Qu.MedianMean3rd Qu.Max.

7.2. Mean genome-wide coverage

Min.1st Qu.MedianMean3rd Qu.Max.

8. Help & Support

Help with aggV2

Please reach out via the Genomics England Service Desk for any issues related to the aggV2 aggregation or companion datasets, including "aggV2" in the title / description of your inquiry.