
1. Overview

Welcome to the aggV2 Frequently Asked Questions (FAQ) page! This page will be regularly updated with user feedback. 

Help with aggV2

If you have any suggestions for the aggV2 FAQ, please reach out via the Genomics England Service Desk. Include "aggV2" in the title / description of your inquiry.

2. General

Mapping participant ID to sample ID

(question) The samples in aggV2 are referenced by sample ID in the VCFs. How do I link this to participant ID so that I can use the phenotype data?

(tick) All samples in aggV2 genotype VCFs are referenced by sample ID, which is the platekey ID of the sequenced genome. This is normally an 'LP' number such as LP3000204-DNA_B11, although some samples in aggV2 do not begin with LP. To map the sample ID to the participant ID, use the aggregate_gvcf_sample_stats LabKey table. This table includes all samples within aggV2, has one participant ID per row, and contains the participant ID to sample ID map. You can then join any phenotype data to aggV2 using the participant ID. See aggV2 Code Book::Phenotype Queries for an example of how to do this.
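The join described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names participant_id and platekey, the participant IDs, the second sample ID, and the phenotype values are all hypothetical, so check the actual schema of the aggregate_gvcf_sample_stats table before adapting it.

```python
# Hypothetical rows exported from the aggregate_gvcf_sample_stats LabKey table.
# The column names "participant_id" and "platekey" are assumptions.
sample_stats = [
    {"participant_id": "P0001", "platekey": "LP3000204-DNA_B11"},
    {"participant_id": "P0002", "platekey": "LP0000000-DNA_A01"},  # made-up ID
]

# Hypothetical phenotype rows keyed by participant ID.
phenotypes = [
    {"participant_id": "P0001", "phenotype": "case"},
    {"participant_id": "P0002", "phenotype": "control"},
]


def map_samples_to_phenotypes(sample_stats, phenotypes):
    """Return {sample ID: phenotype row}, joined on participant ID."""
    by_participant = {row["participant_id"]: row for row in phenotypes}
    joined = {}
    for row in sample_stats:
        pheno = by_participant.get(row["participant_id"])
        if pheno is not None:  # skip samples with no phenotype record
            joined[row["platekey"]] = pheno
    return joined


sample_to_pheno = map_samples_to_phenotypes(sample_stats, phenotypes)
```

The resulting dictionary is keyed by the sample ID used in the VCF header, so it can be looked up directly while iterating over VCF sample columns.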


Identifying the correct chunk to use

(question) aggV2 is split into 1,371 chunks across the genome. Is there an easy way to find the chunk that contains my gene and variants of interest?

(tick) Yes. All chunks are named in the following format: gel_mainProgramme_aggV2_chromosome_start_stop.vcf.gz - for example - gel_mainProgramme_aggV2_chr1_146620016_147701894.vcf.gz. We have written an easy-to-use script to help you identify the correct chunk to use for your variant(s), gene(s), or region(s) of interest. Please see here: aggV2 Code Book::General Information
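To illustrate how the naming convention encodes the chunk coordinates, here is a minimal Python sketch that parses chunk file names and returns those overlapping a query region. It is not the script from the Code Book; the function names are hypothetical, the second file name in the example list is made up, and the sketch assumes the start and stop values in the name are inclusive.

```python
import re

# Matches names of the form gel_mainProgramme_aggV2_<chromosome>_<start>_<stop>.vcf.gz
CHUNK_RE = re.compile(r"^gel_mainProgramme_aggV2_(chr[^_]+)_(\d+)_(\d+)\.vcf\.gz$")


def parse_chunk_name(filename):
    """Split a chunk file name into (chromosome, start, stop)."""
    match = CHUNK_RE.match(filename)
    if match is None:
        raise ValueError(f"not an aggV2 chunk name: {filename}")
    chrom, start, stop = match.groups()
    return chrom, int(start), int(stop)


def chunks_overlapping(filenames, chrom, start, stop):
    """Return the chunk names whose interval overlaps chrom:start-stop.

    Assumes chunk coordinates are inclusive at both ends.
    """
    hits = []
    for name in filenames:
        c, s, e = parse_chunk_name(name)
        if c == chrom and s <= stop and start <= e:
            hits.append(name)
    return hits


example_chunks = [
    "gel_mainProgramme_aggV2_chr1_146620016_147701894.vcf.gz",
    "gel_mainProgramme_aggV2_chr1_147701895_148800000.vcf.gz",  # made-up name
]
hits = chunks_overlapping(example_chunks, "chr1", 147000000, 147000100)
```

A region that spans a chunk boundary returns more than one file name, so downstream code should be prepared to read several chunks.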


Quality control status of the samples in aggV2

(question) Do I need to exclude any samples in aggV2 based on overall sample quality - such as coverage, contamination, and mapping rate? 

(tick) All 78,195 samples in aggV2 pass our internal sample QC thresholds which you can read more about here: Sample QC. You can see sample-level quality metrics for all samples in aggV2 using the aggregate_gvcf_sample_stats LabKey table. Please be aware that there are 706 (<1%) samples that are derived from saliva in aggV2. All these samples pass our QC thresholds but we do observe decreased quality (percent aligned reads and AT dropout) of these samples compared to blood samples.  


Participant Sex

(question) I see that there are three columns I can potentially use for sample sex in the aggregate_gvcf_sample_stats LabKey table - which should I use?

(tick) The aggregate_gvcf_sample_stats LabKey table has three columns that describe the sex of the participant: participant_phenotypic_sex, karyotype, and illumina_ploidy.

  • participant_phenotypic_sex: The participant's sex as stated by the clinician at the GMC (Male, Female, Indeterminate).
  • karyotype: The participant's sex chromosome ploidy as estimated by the Genomics England Interpretation Pipeline from WGS coverage. Only participants who have been run through the Rare Disease or Cancer interpretation pipelines have data; for all others this field is missing (NA).
  • illumina_ploidy: The participant's sex chromosome ploidy as estimated by the Illumina NSV4 Pipeline from WGS coverage. Only XX and XY estimates are output; other karyotypes are not available from the Illumina pipeline and are set to NA.

How to treat missing or discordant sex values depends on the analysis at hand.
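One possible way to flag missing or discordant values before deciding how to handle them is sketched below in Python. The value encodings ("Male", "Female", "XX", "XY", "NA") follow the descriptions above, but the exact strings stored in the table, the function name, and the three status labels are assumptions for illustration only.

```python
# Expected sex chromosome ploidy for each stated sex (assumed encodings).
EXPECTED_PLOIDY = {"Female": "XX", "Male": "XY"}


def sex_status(phenotypic_sex, karyotype, illumina_ploidy):
    """Classify one sample as 'concordant', 'discordant' or 'incomplete'.

    'incomplete' means the stated sex is not Male/Female, or neither
    inferred ploidy is available. A single available estimate that agrees
    with the stated sex is counted as concordant.
    """
    expected = EXPECTED_PLOIDY.get(phenotypic_sex)
    inferred = [v for v in (karyotype, illumina_ploidy) if v not in (None, "", "NA")]
    if expected is None or not inferred:
        return "incomplete"
    if all(v == expected for v in inferred):
        return "concordant"
    return "discordant"


status = sex_status("Female", "XX", "XX")
```

Whether "incomplete" or "discordant" samples are excluded, or handled some other way, remains a decision for the individual analysis.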


Genotypes with 'missing' calls

(question) I have come across many samples that have genotypes such as ./1 - what do these represent and how should I incorporate them into my analysis? 

(tick) Multi-allelic variants in aggV2 were decomposed to their bi-allelic representations using vt. This process generates 'partial genotypes'. It is crucial that researchers understand how these are generated and how they can be included. We have written extensive documentation on this. Please see the Variant Normalisation and Variant Representation pages. 
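As a small aid for spotting partial genotypes, the Python sketch below classifies a VCF GT string by whether none, some, or all of its alleles are missing. The function name and labels are hypothetical, and the sketch deliberately does not prescribe how partial genotypes should be treated; that is covered in the documentation pages referenced above.

```python
import re


def classify_genotype(gt):
    """Label a VCF GT string as 'called', 'partial' or 'missing'.

    Alleles are separated by '/' (unphased) or '|' (phased); a '.'
    marks a missing allele.
    """
    alleles = re.split(r"[/|]", gt)
    missing = [allele == "." for allele in alleles]
    if all(missing):
        return "missing"
    if any(missing):
        return "partial"
    return "called"


label = classify_genotype("./1")
```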


Chromosome Y and M

(question) Are variants for chromosomes Y and M included in aggV2?

(tick) Yes, they are, but we do not include site QC and FILTER annotations for these chromosomes. Suggestions are welcome on site QC for chromosomes Y and M!


Chromosome X - XY sample pHWE

(question) I can't find per-population HWE scores for male XY samples in the chrom X VCFs. Why is this?

(tick) Despite the scores being defined in the header, we don't calculate HWE scores for the XY male samples on chrom X.


Sample Ancestry

(question) How do I know the ancestry of the samples within aggV2?

(tick) We have calculated the genetically inferred ancestry of all samples within aggV2. Please see our methods here: Ancestry inference. You can access the ancestry membership probability for each 1000 Genomes super-population from the aggregate_gvcf_sample_stats LabKey table.
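If a single ancestry label per sample is needed, one simple approach is to take the super-population with the highest membership probability and leave the sample unassigned when that probability is below a cut-off. The Python sketch below shows the idea; the function name, the dictionary layout, the example probabilities, and the 0.8 threshold are all assumptions for illustration, not values taken from the aggV2 documentation.

```python
def assign_ancestry(probabilities, threshold=0.8):
    """Return the most probable super-population, or 'unassigned'.

    probabilities: {super-population code: membership probability}.
    threshold: arbitrary illustrative cut-off; choose one suited to the analysis.
    """
    best_pop, best_prob = max(probabilities.items(), key=lambda kv: kv[1])
    return best_pop if best_prob >= threshold else "unassigned"


# Made-up probabilities for the five 1000 Genomes super-populations.
example = {"AFR": 0.01, "AMR": 0.02, "EAS": 0.01, "EUR": 0.95, "SAS": 0.01}
ancestry = assign_ancestry(example)
```

A stricter threshold leaves more samples unassigned but yields more homogeneous groups, so the choice depends on the downstream analysis.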