
1. Overview

The decomposition of multi-allelic variants into their bi-allelic forms leads to partial genotypes (such as "./1"), which users may not have encountered often in the past. Additionally, the decomposition of MNPs (Multi-Nucleotide Polymorphisms) into their constitutive SNPs (Single-Nucleotide Polymorphisms) can lead to an apparent duplication of a small subset of variants. 

On this page we explain these characteristics of aggV2, along with other key nuances of the dataset. It is important to understand how variants and sample genotypes are represented in aggV2 before working with this dataset.

2. Partial Genotypes

2.1. Overview

The normalisation procedure applied by vt decomposes all multi-allelic variants into their bi-allelic representations.

Definitions:

  • Multi-allelic: where a single variant contains three or more observed alleles, counting the reference as one, therefore allowing for two or more variant alleles (heterozygous genotype example: 1/2)
  • Bi-allelic: where a variant contains two observed alleles, counting the reference as one, and therefore allowing for one variant allele (heterozygous genotypes are always: 0/1)

Many downstream tools rely on variants being represented in their bi-allelic format. Bi-allelic representation also allows for easier allelic comparisons between call sets. 

One should be aware, however, of the issues in handling vertically decomposed variants, because partial genotypes are generated. 

From vt: Information is generally lost after vertically decomposing a variant, so care should be taken in interpreting the resultant values.

2.2. The OLD_MULTIALLELIC INFO tag

Multi-allelic variants that have been decomposed into their bi-allelic representations are identified by the OLD_MULTIALLELIC tag in the INFO field of aggV2. 
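
As a minimal sketch of how the tag value can be read, the snippet below splits an OLD_MULTIALLELIC string into its parts. It assumes the CHROM:POS:REF/ALT1/ALT2/... layout seen in the examples on this page; the function and class names are hypothetical and are not part of vt or aggV2.

```python
from typing import List, NamedTuple


class OldMultiallelic(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alts: List[str]


def parse_old_multiallelic(value: str) -> OldMultiallelic:
    """Split a 'CHROM:POS:REF/ALT1/ALT2/...' string into its parts."""
    # Split from the right so that a contig name containing ':' stays intact.
    chrom, pos, alleles = value.rsplit(":", 2)
    # The first allele is the REF; the remaining ones are the original ALTs.
    ref, *alts = alleles.split("/")
    return OldMultiallelic(chrom, int(pos), ref, alts)


parsed = parse_old_multiallelic("1:3759889:TA/TAA/TAAA/T")
print(parsed.chrom, parsed.pos, parsed.ref, parsed.alts)
# → 1 3759889 TA ['TAA', 'TAAA', 'T']
```

Records that share the same parsed value were derived from the same multi-allelic variant.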

2.3. Worked Example

Below is a worked example of how multi-allelic variants are represented in their bi-allelic format: 

2.3.1. Pre-decomposed (multi-allelic representation)

CHROM | POS     | ID | REF | ALT        | QUAL | FILTER | INFO | FORMAT | SAMPLE 1 | SAMPLE 2
chr1  | 3759889 | .  | TA  | TAA,TAAA,T | .    | PASS   | .    | GT     | 1/2      | 0/0
  • There are four alleles of this variant (including the REF allele). Sample 1 has genotype 1/2 (TAA, TAAA). Sample 2 has genotype 0/0 (TA, TA).

2.3.2. Post-decomposition (bi-allelic representation)

CHROM | POS     | ID | REF | ALT  | QUAL | FILTER | INFO                                     | FORMAT | SAMPLE 1 | SAMPLE 2
chr1  | 3759889 | .  | TA  | TAA  | .    | PASS   | OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T | GT     | 1/.      | 0/0
chr1  | 3759889 | .  | TA  | TAAA | .    | PASS   | OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T | GT     | ./1      | 0/0
chr1  | 3759889 | .  | TA  | T    | .    | PASS   | OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T | GT     | ./.      | 0/0
  • The three ALT alleles have been decomposed into three separate lines, where each line represents one of the ALT alleles against the same REF allele. 
  • The INFO field is populated with the OLD_MULTIALLELIC tag, which captures the original multi-allelic representation of the variant. 
  • Partial genotypes are generated for Sample 1, who has genotype TAA/TAAA for this variant. This is because:
    • For the first bi-allelic variant (TA, TAA), Sample 1 has the TAA ALT allele but not the TA REF allele (no information present for this allele) - so is therefore represented by the partial genotype: 1/.
    • For the second bi-allelic variant (TA, TAAA), Sample 1 has the TAAA ALT allele but not the TA REF allele (no information present for this allele) - so is therefore represented by the partial genotype: ./1
    • For the third bi-allelic variant (TA, T), Sample 1 has neither the T ALT allele nor the TA REF allele (no information present for these alleles) - so is therefore represented by the partial genotype: ./.
  • Partial genotypes are not generated for Sample 2, who has genotype TA/TA for this variant. This is because:
    • For all bi-allelic variants, Sample 2 is homozygous for the TA REF allele, so is always represented by the full genotype: 0/0
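
The recoding described in the bullets above can be sketched as follows. This is an illustration of the behaviour shown in the worked example, not vt's actual implementation; the function name is hypothetical. For each bi-allelic line, the REF allele stays 0, the ALT allele of that line becomes 1, and any other ALT allele becomes missing (.).

```python
def decompose_genotype(gt, n_alt):
    """Recode one multi-allelic GT string into one GT per bi-allelic line."""
    sep = "|" if "|" in gt else "/"
    alleles = gt.split(sep)
    decomposed = []
    for alt_index in range(1, n_alt + 1):
        recoded = []
        for allele in alleles:
            if allele == "0":
                recoded.append("0")   # REF is shared by every bi-allelic line
            elif allele == str(alt_index):
                recoded.append("1")   # the ALT allele represented on this line
            else:
                recoded.append(".")   # a different ALT (or missing): no information
        decomposed.append(sep.join(recoded))
    return decomposed


print(decompose_genotype("1/2", 3))  # Sample 1 → ['1/.', './1', './.']
print(decompose_genotype("0/0", 3))  # Sample 2 → ['0/0', '0/0', '0/0']
```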

2.4. FORMAT field inheritance

The per-sample FORMAT tags (for example: GT - genotype, GQ - genotype quality, DP - depth, AD - allelic depth, PL - genotype likelihoods) are also vertically decomposed and follow two rules:

  1. The GQ and DP tags are always identical for a given genotype when the variant is decomposed
  2. The AD and PL tags are always representative of the two specific alleles per bi-allelic variant

The examples below illustrate these rules for a single sample: 

1) The sample is homozygous for the C REF allele for the chr18:7311133:C/A/T variant

  • The sample genotype will always be 0/0 for all bi-allelic representations of this variant
  • The GQ and DP tags are identical for all bi-allelic representations of this variant
  • The AD and PL tags are identical for all bi-allelic representations of this variant as no information is lost after decomposition

Chrom | Pos     | Ref | Alt | OLD_MULTIALLELIC    | GT  | FT | GQ | GQX | DP | DPF | AD   | PL
chr18 | 7311133 | C   | A   | chr18:7311133:C/A/T | 0/0 | .  | 56 | .   | 27 | 0   | 27,0 | 0,255,255
chr18 | 7311133 | C   | T   | chr18:7311133:C/A/T | 0/0 | .  | 56 | .   | 27 | 0   | 27,0 | 0,255,255


2) The sample is heterozygous (T/C) for the chr18:7365195:T/C/G variant

  • The sample genotype is 0/1 for the T/C bi-allelic variant but partial (0/.) for the T/G bi-allelic variant as no information is present for the G allele
  • The GQ and DP tags are identical for all bi-allelic representations of this variant
  • The AD and PL tags are representative of the two specific alleles per bi-allelic variant (no depth in AD for the G allele, and PL set to 255 for the T/G and G/G genotypes)

Chrom | Pos     | Ref | Alt | OLD_MULTIALLELIC    | GT  | FT   | GQ  | GQX | DP | DPF | AD    | PL
chr18 | 7365195 | T   | C   | chr18:7365195:T/C/G | 0/1 | PASS | 200 | 46  | 31 | 3   | 14,17 | 232,0,197
chr18 | 7365195 | T   | G   | chr18:7365195:T/C/G | 0/. | PASS | 200 | 46  | 31 | 3   | 14,0  | 232,255,255


3) The sample is homozygous (A/A) for the chr18:7403330:G/A/C variant 

  • The sample genotype is 1/1 for the G/A bi-allelic variant but partial (./.) for the G/C bi-allelic variant as no information is present for the G or C allele
  • The GQ and DP tags are identical for all bi-allelic representations of this variant
  • The AD and PL tags are representative of the two specific alleles per bi-allelic variant (no depth in AD for the G or C allele, and PL set to 255 for the G/C and C/C genotypes)

Chrom | Pos     | Ref | Alt | OLD_MULTIALLELIC    | GT  | FT   | GQ | GQX | DP | DPF | AD   | PL
chr18 | 7403330 | G   | A   | chr18:7403330:G/A/C | 1/1 | PASS | 30 | 17  | 11 | 0   | 0,11 | 218,33,0
chr18 | 7403330 | G   | C   | chr18:7403330:G/A/C | ./. | PASS | 30 | 17  | 11 | 0   | 0,0  | 218,255,255
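
As a rough illustration of the two inheritance rules, the sketch below groups records by their OLD_MULTIALLELIC value and reports any group whose records disagree on a tag that is expected to be shared (GQ and DP). The record layout (plain dictionaries) and the function name are assumptions made for this example; the values are those read from example 3.

```python
from collections import defaultdict


def inconsistent_sites(records, shared=("GQ", "DP")):
    """Return OLD_MULTIALLELIC values whose sibling records disagree on a shared tag."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["OLD_MULTIALLELIC"]].append(rec)
    return [
        site
        for site, siblings in groups.items()
        if any(sib[tag] != siblings[0][tag] for sib in siblings for tag in shared)
    ]


# One sample, two bi-allelic records derived from chr18:7403330:G/A/C (example 3).
records = [
    {"OLD_MULTIALLELIC": "chr18:7403330:G/A/C", "ALT": "A", "GT": "1/1",
     "GQ": 30, "DP": 11, "AD": (0, 11), "PL": (218, 33, 0)},
    {"OLD_MULTIALLELIC": "chr18:7403330:G/A/C", "ALT": "C", "GT": "./.",
     "GQ": 30, "DP": 11, "AD": (0, 0), "PL": (218, 255, 255)},
]

print(inconsistent_sites(records))                  # GQ and DP agree → []
print(inconsistent_sites(records, shared=("AD",)))  # AD is allele-specific, so it differs
```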

2.5. Summary

  • Bi-allelic variants derived from the same multi-allelic variant are identified by the OLD_MULTIALLELIC tag in the INFO field. The tag is present in all bi-allelic representations of the respective multi-allelic variant.
  • Vertical decomposition results in partial genotypes:
    • 0/0 sample genotypes will always be decomposed to 0/0 for remaining bi-allelic variants of the same OLD_MULTIALLELIC tag
    • 0/1 sample genotypes will always be decomposed to 0/. for remaining bi-allelic variants of the same OLD_MULTIALLELIC tag
    • 1/1 sample genotypes will always be decomposed to ./. for remaining bi-allelic variants of the same OLD_MULTIALLELIC tag
  • The GQ and DP tags are always identical for a given genotype when the variant is decomposed; whereas the AD and PL tags are always representative of the two specific alleles per bi-allelic variant

If deemed absolutely necessary, one may want to post-process partial genotypes such as 1/. to the best-guess genotype based on the PL values, and recompute any fields that involve alleles after decomposition. 
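
A minimal sketch of such a post-processing step is shown below. It assumes a diploid, bi-allelic record whose PL values are ordered 0/0, 0/1, 1/1 (the standard VCF ordering) and simply picks the genotype with the lowest PL. The function name is hypothetical, and this is not a recommended or validated procedure.

```python
def best_guess_genotype(pl):
    """Pick the diploid bi-allelic genotype with the lowest PL (most likely)."""
    genotypes = ("0/0", "0/1", "1/1")  # VCF ordering of PL for one ALT allele
    if len(pl) != len(genotypes):
        raise ValueError("expected three PL values for a diploid bi-allelic record")
    # min() returns the first index on ties, so ties resolve towards 0/0.
    best = min(range(len(pl)), key=lambda i: pl[i])
    return genotypes[best]


print(best_guess_genotype((232, 0, 197)))    # → 0/1
print(best_guess_genotype((232, 255, 255)))  # → 0/0
```

Note that for the partial genotypes in the examples above (PL 232,255,255 and 218,255,255) this returns 0/0, even though the sample carries a different ALT allele at the site, which is why such post-processing should be interpreted with care.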

aggV1

Please note that aggV1 does not contain partial genotypes. Instead, all partial genotypes (containing ".") were represented with the reference allele call (0). 

3. Variant Duplication & Multi-Nucleotide Polymorphisms (MNPs)

3.1. Overview

A duplicated variant is a variant line with the same CHROM, POS, REF, and ALT that is represented more than once in aggV2. It was estimated that approximately 0.02% of the variants in the dataset are duplicated.

There is no exact solution to this issue and it is important to handle duplicated variants with care as their allele frequencies might be affected. 

Duplications arise from the decomposition of MNPs (Multi-Nucleotide Polymorphisms) into their constitutive SNP (Single-Nucleotide Polymorphisms) representations by vt (vt decompose_blocksub). This step is carried out post-aggregation. 

SNPs derived from the decomposition of MNPs do not combine/merge with canonical SNPs (not derived from MNPs). This is what causes the duplication of lines. A single variant may be duplicated many times if the MNP is long and there are many canonical SNP variants. 

Definitions

SNP: The reference and alternate sequences are both of length 1 and the two nucleotides differ from one another.

MNP: The reference and alternate sequences are of the same length, which is greater than 1, and the nucleotides differ from one another at every position. 
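
The two definitions above can be expressed as a small check. This is a sketch of the definitions as written on this page; the function name and the "OTHER" label are hypothetical.

```python
def classify_variant(ref, alt):
    """Classify a REF/ALT pair as SNP, MNP or OTHER using the definitions above."""
    if len(ref) == 1 and len(alt) == 1 and ref != alt:
        return "SNP"
    same_length = len(ref) == len(alt) and len(ref) > 1
    # Every position must differ for the pair to count as an MNP here.
    if same_length and all(r != a for r, a in zip(ref, alt)):
        return "MNP"
    return "OTHER"


print(classify_variant("C", "T"))    # → SNP
print(classify_variant("CA", "TG"))  # → MNP
print(classify_variant("TA", "T"))   # → OTHER (a deletion)
```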

3.2. The OLD_CLUMPED INFO tag

SNPs derived from the decomposition of MNPs are flagged with the OLD_CLUMPED tag in the INFO field. 

3.3. Worked Example

Pre-decomposed MNP for a single sample:

CHROM | POS    | ID | REF | ALT | QUAL | FILTER | INFO      | FORMAT | SAMPLE 1
20    | 763837 | .  | CA  | TG  | .    | PASS   | AC=1;AN=2 | GT     | 0/1


Decomposed MNP for a single sample:

The CA > TG MNP has been decomposed into its constitutive SNPs: C > T and A > G.

The INFO field has been populated with the OLD_CLUMPED tag, which keeps track of the original MNP that was decomposed. 

CHROM | POS    | ID | REF | ALT | QUAL | FILTER | INFO                                  | FORMAT | SAMPLE 1
20    | 763837 | .  | C   | T   | .    | PASS   | AC=1;AN=2;OLD_CLUMPED=20:763837:CA/TG | GT     | 0/1
20    | 763838 | .  | A   | G   | .    | PASS   | AC=1;AN=2;OLD_CLUMPED=20:763837:CA/TG | GT     | 0/1
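
The single-sample decomposition above can be sketched as follows. This mirrors the behaviour shown in the example rather than the code of vt decompose_blocksub; the function name and the tuple layout are assumptions. The OLD_CLUMPED value follows the CHROM:POS:REF/ALT layout seen in the example.

```python
def decompose_mnp(chrom, pos, ref, alt):
    """Split an MNP into per-base SNP records tagged with the original MNP."""
    if len(ref) != len(alt) or len(ref) < 2:
        raise ValueError("expected an MNP with equal-length REF and ALT (length > 1)")
    old_clumped = f"{chrom}:{pos}:{ref}/{alt}"
    return [
        (chrom, pos + offset, r, a, old_clumped)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a  # defensive: under the definition used here every base differs
    ]


for record in decompose_mnp("20", 763837, "CA", "TG"):
    print(record)
# → ('20', 763837, 'C', 'T', '20:763837:CA/TG')
# → ('20', 763838, 'A', 'G', '20:763837:CA/TG')
```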


Pre-decomposed MNP in multi-sample:

Sample 1 is heterozygous for the CA > TG MNP. 
Sample 2 is heterozygous for the A > G SNP. 

CHROM | POS    | ID | REF | ALT | QUAL | FILTER | INFO      | FORMAT | SAMPLE 1 | SAMPLE 2
20    | 763837 | .  | CA  | TG  | .    | PASS   | AC=1;AN=4 | GT     | 0/1      | 0/0
20    | 763838 | .  | A   | G   | .    | PASS   | AC=1;AN=4 | GT     | 0/0      | 0/1


Decomposed MNP in multi-sample: 

The CA > TG MNP in Sample 1 has been decomposed into its constitutive SNPs: C > T and A > G. The INFO field has been populated with the OLD_CLUMPED tag, which keeps track of the original MNP that was decomposed. 

No change occurs to the A > G SNP for Sample 2. 

SNPs derived from the decomposition of MNPs do not combine/merge with canonical SNPs (not derived from MNPs).

Therefore a duplicated variant (identical CHROM, POS, REF, ALT) is created for the A > G SNP. 

These can be differentiated using the OLD_CLUMPED INFO tag, as this is populated when an MNP is decomposed, but is empty (.) for canonical SNPs. 

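CHROM | POS    | ID | REF | ALT | QUAL | FILTER | INFO                                  | FORMAT | SAMPLE 1 | SAMPLE 2
20    | 763837 | .  | C   | T   | .    | PASS   | AC=1;AN=4;OLD_CLUMPED=20:763837:CA/TG | GT     | 0/1      | 0/0
20    | 763838 | .  | A   | G   | .    | PASS   | AC=1;AN=4;OLD_CLUMPED=20:763837:CA/TG | GT     | 0/1      | 0/0
20    | 763838 | .  | A   | G   | .    | PASS   | AC=1;AN=4                             | GT     | 0/0      | 0/1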

3.4. Identifying unique variants

As mentioned, only ~0.02% of the variants in the dataset are duplicated (identical CHROM, POS, REF, ALT). 

However, every variant is unique if the following fields are concatenated per variant: 

CHROM, POS, REF, ALT, INFO/OLD_MULTIALLELIC, INFO/OLD_CLUMPED
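
A minimal sketch of building such a key is shown below, using the three decomposed records from the multi-sample example in section 3.3. The dictionary layout and function names are hypothetical; a missing tag is represented as "." to match an empty INFO value.

```python
from collections import Counter


def variant_key(rec, with_tags=True):
    """Build a key from CHROM, POS, REF, ALT and, optionally, the two OLD_* tags."""
    key = (rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"])
    if with_tags:
        key += (rec.get("OLD_MULTIALLELIC", "."), rec.get("OLD_CLUMPED", "."))
    return key


def duplicated_keys(records, with_tags=True):
    """Return the keys that occur on more than one record."""
    counts = Counter(variant_key(rec, with_tags) for rec in records)
    return [key for key, n in counts.items() if n > 1]


records = [
    {"CHROM": "20", "POS": 763837, "REF": "C", "ALT": "T", "OLD_CLUMPED": "20:763837:CA/TG"},
    {"CHROM": "20", "POS": 763838, "REF": "A", "ALT": "G", "OLD_CLUMPED": "20:763837:CA/TG"},
    {"CHROM": "20", "POS": 763838, "REF": "A", "ALT": "G"},  # canonical SNP, no tag
]

print(duplicated_keys(records, with_tags=False))  # → [('20', 763838, 'A', 'G')]
print(duplicated_keys(records, with_tags=True))   # → []
```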

4. Variants with allele count of 0

4.1. Overview

There are a few instances of variants in aggV2 that have an allele count (AC) of zero. There are two reasons as to why this is observed:

  1. Participants that withdraw from the programme are removed from the dataset post-aggregation. Their genotypes are removed, but their variant lines are kept, to avoid confusion over variant lines that may contain partial genotypes. Such variants will have an AC of 0 when the withdrawn participants were the only carriers, as the AC is calculated post-removal of withdrawn participants. 
  2. In the single sample gVCFs, certain variants are 'forced-genotyped' meaning that a variant call is made even if no variant exists - i.e. the sample genotype will not be in a REF BLOCK but be coded as 0/0. Forced-genotype variants are preserved in aggV2. Therefore if all samples are 0/0 for a particular forced-genotyped variant, then the AC of that variant will be 0. 
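
To illustrate how an AC of 0 arises in both cases, the sketch below counts AC and AN from a list of bi-allelic GT strings using the conventional VCF definitions (missing alleles are not counted). This is an assumption about the calculation and not necessarily the exact procedure used to produce aggV2; the function name is hypothetical.

```python
import re


def allele_counts(genotypes):
    """Return (AC, AN) for bi-allelic GT strings, skipping missing alleles."""
    ac = an = 0
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue  # missing alleles contribute to neither AC nor AN
            an += 1
            if allele == "1":
                ac += 1
    return ac, an


# A carrier (0/1) plus two non-carriers, then the same site after the carrier is removed.
print(allele_counts(["0/1", "0/0", "0/0"]))  # → (1, 6)
print(allele_counts(["0/0", "0/0"]))         # → (0, 4)
```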

5. Maximum alternate alleles per variant

In the gVCF aggregation process by gvcf genotyper, multi-allelic variants with more than 50 alternate alleles are discarded and not included in aggV2. 

6. Variants within deletions

6.1. Overview

For autosomal variants, the majority of samples will have diploid genotypes (e.g. 0/1). However, some samples will have haploid (hemizygous-like) calls (e.g. 1) for certain variants. Such haploid calls indicate that the respective sample-genotype identified on one chromosome is located within a deletion identified on the other chromosome for the same sample. 

These haploid calls are not produced as part of the aggregation procedure, but are seen in the single-sample gVCFs.

6.2. Worked Example

In the single-sample gVCF, we have identified the following variant where the genotype is represented as a haploid ALT call:

CHROM | POS     | REF | ALT | GT | Description
chr1  | 2118756 | A   | T   | 1  | Haploid ALT genotype identified

On closer inspection of the single-sample gVCF, we see a heterozygous call (0/1) for a 2bp deletion (TGA > T) recorded 2bp upstream of the variant, at position 2118754, which removes bases 2118755 - 2118756. Therefore, the A > T SNP above is represented as haploid, because it is located within a known deletion on the other chromosome. 

Please note that reference calls spanning that deletion are also haploid (the G reference call).

CHROM | POS     | REF | ALT | GT  | Description
chr1  | 2118754 | TGA | T   | 0/1 | 2bp deletion of bases GA from position 2118755 - 2118756. Called as heterozygous (diploid).
chr1  | 2118755 | G   | .   | 0   | We know the G base in position 2118755 is deleted on one chromosome, but on the other it is REF - therefore the hemizygous genotype 0 is called (haploid).
chr1  | 2118756 | A   | T   | 1   | We know the A in position 2118756 is deleted on one chromosome, but on the other it is ALT - therefore a hemizygous genotype 1 is called (haploid).

In aggV2, the haploid call for that sample-genotype is carried over as haploid from the single sample gVCF.
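
The relationship in the worked example can be checked with a short sketch: compute the bases removed by the deletion record and test whether the haploid call falls inside that span. It assumes a simple deletion whose ALT is the leading anchor base(s) of REF; the function names are hypothetical.

```python
def deleted_positions(pos, ref, alt):
    """Return the range of positions removed by a simple anchored deletion."""
    if not (len(ref) > len(alt) and ref.startswith(alt)):
        raise ValueError("expected a deletion whose ALT is a prefix of REF")
    start = pos + len(alt)      # first deleted base, after the anchor
    end = pos + len(ref) - 1    # last deleted base
    return range(start, end + 1)


def is_haploid(gt):
    """A GT with no allele separator holds a single allele."""
    return "/" not in gt and "|" not in gt


span = deleted_positions(2118754, "TGA", "T")
print(list(span))                          # → [2118755, 2118756]
print(is_haploid("1") and 2118756 in span)  # → True
```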

7. Help & Support


Help with aggV2

Please reach out via the Genomics England Service Desk for any issues related to the aggV2 aggregation or companion datasets. Include "aggV2" in the title / description of your inquiry.