
1. Overview

Using the multi-sample VCFs from aggV2, we generated principal components (PCs) for aggV2 participants, calculated pairwise relatedness among samples, and estimated the probability of genetic ancestry for each of five broad super-populations. This page outlines our approach and links to the outputs as provided in the Genomics England Research Environment.

We estimated broad genetic ancestry using the super-population labels from the 1000 Genomes Project phase 3 (1KGP3) as the truth set: we generated PCs for the 1KGP3 samples and projected all aggV2 participants onto these. The five broad super-populations are:

afr: African
amr: Admixed American
eas: East Asian
eur: European
sas: South Asian

2. Ancestry inference

We used the 1KGP3 to infer ancestry as follows:

  1. We took all unrelated samples from the 1KGP3.
  2. We subset these to our 188382 high-quality (HQ) SNPs.
  3. We further filtered for MAF > 0.05 in both the 1KGP3 and our data.
  4. We calculated the first 20 PCs using GCTA.
  5. We projected the aggV2 data onto the 1KGP3 PC loadings.
  6. We trained a random forest model to predict ancestry, with the following settings:
    1. Predictors: the first 8 1KGP3 PCs
    2. Number of trees (ntree): 400
    3. Training and prediction classes: the 1KGP3 amr, afr, eas, eur and sas super-populations
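
Steps 3 and 5 above can be illustrated with a small sketch. It assumes genotypes are coded as 0/1/2 alternate-allele counts and are standardised with reference-panel allele frequencies before being multiplied by the PC loadings. This shows the general idea only; GCTA's exact procedure may differ, and all function names and toy data here are hypothetical.

```python
from math import sqrt

def maf(genotypes):
    """Minor allele frequency from a list of 0/1/2 alternate-allele counts."""
    p = sum(genotypes) / (2 * len(genotypes))
    return min(p, 1 - p)

def shared_common_snps(ref, target, threshold=0.05):
    """Keep SNP ids whose MAF exceeds the threshold in both datasets."""
    return [s for s in ref
            if s in target
            and maf(ref[s]) > threshold
            and maf(target[s]) > threshold]

def project(sample, loadings, ref_freq):
    """Project one sample onto reference PC loadings.

    sample:   {snp: genotype (0/1/2)}
    loadings: {snp: [loading for PC1, PC2, ...]}
    ref_freq: {snp: alternate-allele frequency in the reference panel}
    SNPs are assumed to be MAF-filtered, so 0 < frequency < 1.
    """
    n_pcs = len(next(iter(loadings.values())))
    scores = [0.0] * n_pcs
    for snp, weights in loadings.items():
        p = ref_freq[snp]
        # Standardise the genotype using the reference allele frequency.
        z = (sample[snp] - 2 * p) / sqrt(2 * p * (1 - p))
        for k, w in enumerate(weights):
            scores[k] += w * z
    return scores

# Toy data (hypothetical): three SNPs, four samples per dataset.
ref = {"rs1": [0, 1, 2, 1], "rs2": [0, 0, 0, 0], "rs3": [2, 2, 1, 1]}
target = {"rs1": [1, 1, 0, 2], "rs2": [0, 1, 0, 0], "rs3": [2, 2, 2, 2]}
kept = shared_common_snps(ref, target)

# Hypothetical loadings for two PCs on the single retained SNP.
loadings = {"rs1": [0.5, -1.0]}
ref_freq = {"rs1": 0.5}
scores = project({"rs1": 2}, loadings, ref_freq)
```

Only `rs1` survives the filter here, because `rs2` is monomorphic in the reference data and `rs3` is monomorphic in the target data.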

2.1. Model performance

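Below we show the summary of the random forest model fit. The out-of-bag (OOB) error rate is 0.24%, and the confusion matrix shows that all six misclassified 1KGP3 samples involve the AMR class (five AFR/AMR confusions and one AMR sample predicted as EUR). EAS, EUR and SAS samples are classified without error.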

Random Forest ancestry model fit
 randomForest(x = rfdat[, pcs1_8], y = SuperPopLabels, ntree = 400, keep.inbag = T)
               Type of random forest: classification
                     Number of trees: 400
No. of variables tried at each split: 2

        OOB estimate of  error rate: 0.24%
Confusion matrix:
    AFR AMR EAS EUR SAS class.error
AFR 638   2   0   0   0  0.00312500
AMR   3 342   0   1   0  0.01156069
EAS   0   0 498   0   0  0.00000000
EUR   0   0   0 499   0  0.00000000
SAS   0   0   0   0 480  0.00000000
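
The error figures in this output can be reproduced from the confusion matrix itself. The sketch below copies the counts from the table above, with rows as the true 1KGP3 label and columns as the predicted label; the function names are hypothetical.

```python
labels = ["AFR", "AMR", "EAS", "EUR", "SAS"]

# Counts copied from the confusion matrix above
# (rows: true class, columns: predicted class).
confusion = [
    [638, 2, 0, 0, 0],
    [3, 342, 0, 1, 0],
    [0, 0, 498, 0, 0],
    [0, 0, 0, 499, 0],
    [0, 0, 0, 0, 480],
]

def class_errors(matrix):
    """Per-class error: share of each row that falls off the diagonal."""
    return [1 - row[i] / sum(row) for i, row in enumerate(matrix)]

def overall_error(matrix):
    """Overall error: off-diagonal count divided by the total count."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total

per_class = dict(zip(labels, class_errors(confusion)))
oob_percent = 100 * overall_error(confusion)
```

With 6 misclassified samples out of 2463, `oob_percent` comes to roughly 0.24, matching the reported OOB estimate.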

The probabilities for each individual are available at:


Additionally, for users interested in more fine-grained population structure, we provide a set of ancestry predictions based on sub-population ancestries from the 1KGP3. The steps are as above and differ only in steps 3 and 6:

3 - MAF filter of >0.01 for 1KGP3 and aggV2 data

6 - We trained a random forest model to predict ancestries based on 1KGP3 sub-populations

These data are available at:


3. Ancestry summary stats

Below is a summary table of the number of individuals (and the percentage of the cohort) assigned to any one ancestry with a probability of > 0.8.
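
The assignment rule behind these counts can be sketched as follows. This is a minimal illustration, assuming each individual has one probability per super-population and is left unassigned when no probability exceeds the threshold. The sample identifiers, probabilities and function names are hypothetical and are not taken from the aggV2 outputs.

```python
from collections import Counter

def assign_ancestry(probs, threshold=0.8):
    """Return the most probable label if it exceeds the threshold."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p > threshold else "unassigned"

def summarise(all_probs, threshold=0.8):
    """Count and percentage of individuals per assigned label."""
    counts = Counter(assign_ancestry(p, threshold) for p in all_probs.values())
    n = len(all_probs)
    return {label: (c, 100 * c / n) for label, c in counts.items()}

# Hypothetical probabilities for four individuals.
cohort = {
    "sample_1": {"afr": 0.01, "amr": 0.02, "eas": 0.00, "eur": 0.95, "sas": 0.02},
    "sample_2": {"afr": 0.00, "amr": 0.05, "eas": 0.00, "eur": 0.05, "sas": 0.90},
    "sample_3": {"afr": 0.30, "amr": 0.45, "eas": 0.00, "eur": 0.25, "sas": 0.00},
    "sample_4": {"afr": 0.00, "amr": 0.10, "eas": 0.00, "eur": 0.88, "sas": 0.02},
}
summary = summarise(cohort)
```

In this toy cohort, `sample_3` has no probability above 0.8 and is therefore counted as unassigned rather than being given its most probable label.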


4. PCs with 1KG samples and projected aggV2 samples, coloured by predicted ancestry

Below we show the first 6 of the PCs used for the ancestry inference of the aggV2 samples. The plots to the left show all samples (in gray), with the 1KGP3 samples plotted in different colours by super-population. The plots to the right show all samples (in gray), with the aggV2 samples plotted in different colours by predicted super-population (using a probability threshold of T=0.8). 1KGP3 samples are represented by crosses, and aggV2 samples by solid circles.

The following plot focuses on the EUR and EAS sub-populations from 1KGP3. 1KGP3 samples are represented by crosses, and aggV2 samples by solid circles. PCs for all 1KGP3 and aggV2 samples are included in gray. In addition:

Left: 1KGP3 samples in different colours by super-population

Middle: 1KGP3 samples in different colours by EAS sub-populations, with aggV2 predicted EAS plotted on top

Right: 1KGP3 samples in different colours by NFE and FIN populations, with aggV2 predicted EUR samples plotted in dark blue.