Principal Component Analysis of 1000 Genomes Project — Batch Effects & Population Structure
NGS-PCA performs principal component analysis (PCA) directly on next-generation sequencing (NGS) summary statistics — specifically, read-depth signals across the genome — without requiring genotype calls. It applies a randomized singular value decomposition (SVD) to a samples × genomic-bins coverage matrix, producing orthogonal principal components that capture the dominant axes of variation in sequencing data.
NGS-PCA is designed to detect and characterise technical sources of variation in sequencing data — primarily batch effects arising from differences in library preparation, sequencing runs, and read-depth profiles — without requiring genotype calls. It is intended as a sequencing quality-control and diagnostics view of the data, highlighting dominant coverage-level structure that can otherwise be hard to detect from summary metrics alone. Key properties:
This interactive report applies NGS-PCA to the 1000 Genomes Project dataset. We decompose the coverage matrix, visualise the resulting PCs, and quantify how much of the variation in each PC is attributable to technical factors (sequencing batch) versus biological factors (continental ancestry, sex). Because NGS-PCA PCs primarily reflect coverage-level technical variation, batch effects are expected to dominate the leading components. However, when batches are enriched for specific populations or sexes, technical and biological signals can become confounded. We formally test for such enrichment and present effect-size metrics (η², Cramér's V, point-biserial r) so that readers can assess the severity of any confounding.
These plots summarize the variance structure of the k components retained by the randomized SVD (RSVD). Because RSVD is a truncated decomposition that computes only the leading k singular values, the percentages shown are normalized to the sum of the retained eigenvalues (Σ σi2, i = 1…k), not to the full data-matrix variance (which requires all singular values). As a result, the cumulative curve reaches 100% at the last retained PC by construction of the normalization — it does not imply that the retained components explain all variance in the original data matrix. The dashed orange line marks the Marchenko–Pastur (MP) cutoff — PCs to the left carry more variance than expected under a random-noise null model and are considered statistically significant.
Interactive 3D scatter plot of principal components (select up to three PCs from PC1–10). Toggle the colour overlay to explore batch effects, population labels, sex differences, or any continuous QC metric on a heatmap scale. When a categorical variable is selected, the second panel shows boxplots of PC distributions by group; when a continuous metric is selected, it shows Pearson and Spearman correlation bar charts (computed at runtime).
Two-dimensional UMAP embedding computed from a Marchenko–Pastur-selected number of principal components. Colour by categorical grouping or a continuous QC metric. The second panel shows boxplots (categorical) or correlation bar charts (continuous).
Before interpreting NGS-PCA results, it is important to check whether technical variables (sequencing batch) are confounded with biological variables (ancestry, sex). Because NGS-PCA PCs primarily capture coverage-level technical variation, batch effects are expected to drive the leading components. However, if batches are enriched for particular populations or sexes, apparent "batch effects" on PCs may be intertwined with biological signal, and vice versa. We use Pearson's χ² test of independence for each pair and report Cramér's V as an effect size. Standardised residuals identify which specific cells are over- or under-represented.
This analysis parallels the pedigree-based relatedness distance test above but asks a complementary question: does NGS-PCA group individuals by genetic ancestry (superpopulation)? The relatedness test showed that relatives are not closer together than strangers; this test asks whether same-ancestry individuals are closer together than different-ancestry individuals.
Critically, this analysis makes no prior assumption about what NGS-PCA captures. If the result is non-significant, it supports the claim that NGS-PCA reflects technical variation. If significant, we characterise the signal honestly and investigate whether it is driven by batch–ancestry confounding.
For each individual i in the dataset, we compute the Euclidean distance in the top k Marchenko–Pastur-selected PCs to the nearest same-superpopulation neighbour (dwithin) and the nearest different-superpopulation neighbour (dbetween). The test statistic is the mean of δ = dbetween − dwithin. A positive mean δ means same-ancestry individuals are closer together.
Global permutation test: superpopulation labels are shuffled uniformly
across all samples and mean δ is recomputed. This is repeated N times to build
a null distribution. The primary p-value is one-sided (right tail),
testing whether δ > 0 (i.e., same-ancestry individuals are closer):
p = (# permutations where δperm ≥ δobs + 1) / (N + 1)
(Phipson & Smyth, 2010). A two-sided p-value is also reported for
completeness.
Per-superpopulation p-values: for each ancestry group, the observed median δ is compared to its null distribution under label shuffling to yield a group-wise permutation p-value. This reveals whether specific groups drive the global signal.
Within-batch permutation (diagnostic): if the global test is significant, we ask whether the signal persists after accounting for batch–ancestry confounding. Superpopulation labels are permuted only within each sequencing batch, preserving the batch–ancestry frequency structure. If the within-batch p-value is non-significant while the global p-value is significant, the apparent ancestry signal is explained by batch confounding.
Effect sizes (η²) quantify how much variance in each PC is explained by RELEASE_BATCH or SUPERPOPULATION, but an effect size alone cannot establish whether the association is larger than expected by chance. A non-parametric permutation test provides exact p-values with no distributional assumptions — it tests whether each observed η² is inconsistent with random label assignment.
The manuscript's central claim is that NGS-PCA primarily captures technical (batch) rather than biological (ancestry) variation. Formal testing of both variables on every Marchenko–Pastur-selected PC is therefore essential: we expect batch η² to be statistically significant on multiple PCs, while ancestry η² should generally be non-significant or smaller.
For each of the k Marchenko–Pastur-selected PCs and each variable (RELEASE_BATCH, SUPERPOPULATION, Coverage MAD, Coverage IQR, Median Coverage):
np.random.permutation)
but destroys any real association.p = (# permutations where effectperm ≥ effectobs + 1)
/ (N + 1).
The +1 in numerator and denominator prevents p = 0 and is conservative.Significance thresholds: *** p < 0.001 · ** p < 0.01 · * p < 0.05 · ns p ≥ 0.05.
Marginal η² for RELEASE_BATCH and SUPERPOPULATION conflates shared and unique variance, because batch composition is non-uniform across ancestries. To isolate each predictor's unique contribution, we use a Type III partial η² decomposition via leave-one-out R² differences.
For each Marchenko–Pastur-selected PC:
PCi ~ RELEASE_BATCH + SUPERPOPULATION + FAMILY_ROLE + MEAN_AUTOSOMAL_COV
Unique contributions are computed as leave-one-out R² differences (Type III SS):
Categorical predictors are one-hot encoded (drop-first); coverage is standardised.
Results are generated by scripts/08_variance_partitioning.py.
The strongest possible evidence that NGS-PCA captures batch — not ancestry — is to show that within a single superpopulation (e.g., EUR-only samples), batch still drives PC separation. If batch η² remains large within each ancestry group, the batch effect is independent of population structure. If it vanishes, the "batch" signal was confounded with ancestry.
For each of the five 1000 Genomes superpopulations (AFR, AMR, EAS, EUR, SAS),
samples are restricted to that group and η²(RELEASE_BATCH) and
η²(FAMILY_ROLE) are computed independently for each
Marchenko–Pastur-selected PC. A permutation test within each subset provides
exact p-values (Phipson & Smyth conservative correction).
Results are generated by scripts/09_within_ancestry_batch.py.
NGS-PCA derives principal components from autosomal read-depth profiles, which are sensitive to mappability, GC content, library preparation, and reference bias. Genotype-based PCA, in contrast, operates on allele frequencies from SNP array genotyping and is immune to these coverage-based artefacts. Comparing the two modalities tests whether the ancestry signal captured by NGS-PCA reflects genuine population structure or is dominated by technical artefacts.
Array-based ancestry PCs were derived from Illumina Global Screening Array (GSA)
genotyping data processed through the
illumina_idat_processing pipeline. Raw IDAT intensity files
were converted to genotype calls via bcftools idat2gtc and
gtc2vcf plugins (bypassing Genome Studio), followed by study-specific
cluster recalibration, stringent sample QC (call rate ≥ 0.97, LRR SD ≤ 0.35),
relatedness pruning, heterozygosity outlier removal, ancestry-stratified HWE
filtering, LD pruning, and flashpca2 PCA computation with projection.
Twenty array-based PCs were computed from unrelated, non-outlier samples and
projected onto the full cohort.
PC–PC correlation matrix. Pearson and Spearman correlations are computed between each Marchenko–Pastur-selected NGS-PC and each of the 20 array-PCs. High absolute correlations along the diagonal indicate that the two modalities recover similar axes of variation.
Canonical Correlation Analysis (CCA). CCA
(sklearn.cross_decomposition.CCA) identifies linear combinations of
each PC set that maximise correlation, yielding canonical correlation coefficients
that quantify overall subspace alignment without requiring matched component ordering.
Procrustes rotation. The NGS-PC and array-PC embeddings are
optimally aligned via translation, rotation, and uniform scaling
(scipy.spatial.procrustes). The Procrustes disparity
(sum of squared residuals after alignment, normalised to [0, 1]) measures
geometric dissimilarity; values near 0 indicate near-identical structure.
Residual batch test. For each NGS-PC, the top array-PCs are regressed out via OLS. η²(RELEASE_BATCH) is then computed on the residuals. If batch signal persists in the residuals — i.e. the portion of NGS-PC variance not explained by genotype-based ancestry — it suggests that the batch effect is a genuine technical artefact independent of population structure. If batch η² drops to near zero after adjustment, the apparent "batch" effect was confounded with ancestry differences between sequencing phases.
Results are generated by scripts/10_crossmodality_benchmark.py.
The GRCh38 reference genome is derived primarily from individuals of European ancestry. This creates the possibility that mappability bias — systematic differences in read alignment and coverage driven by reference similarity — could introduce ancestry-correlated gradients in genome-wide coverage statistics. If such gradients persist after bin curation and QC filtering, they indicate residual reference-driven technical bias that could confound NGS-PCA components.
QC metrics by superpopulation. Violin plots of key autosomal coverage statistics (mean, median, dispersion) are stratified by superpopulation (AFR, AMR, EAS, EUR, SAS) as a visual screen for ancestry-correlated differences. Each violin spans a single metric, with inner box-and-whisker summaries and individual data points overlaid.
OLS regression. For each QC metric, an ordinary
least-squares model is fit:
QC_metric ~ SUPERPOPULATION + RELEASE_BATCH.
Partial η² is computed for each predictor after adjusting for the
other: the outcome is residualised by group-centring with respect to the
nuisance variable, and η² is then computed on the adjusted values.
Significant SUPERPOPULATION terms after controlling for batch indicate
possible reference-driven bias in that metric.
Feature–feature correlation heatmap. Absolute Pearson correlations are computed among the Marchenko–Pastur-selected NGS PCs, array-based genotype PCs (first 10), and the core QC metrics. Hierarchical clustering of the resulting matrix reveals whether QC variables co-cluster with ancestry-sensitive principal components, suggesting shared reference-driven variation.
Results are generated by scripts/11_reference_bias_audit.py.
Rationale. The partial η² regression above tests whether each QC metric differs significantly across ancestries after adjusting for batch. Here we go further and decompose the total explained variance (R²) of each QC metric into three components:
If the reference genome introduces ancestry-correlated coverage bias, the unique ancestry component should be non-negligible for coverage metrics like mean/median, while dispersion metrics (MAD, IQR) may be less affected. Conversely, if observed ancestry–coverage associations vanish after conditioning on batch, the signal is driven by batch composition rather than reference-genome bias.
Results are generated by scripts/12_robust_qc_variance.py
using the same leave-one-out R² approach as the PC variance
partitioning analysis.
Effect sizes between each PC and every available QC variable. Categorical variables (batch, population, sex, relatedness) use η² (eta-squared) — the proportion of PC-score variance explained by group membership, equivalent to SSbetween / SStotal from a one-way ANOVA. η² is the natural R²-analogue for a categorical predictor on a continuous outcome and is therefore the appropriate metric here: it directly answers "how much of this PC's spread does the grouping capture?". Cramér's V, by contrast, measures association between two categorical variables and would require discretising the PC scores, losing information. Continuous variables (coverage metrics) use Pearson r². P-values from the corresponding ANOVA F-test (categorical) or Pearson t-test (continuous) are shown on hover. High values indicate that a QC variable explains substantial variance in that PC, suggesting the PC captures variation driven by that variable.