NGS-PCA Interactive Report

Principal Component Analysis of 1000 Genomes Project — Batch Effects & Population Structure

Samples
Populations
Superpopulations
PCs computed
MP-significant PCs

Introduction & Background

What is NGS-PCA?

NGS-PCA performs principal component analysis (PCA) directly on next-generation sequencing (NGS) summary statistics — specifically, read-depth signals across the genome — without requiring genotype calls. It applies a randomized singular value decomposition (SVD) to a samples × genomic-bins coverage matrix, producing orthogonal principal components that capture the dominant axes of variation in sequencing data.

Why PCA on sequencing data?

NGS-PCA is designed to detect and characterise technical sources of variation in sequencing data — primarily batch effects arising from differences in library preparation, sequencing runs, and read-depth profiles — without requiring genotype calls. It is intended as a sequencing quality-control and diagnostics view of the data, highlighting dominant coverage-level structure that can otherwise be hard to detect from summary metrics alone. Key properties:

About this report

This interactive report applies NGS-PCA to the 1000 Genomes Project dataset. We decompose the coverage matrix, visualise the resulting PCs, and quantify how much of the variation in each PC is attributable to technical factors (sequencing batch) versus biological factors (continental ancestry, sex). Because NGS-PCA PCs primarily reflect coverage-level technical variation, batch effects are expected to dominate the leading components. However, when batches are enriched for specific populations or sexes, technical and biological signals can become confounded. We formally test for such enrichment and present effect-size metrics (η², Cramér's V, point-biserial r) so that readers can assess the severity of any confounding.

Variance Explained by Principal Components

These plots summarize the variance structure of the k components retained by the randomized SVD (RSVD). Because RSVD is a truncated decomposition that computes only the leading k singular values, the percentages shown are normalized to the sum of the retained eigenvalues (Σ σi2, i = 1…k), not to the full data-matrix variance (which requires all singular values). As a result, the cumulative curve reaches 100% at the last retained PC by construction of the normalization — it does not imply that the retained components explain all variance in the original data matrix. The dashed orange line marks the Marchenko–Pastur (MP) cutoff — PCs to the left carry more variance than expected under a random-noise null model and are considered statistically significant.

Scree Plot

Cumulative Variance

PCA Scatter Plots

Interactive 3D scatter plot of principal components (select up to three PCs from PC1–10). Toggle the colour overlay to explore batch effects, population labels, sex differences, or any continuous QC metric on a heatmap scale. When a categorical variable is selected, the second panel shows boxplots of PC distributions by group; when a continuous metric is selected, it shows Pearson and Spearman correlation bar charts (computed at runtime).

PCA 3D Scatter

0 100

Distribution

UMAP Projection

Two-dimensional UMAP embedding computed from a Marchenko–Pastur-selected number of principal components. Colour by categorical grouping or a continuous QC metric. The second panel shows boxplots (categorical) or correlation bar charts (continuous).

UMAP (PCs)

0 100

Distribution

Confounding Assessment

Before interpreting NGS-PCA results, it is important to check whether technical variables (sequencing batch) are confounded with biological variables (ancestry, sex). Because NGS-PCA PCs primarily capture coverage-level technical variation, batch effects are expected to drive the leading components. However, if batches are enriched for particular populations or sexes, apparent "batch effects" on PCs may be intertwined with biological signal, and vice versa. We use Pearson's χ² test of independence for each pair and report Cramér's V as an effect size. Standardised residuals identify which specific cells are over- or under-represented.

Pedigree-based Relatedness Distance

Rationale

A key validation for NGS-PCA is that it captures technical rather than biological (familial) variation. In genotype-based PCA, closely related individuals (parent–child, siblings) cluster tightly because they share large fractions of their genome. If NGS-PCA instead captures coverage-level technical variation, relatives should be no closer to one another than any other randomly nearby individual in PC space.

Method

We parse the 1000 Genomes pedigree file to identify all first-degree relative pairs (parent–child from Paternal/Maternal ID columns, and siblings from the pedigree Siblings column and shared-parent detection) that are both present in the NGS-PCA dataset. For each individual i with at least one relative in the dataset:

A paired Wilcoxon signed-rank test compares the two distance distributions across all individuals with relatives. The key question is whether relatives are closer than the overall nearest neighbour. If NGS-PCA does not recapitulate familial relatedness, we expect d(nearest non-self) < d(nearest relative) for most individuals — i.e., non-relatives tend to be closer than relatives. Failure of relatives to cluster (large p-value, or significant result showing non-relatives are closer) supports the interpretation that NGS-PCA primarily reflects technical rather than biological signal.

Results

Nearest Relative vs. Nearest Non-self Distance

Paired Comparison (dot plot)

Within- vs. Between-Ancestry Distance

Rationale

This analysis parallels the pedigree-based relatedness distance test above but asks a complementary question: does NGS-PCA group individuals by genetic ancestry (superpopulation)? The relatedness test showed that relatives are not closer together than strangers; this test asks whether same-ancestry individuals are closer together than different-ancestry individuals.

Critically, this analysis makes no prior assumption about what NGS-PCA captures. If the result is non-significant, it supports the claim that NGS-PCA reflects technical variation. If significant, we characterise the signal honestly and investigate whether it is driven by batch–ancestry confounding.

Method

For each individual i in the dataset, we compute the Euclidean distance in the top k Marchenko–Pastur-selected PCs to the nearest same-superpopulation neighbour (dwithin) and the nearest different-superpopulation neighbour (dbetween). The test statistic is the mean of δ = dbetweendwithin. A positive mean δ means same-ancestry individuals are closer together.

Global permutation test: superpopulation labels are shuffled uniformly across all samples and mean δ is recomputed. This is repeated N times to build a null distribution. The primary p-value is one-sided (right tail), testing whether δ > 0 (i.e., same-ancestry individuals are closer): p = (# permutations where δperm ≥ δobs + 1) / (N + 1) (Phipson & Smyth, 2010). A two-sided p-value is also reported for completeness.

Per-superpopulation p-values: for each ancestry group, the observed median δ is compared to its null distribution under label shuffling to yield a group-wise permutation p-value. This reveals whether specific groups drive the global signal.

Within-batch permutation (diagnostic): if the global test is significant, we ask whether the signal persists after accounting for batch–ancestry confounding. Superpopulation labels are permuted only within each sequencing batch, preserving the batch–ancestry frequency structure. If the within-batch p-value is non-significant while the global p-value is significant, the apparent ancestry signal is explained by batch confounding.

Results

Within- vs. Between-Ancestry Nearest-Neighbour Distance

Permutation Null Distribution (Global)

Permutation Test of η² Significance

Rationale

Effect sizes (η²) quantify how much variance in each PC is explained by RELEASE_BATCH or SUPERPOPULATION, but an effect size alone cannot establish whether the association is larger than expected by chance. A non-parametric permutation test provides exact p-values with no distributional assumptions — it tests whether each observed η² is inconsistent with random label assignment.

The manuscript's central claim is that NGS-PCA primarily captures technical (batch) rather than biological (ancestry) variation. Formal testing of both variables on every Marchenko–Pastur-selected PC is therefore essential: we expect batch η² to be statistically significant on multiple PCs, while ancestry η² should generally be non-significant or smaller.

Method

For each of the k Marchenko–Pastur-selected PCs and each variable (RELEASE_BATCH, SUPERPOPULATION, Coverage MAD, Coverage IQR, Median Coverage):

  1. Observed effect size — η² for categorical variables (RELEASE_BATCH, SUPERPOPULATION) computed via one-way ANOVA decomposition: η² = SSbetween / SStotal. Pearson r² for continuous coverage metrics.
  2. Null distribution — the variable's values are randomly shuffled across all samples and the effect size is recomputed. This is repeated N times (see Results below). Each shuffle preserves marginal distributions by construction (np.random.permutation) but destroys any real association.
  3. Empirical p-value (Phipson & Smyth, 2010 conservative correction): p = (# permutations where effectperm ≥ effectobs + 1) / (N + 1). The +1 in numerator and denominator prevents p = 0 and is conservative.

Significance thresholds: *** p < 0.001  ·  ** p < 0.01  ·  * p < 0.05  ·  ns p ≥ 0.05.

Results

Global Significance Overview — −log10(p) per PC

Observed Effect Size with Significance Annotations

Variance Partitioning (Partial η²)

Rationale

Marginal η² for RELEASE_BATCH and SUPERPOPULATION conflates shared and unique variance, because batch composition is non-uniform across ancestries. To isolate each predictor's unique contribution, we use a Type III partial η² decomposition via leave-one-out R² differences.

Model

For each Marchenko–Pastur-selected PC:

PCi ~ RELEASE_BATCH + SUPERPOPULATION + FAMILY_ROLE + MEAN_AUTOSOMAL_COV

Unique contributions are computed as leave-one-out R² differences (Type III SS):

Categorical predictors are one-hot encoded (drop-first); coverage is standardised. Results are generated by scripts/08_variance_partitioning.py.

Results

Variance Components per PC (Stacked Bar)

Within-Ancestry Stratified Batch Effect

Rationale

The strongest possible evidence that NGS-PCA captures batch — not ancestry — is to show that within a single superpopulation (e.g., EUR-only samples), batch still drives PC separation. If batch η² remains large within each ancestry group, the batch effect is independent of population structure. If it vanishes, the "batch" signal was confounded with ancestry.

Method

For each of the five 1000 Genomes superpopulations (AFR, AMR, EAS, EUR, SAS), samples are restricted to that group and η²(RELEASE_BATCH) and η²(FAMILY_ROLE) are computed independently for each Marchenko–Pastur-selected PC. A permutation test within each subset provides exact p-values (Phipson & Smyth conservative correction). Results are generated by scripts/09_within_ancestry_batch.py.

Results

Cross-Modality Validation: NGS-PCA vs. Array-Based PCA

Rationale

NGS-PCA derives principal components from autosomal read-depth profiles, which are sensitive to mappability, GC content, library preparation, and reference bias. Genotype-based PCA, in contrast, operates on allele frequencies from SNP array genotyping and is immune to these coverage-based artefacts. Comparing the two modalities tests whether the ancestry signal captured by NGS-PCA reflects genuine population structure or is dominated by technical artefacts.

Methods

Array-based ancestry PCs were derived from Illumina Global Screening Array (GSA) genotyping data processed through the illumina_idat_processing pipeline. Raw IDAT intensity files were converted to genotype calls via bcftools idat2gtc and gtc2vcf plugins (bypassing Genome Studio), followed by study-specific cluster recalibration, stringent sample QC (call rate ≥ 0.97, LRR SD ≤ 0.35), relatedness pruning, heterozygosity outlier removal, ancestry-stratified HWE filtering, LD pruning, and flashpca2 PCA computation with projection. Twenty array-based PCs were computed from unrelated, non-outlier samples and projected onto the full cohort.

PC–PC correlation matrix. Pearson and Spearman correlations are computed between each Marchenko–Pastur-selected NGS-PC and each of the 20 array-PCs. High absolute correlations along the diagonal indicate that the two modalities recover similar axes of variation.

Canonical Correlation Analysis (CCA). CCA (sklearn.cross_decomposition.CCA) identifies linear combinations of each PC set that maximise correlation, yielding canonical correlation coefficients that quantify overall subspace alignment without requiring matched component ordering.

Procrustes rotation. The NGS-PC and array-PC embeddings are optimally aligned via translation, rotation, and uniform scaling (scipy.spatial.procrustes). The Procrustes disparity (sum of squared residuals after alignment, normalised to [0, 1]) measures geometric dissimilarity; values near 0 indicate near-identical structure.

Residual batch test. For each NGS-PC, the top array-PCs are regressed out via OLS. η²(RELEASE_BATCH) is then computed on the residuals. If batch signal persists in the residuals — i.e. the portion of NGS-PC variance not explained by genotype-based ancestry — it suggests that the batch effect is a genuine technical artefact independent of population structure. If batch η² drops to near zero after adjustment, the apparent "batch" effect was confounded with ancestry differences between sequencing phases.

Results are generated by scripts/10_crossmodality_benchmark.py.

Results

NGS-PC × Array-PC Correlation Heatmap

Canonical Correlations (CCA)

Batch η² Before & After Array-PC Adjustment

Reference-Genome Bias Audit

Rationale

The GRCh38 reference genome is derived primarily from individuals of European ancestry. This creates the possibility that mappability bias — systematic differences in read alignment and coverage driven by reference similarity — could introduce ancestry-correlated gradients in genome-wide coverage statistics. If such gradients persist after bin curation and QC filtering, they indicate residual reference-driven technical bias that could confound NGS-PCA components.

Methods

QC metrics by superpopulation. Violin plots of key autosomal coverage statistics (mean, median, dispersion) are stratified by superpopulation (AFR, AMR, EAS, EUR, SAS) as a visual screen for ancestry-correlated differences. Each violin spans a single metric, with inner box-and-whisker summaries and individual data points overlaid.

OLS regression. For each QC metric, an ordinary least-squares model is fit: QC_metric ~ SUPERPOPULATION + RELEASE_BATCH. Partial η² is computed for each predictor after adjusting for the other: the outcome is residualised by group-centring with respect to the nuisance variable, and η² is then computed on the adjusted values. Significant SUPERPOPULATION terms after controlling for batch indicate possible reference-driven bias in that metric.

Feature–feature correlation heatmap. Absolute Pearson correlations are computed among the Marchenko–Pastur-selected NGS PCs, array-based genotype PCs (first 10), and the core QC metrics. Hierarchical clustering of the resulting matrix reveals whether QC variables co-cluster with ancestry-sensitive principal components, suggesting shared reference-driven variation.

Results are generated by scripts/11_reference_bias_audit.py.

Results

QC Metrics by Superpopulation

Partial η² — Ancestry vs Batch Effect on QC Metrics

Feature–Feature Correlation (NGS PCs × Array PCs × QC)

Variance Partitioning of QC Coverage Metrics

Rationale. The partial η² regression above tests whether each QC metric differs significantly across ancestries after adjusting for batch. Here we go further and decompose the total explained variance (R²) of each QC metric into three components:

If the reference genome introduces ancestry-correlated coverage bias, the unique ancestry component should be non-negligible for coverage metrics like mean/median, while dispersion metrics (MAD, IQR) may be less affected. Conversely, if observed ancestry–coverage associations vanish after conditioning on batch, the signal is driven by batch composition rather than reference-genome bias.

Results are generated by scripts/12_robust_qc_variance.py using the same leave-one-out R² approach as the PC variance partitioning analysis.

Results

QC Metric Variance Decomposition — Ancestry vs Batch

QC Metrics by Superpopulation (Batch-Conditioned)

PC × QC Variable Associations

Effect sizes between each PC and every available QC variable. Categorical variables (batch, population, sex, relatedness) use η² (eta-squared) — the proportion of PC-score variance explained by group membership, equivalent to SSbetween / SStotal from a one-way ANOVA. η² is the natural R²-analogue for a categorical predictor on a continuous outcome and is therefore the appropriate metric here: it directly answers "how much of this PC's spread does the grouping capture?". Cramér's V, by contrast, measures association between two categorical variables and would require discretising the PC scores, losing information. Continuous variables (coverage metrics) use Pearson . P-values from the corresponding ANOVA F-test (categorical) or Pearson t-test (continuous) are shown on hover. High values indicate that a QC variable explains substantial variance in that PC, suggesting the PC captures variation driven by that variable.

Association Heatmap (η² / r²)