Illumina IDAT Pipeline Report

📊 Overview

Total Samples: 30
Passing QC: 24
Failing QC: 6

Male / Female: 15 / 15
Mean Call Rate: 0.9813
Mean LRR SD: 0.2072

BAF SD Flags

Het Rate Outliers

QC Pass Rate: 80.0% (24 of 30)

Summary Statistics (Stage 2)

Metric	Mean	Median	SD	Min	Max
Call Rate	0.9813	0.9819	0.0141	0.9388	0.9998
LRR SD	0.2072	0.1993	0.0838	0.0825	0.4123
LRR Mean	-0.0017	-0.0013	0.0107	-0.0188	0.0151
LRR Median	-0.0021	-0.0002	0.0117	-0.0216	0.0184
BAF SD (het sites)	0.0636	0.0573	0.0363	0.0205	0.1823
Heterozygosity Rate	0.2503	0.2477	0.0301	0.1534	0.3412

Stage 1 → Stage 2 Comparison

Metric	Stage 1 Mean	Stage 2 Mean	Change
Call Rate	0.9813	0.9813	+0.0000
LRR SD	0.2072	0.2072	+0.0000
BAF SD	0.0636	0.0636	+0.0000
Het Rate	0.2503	0.2503	+0.0000

🎯 GWAS QC Best Practice Thresholds

Reference thresholds based on Anderson et al. (2010), Marees et al. (2018), and Turner et al. (2011). These defaults are used to highlight samples and variants requiring attention, and should be interpreted with study design, ancestry composition, and downstream analysis goals in mind.

Metric	Threshold	Type	Rationale / Evidence
Sample Call Rate	≥ 0.97	Pass/Fail	Low call rate indicates poor DNA quality or failed hybridization (Anderson 2010; Marees 2018)
LRR SD	≤ 0.35	Pass/Fail	High LRR SD indicates noisy intensity data unsuitable for CNV/GWAS (Turner 2011)
BAF SD (het sites)	≤ 0.15	Flag	Elevated BAF SD may indicate contamination or noisy heterozygous intensity clusters (Turner 2011; Marees 2018)
Heterozygosity Rate	Within ± 3 SD	Flag	Outliers may indicate contamination (high) or inbreeding (low); interpret against cohort ancestry structure (Anderson 2010; Marees 2018)
Variant Call Rate	≥ 0.98	Variant QC	Poorly genotyped variants produce spurious associations (Anderson 2010; Turner 2011)
HWE p-value	≥ 1 × 10⁻⁶	Variant QC	Extreme HWE deviation indicates genotyping errors (Anderson 2010; Marees 2018)
MAF	≥ 0.01	Variant QC	Rare variants have low power and higher error sensitivity in GWAS; PCA step uses MAF ≥ 0.05 for stable loadings (Marees 2018)
Inbreeding F	|F| ≤ 0.05	Flag	F-statistic outliers may indicate sample quality or relatedness issues (Turner 2011)

Key references: Anderson et al. 2010 (doi:10.1038/nprot.2010.116), Marees et al. 2018 (doi:10.1002/mpr.1608), Turner et al. 2011 (doi:10.1002/0471142905.hg0119s68).
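The sample-level rows of the threshold table can be read as a small decision rule: pass/fail metrics decide the status, and flag metrics only annotate it. The following is a minimal sketch of that reading, not the pipeline's actual code. The `SampleQC` record, its field names, and `classify_sample` are hypothetical names introduced for illustration, and whether the comparisons are inclusive is assumed from the ≥ and ≤ signs in the table.

```python
from dataclasses import dataclass

# Thresholds taken from the table above.
MIN_CALL_RATE = 0.97   # pass/fail
MAX_LRR_SD = 0.35      # pass/fail
MAX_BAF_SD = 0.15      # flag only


@dataclass
class SampleQC:
    """Hypothetical per-sample record; field names are illustrative."""
    sample_id: str
    call_rate: float
    lrr_sd: float
    baf_sd: float


def classify_sample(s):
    """Return (status, failure_reasons, flags) for one sample.

    Pass/fail metrics decide the status; flag metrics only annotate it.
    """
    failures = []
    if s.call_rate < MIN_CALL_RATE:
        failures.append("call_rate")
    if s.lrr_sd > MAX_LRR_SD:
        failures.append("lrr_sd")
    flags = ["baf_sd"] if s.baf_sd > MAX_BAF_SD else []
    status = "FAIL" if failures else "PASS"
    return status, failures, flags


# Illustrative values only, not taken from the report's samples.
print(classify_sample(SampleQC("EXAMPLE_A", 0.991, 0.21, 0.06)))
print(classify_sample(SampleQC("EXAMPLE_B", 0.952, 0.41, 0.18)))
```

Keeping flags separate from failures mirrors the Type column: a sample with elevated BAF SD still passes but is marked for review.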

📈 Interactive Sample QC Dashboard

📋 Per-Sample QC Details

30 of 30 samples

Sample ID	Call Rate	LRR SD	LRR Mean	LRR Median	BAF SD	Het Rate	Sex	Status	Flags	Failure Reason

🔄 Reclustering Comparison

Comparison Report

==================================================
  Stage 1 → Stage 2 QC Comparison
==================================================

  Metric              Stage 1     Stage 2     Change
  ─────────────────── ─────────── ─────────── ──────────
  Mean Call Rate       0.9841      0.9893      +0.0052
  Mean LRR SD          0.2312      0.2045      -0.0267
  Mean BAF SD          0.0612      0.0534      -0.0078
  Mean Het Rate        0.2489      0.2478      -0.0011

  Variant Missingness  0.0134      0.0098      -0.0036
  HWE failures (1e-6)  1,245       823         -422
  Mean MAF             0.1723      0.1734      +0.0011
  Ti/Tv ratio          2.05        2.07        +0.02

  Reclustering improved call rate for 28/30 samples
  and reduced LRR SD for 26/30 samples.

🧪 Variant QC

Variant-level QC metrics are shown for the full cohort ("All") and for each ancestry group meeting the minimum sample threshold. Ancestry-stratified HWE testing avoids inflated deviation from the Wahlund effect in mixed-ancestry samples (Anderson et al. 2010; Marees et al. 2018). Cross-ancestry pass metrics summarize whether each variant passes call rate, HWE, MAF, and all metrics combined in every ancestry subset analyzed.
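To make the cross-ancestry pass logic concrete, here is a minimal sketch for a single variant. It assumes each ancestry subset has already produced a pass/fail result per metric. The function name `cross_ancestry_pass`, the metric keys, and the input layout are hypothetical and do not reflect the actual columns of collated_variant_qc.tsv.

```python
METRICS = ("call_rate", "hwe", "maf")


def cross_ancestry_pass(per_ancestry):
    """Combine per-ancestry results for one variant.

    per_ancestry maps an ancestry label (e.g. "AFR") to a dict of
    {metric: bool}. A metric passes cross-ancestry only if it passes in
    every ancestry subset; "all" requires every metric to pass.
    """
    if not per_ancestry:
        raise ValueError("at least one ancestry subset is required")
    result = {
        m: all(flags[m] for flags in per_ancestry.values()) for m in METRICS
    }
    result["all"] = all(result[m] for m in METRICS)
    return result


# Illustrative input: the variant fails MAF in one subset only.
example = {
    "AFR": {"call_rate": True, "hwe": True, "maf": True},
    "EAS": {"call_rate": True, "hwe": True, "maf": False},
    "EUR": {"call_rate": True, "hwe": True, "maf": True},
}
print(cross_ancestry_pass(example))
```

Raising on an empty input avoids the vacuous case where `all()` over no subsets would report a pass.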

Variant QC Summary (All Samples)

==================================================
  Variant QC Summary (autosomes)
==================================================

  Input variants:         650,427
  After call-rate filter: 642,185 (≥ 0.98)
  After HWE filter:       639,412 (p ≥ 1e-6)
  After MAF filter:       531,208 (≥ 0.01)

  Ti/Tv ratio:            2.07
  Mean inbreeding F:      0.0012
  Inbreeding F range:     -0.032 to 0.041

  Missingness histogram:
    0.00-0.01:  ████████████████████████  598,127
    0.01-0.02:  ████                       44,058
    0.02-0.05:  ██                          6,923
    0.05-0.10:  ▏                           1,062
    0.10-1.00:  ▏                             257

  MAF distribution:
    0.00-0.01:  ████████████████           108,977
    0.01-0.05:  ████████████████████       132,245
    0.05-0.10:  ██████████████             98,456
    0.10-0.25:  ████████████████████████  164,123
    0.25-0.50:  ████████████████████████  146,626
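The "After ... filter" counts in the summary above are cumulative: each filter is applied to the survivors of the previous one. A minimal sketch of that bookkeeping follows, assuming per-variant call rate, HWE p-value, and MAF have already been computed (for example by plink2). The function name `filter_variants` and the dictionary keys are hypothetical.

```python
def filter_variants(variants, min_call_rate=0.98, min_hwe_p=1e-6, min_maf=0.01):
    """Apply call-rate, HWE, and MAF filters in order.

    variants is a list of dicts with "call_rate", "hwe_p", and "maf" keys
    (illustrative layout). Returns the surviving variants and the number
    remaining after each step.
    """
    counts = {"input": len(variants)}

    kept = [v for v in variants if v["call_rate"] >= min_call_rate]
    counts["after_call_rate"] = len(kept)

    kept = [v for v in kept if v["hwe_p"] >= min_hwe_p]
    counts["after_hwe"] = len(kept)

    kept = [v for v in kept if v["maf"] >= min_maf]
    counts["after_maf"] = len(kept)

    return kept, counts


# Toy input, not taken from the report.
toy = [
    {"id": "v1", "call_rate": 0.999, "hwe_p": 0.40, "maf": 0.21},
    {"id": "v2", "call_rate": 0.950, "hwe_p": 0.80, "maf": 0.30},   # low call rate
    {"id": "v3", "call_rate": 0.995, "hwe_p": 1e-9, "maf": 0.12},   # HWE failure
    {"id": "v4", "call_rate": 0.990, "hwe_p": 0.10, "maf": 0.004},  # rare
]
kept, counts = filter_variants(toy)
print([v["id"] for v in kept], counts)
```

Because the filters are sequential, a variant that fails more than one criterion is counted only at the first step that removes it.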

Cross-Ancestry Variant QC Pass Summary

Not available. This summary requires collated_variant_qc.tsv to include all_ancestries_*_pass columns from ancestry-stratified QC (or at least one qualifying ancestry group).

Ancestry-Stratified QC Summary

======================================================
  Ancestry-Stratified QC Summary
======================================================

Ancestry Groups Analyzed:
  AFR: 10 samples
  EAS: 10 samples
  EUR: 10 samples

Collated variant QC: /tmp/mock_output/ancestry_stratified_qc/collated_variant_qc.tsv

⚥ Sex Check

Interpretation guide: Females (XX) show higher median chrX LRR (near 0) and low chrY LRR (negative, reflecting no chrY). Males (XY) show lower median chrX LRR (slightly negative, hemizygous X) and higher chrY LRR (near 0 or slightly positive). Outliers may indicate sex discordance, sex chromosome aneuploidies (XXY: female-like chrX LRR with male-like chrY LRR; XO: male-like chrX LRR with absent chrY signal), or sample swaps. Samples outside the expected sex-specific clusters warrant manual review.
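The interpretation guide can be sketched as a two-threshold rule on median chrX and chrY LRR. This is an illustration only. The function `infer_sex` and both cutoff values are assumptions chosen for the example. They are not the pipeline's settings, and real cutoffs depend on the array and normalization.

```python
def infer_sex(chrx_lrr_median, chry_lrr_median, x_cutoff=-0.2, y_cutoff=-0.5):
    """Classify a sample as "F", "M", or "REVIEW" from median LRR values.

    x_cutoff and y_cutoff are hypothetical illustration values.
    """
    x_high = chrx_lrr_median > x_cutoff  # consistent with two X copies
    y_high = chry_lrr_median > y_cutoff  # consistent with chrY present

    if x_high and not y_high:
        return "F"
    if y_high and not x_high:
        return "M"
    # Both high (XXY-like) or both low (XO-like): outside the expected
    # sex-specific clusters, so defer to manual review.
    return "REVIEW"


# Illustrative values only.
print(infer_sex(0.01, -1.40))   # female-like cluster
print(infer_sex(-0.45, 0.02))   # male-like cluster
print(infer_sex(0.02, 0.03))    # ambiguous, needs review
```

Returning "REVIEW" rather than guessing keeps possible aneuploidies and sample swaps visible instead of forcing them into one of the two clusters.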

🔬 Peddy QC (Pedigree/Sex/Ancestry)

Peddy validates sex, ancestry, and relatedness from VCF genotypes. Ancestry predictions are colored by predicted population (EUR, AFR, EAS, AMR, SAS). Note: Peddy PCs are projected onto the 1000 Genomes reference panel and are in a separate coordinate space from the pipeline’s own PCA — the two are not directly comparable.

Peddy ancestry PCA projections colored by predicted population (peddy’s own coordinate space). Use “Color by peddy ancestry” in the Ancestry PCA section to recolor the pipeline’s own samples by their peddy ancestry predictions without mixing coordinate spaces.

🌍 Ancestry PCA

Toggle Scatter/Density views, switch 2D/3D, and run customizable k-means clustering for ancestry grouping. “Color by peddy ancestry” recolors the pipeline’s own samples by their peddy-predicted ancestry (no coordinate mixing). “Show peddy PCs” reveals a separate section with peddy’s own PCA plots (peddy coordinate space, not comparable to pipeline PCs).

PCA QC Summary

==================================================
  PCA QC Summary
==================================================

  Input samples:          30
  After call rate ≥ 0.98: 28
  After mind ≤ 0.02:      28

  Input variants:         650,427
  After geno ≤ 0.02:      642,185
  After HWE (1e-6):       639,412
  After MAF ≥ 0.05:       485,234
  After LD pruning:       98,412

  PCs computed: 20
  Eigenvalues: 12.34, 8.56, 5.23, 3.89, 2.67, ...

🧭 Manifest Realignment

Realignment Summary

==================================================
  Manifest Realignment Summary
==================================================

  Manifest:         GSA-24v3-0_A1
  Reference:        chm13v2.0 (T2T-CHM13v2.0)
  Aligner:          bwa 0.7.18

  Total probes:     654,027
  Mapped:           650,427 (99.45%)
  Unmapped:         2,134 (0.33%)
  Multi-mapped:     1,466 (0.22%)

  Position changes: 3,812 (0.58%)
    Same chromosome, different position:  3,701
    Different chromosome:                    111

  Strand changes:   245
  New placements:   89 (previously unmapped)
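The percentages in the realignment summary can be checked with a few lines of arithmetic. This sketch assumes each percentage uses the total probe count as its denominator. The helper name `probe_percentages` is hypothetical.

```python
def probe_percentages(total, counts):
    """Return each category as a percentage of total, rounded to 2 decimals.

    Raises ValueError if the categories do not partition the total.
    """
    if sum(counts.values()) != total:
        raise ValueError("category counts do not sum to the total")
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Counts copied from the realignment summary above.
TOTAL_PROBES = 654_027
mapping = {"mapped": 650_427, "unmapped": 2_134, "multi_mapped": 1_466}

print(probe_percentages(TOTAL_PROBES, mapping))
# Position changes are a subset of probes, not part of the partition.
print(round(100 * 3_812 / TOTAL_PROBES, 2))
```

With the reported counts, the three mapping categories sum to the total and reproduce the 99.45%, 0.33%, and 0.22% shown in the summary. The position-change figure works out to 0.58%.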

🔍 QC Diagnostics

Diagnostic Report

==================================================
  QC Diagnostic Report
==================================================

  ✓ Call rate distribution appears normal
  ✓ LRR SD distribution is within expected range
  ⚠ 2 samples flagged for BAF SD > 0.15 (possible contamination)
  ⚠ 2 samples are heterozygosity rate outliers (± 3 SD)
  ✓ Gender predictions consistent (15 M, 15 F)
  ✓ Ti/Tv ratio 2.07 is within expected range (1.8-2.2)
  ✓ No build mismatch detected between manifest and reference

  Recommendations:
  - Review BAF SD-flagged samples (SAMPLE_008, SAMPLE_019) for
    possible contamination or mosaicism
  - Review het rate outliers (SAMPLE_010, SAMPLE_021) for
    population stratification or sample swaps
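The ± 3 SD rule behind the heterozygosity outlier warning can be sketched as follows. The report does not state whether the pipeline uses the sample or the population standard deviation, so this example assumes the sample SD. The function name `het_rate_outliers` is hypothetical.

```python
from statistics import mean, stdev


def het_rate_outliers(het_rates, n_sd=3.0):
    """Return sorted sample IDs whose het rate lies outside mean ± n_sd * SD.

    het_rates maps sample ID -> heterozygosity rate. Uses the sample
    standard deviation (an assumption; see the note above).
    """
    values = list(het_rates.values())
    if len(values) < 2:
        return []  # SD is undefined for fewer than two samples
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        return []  # all samples identical, nothing can be an outlier
    return sorted(
        sid for sid, h in het_rates.items() if abs(h - mu) > n_sd * sd
    )


# Toy cohort, not taken from the report: 20 typical samples and one extreme.
rates = {f"S{i:02d}": 0.25 for i in range(1, 21)}
rates["S21"] = 0.40
print(het_rate_outliers(rates))
```

In a mixed-ancestry cohort the same function could be applied within each ancestry group, in line with the threshold table's note to interpret heterozygosity against ancestry structure.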

📚 GWAS Methods and Best-Practice Citations

Compiled citation summary for methods, tools, and algorithms used in this pipeline. The same content is also exported as citations_summary.tsv.

Citation	DOI	Cited for	Pipeline application
Anderson et al. 2010, Nat Protoc 5:1564-1573	doi:10.1038/nprot.2010.116	Foundational GWAS sample/variant QC guidance	Supports sample call rate and heterozygosity outlier review; supports variant call rate and HWE filtering guidance.
Marees et al. 2018, Int J Methods Psychiatr Res 27:e1608	doi:10.1002/mpr.1608	Modern practical GWAS QC workflow recommendations	Supports contamination/noise interpretation (BAF SD, heterozygosity), variant QC thresholds, and stricter PCA MAF filtering for stability.
Turner et al. 2011, Curr Protoc Hum Genet Unit 1.19	doi:10.1002/0471142905.hg0119s68	Array-focused genotype QC and filtering practice	Supports LRR/BAF quality interpretation, variant missingness thresholds, and inbreeding/relatedness flagging context.
Danecek et al. 2021, Gigascience 10(2):giab008	doi:10.1093/gigascience/giab008	bcftools framework for variant processing	Supports IDAT-to-GTC/VCF processing and downstream variant manipulations.
Li and Durbin 2009, Bioinformatics 25(14):1754-1760	doi:10.1093/bioinformatics/btp324	BWA alignment algorithm	Supports manifest probe flank realignment to the selected reference genome.
Chang et al. 2015, Gigascience 4:7	doi:10.1186/s13742-015-0047-8	PLINK analytical framework for large-scale genotype QC	Supports variant-level QC metrics, HWE/missingness/MAF filtering, and LD pruning in ancestry PCA preparation.
Abraham et al. 2017, Bioinformatics 33(17):2776-2778	doi:10.1093/bioinformatics/btx299	flashpca2 randomized PCA implementation	Supports computationally efficient ancestry PCA for large genotype matrices.
Pedersen and Quinlan 2017, Am J Hum Genet 100(3):406-413	doi:10.1016/j.ajhg.2017.01.017	peddy pedigree/sex/ancestry QC	Supports automated sex check, ancestry prediction, and relatedness validation from VCF genotypes via the peddy tool.
Genovese et al. 2024, Bioinformatics 40(2):btae038	doi:10.1093/bioinformatics/btae038	bcftools/liftover assembly-coordinate conversion	Supports coordinate conversion from non-GRCh38 builds to GRCh38 for peddy site matching via bcftools +liftover.
Peterson et al. 2019, Am J Hum Genet 105(5):921-935	doi:10.1016/j.ajhg.2019.09.022	Within-ancestry PCA and ancestry-stratified QC	Supports ancestry-stratified variant QC and within-ancestry PCA to resolve finer population structure and avoid Wahlund effect in HWE testing across mixed-ancestry cohorts.

📝 Methods Text (for publications)

Genotyping intensity data (IDAT files) were processed using the Illumina IDAT Processing Pipeline. Raw IDAT files were converted to GTC format using the GenCall algorithm (bcftools +idat2gtc), then to VCF format with B Allele Frequency (BAF) and Log R Ratio (LRR) intensities (bcftools +gtc2vcf). Probe coordinates were validated by realigning CSV manifest flank sequences against the CHM13 reference genome using BWA-MEM.

A two-stage genotyping approach was employed. Stage 1 genotype calls were generated using manufacturer-provided EGT cluster definitions. Per-sample QC metrics (call rate, LRR standard deviation, BAF standard deviation at heterozygous sites, and heterozygosity rate) were computed on autosomal variants. Following common GWAS QC guidance (Anderson et al., 2010; Marees et al., 2018; Turner et al., 2011), high-quality samples (call rate ≥ 0.97 and LRR SD ≤ 0.35; n = 24 of 30) were used to recompute study-specific genotype cluster definitions (EGT file) in Stage 2. All samples were then re-genotyped using the study-specific clusters, with additional BAF/LRR median adjustment (--adjust-clusters).

Sample ancestry was predicted using peddy (Pedersen and Quinlan, 2017), which projects samples onto 1000 Genomes principal components for ancestry classification. Ancestry principal components were computed using stringent variant QC (missingness < 2%, HWE p ≥ 1e-6, MAF ≥ 5%), LD pruning (window = 1000 kb, step = 1, r² < 0.1), and flashpca2. This stricter PCA MAF threshold was used to improve loading stability.

For each ancestry group with sufficient sample size (≥ 100 by default), ancestry-stratified variant QC (missingness, HWE, allele frequency) was performed to avoid inflated HWE deviation due to the Wahlund effect in mixed-ancestry samples (Anderson et al., 2010). Within-ancestry PCA was also computed to resolve finer population structure masked in multi-ancestry PCA (Peterson et al., 2019). Ancestry-specific PCs and variant QC metrics were collated alongside the full-cohort results, including cross-ancestry pass flags for call rate, HWE, MAF, and overall QC.

Variant-level QC included per-variant missingness, Hardy-Weinberg equilibrium tests (mid-p adjustment), allele frequency estimation, per-sample inbreeding coefficients (F statistic), and transition/transversion ratio computation (plink2). After processing, the mean call rate was 0.9813 and the mean LRR SD was 0.2072.

🛠️ Tool Versions

bcftools: not installed
plink2: not installed
bwa: not installed
flashpca: not installed
peddy: not installed
