🧬 Illumina IDAT Processing Pipeline Report

Generated: 2026-03-30 17:49:01

Example Output (Mock Data): This report is an example generated from deterministic mock data for demonstration purposes only.

📊 Overview

  Total Samples:      30
  Passing QC:         24
  Failing QC:         6
  Male / Female:      15 / 15
  Mean Call Rate:     0.9813
  Mean LRR SD:        0.2072
  BAF SD Flags:       2
  Het Rate Outliers:  2
  QC Pass Rate:       80.0% (24 of 30)

Summary Statistics (Stage 2)

  Metric                  Mean    Median        SD       Min       Max
  ─────────────────── ──────── ───────── ───────── ───────── ─────────
  Call Rate             0.9813    0.9819    0.0141    0.9388    0.9998
  LRR SD                0.2072    0.1993    0.0838    0.0825    0.4123
  LRR Mean             -0.0017   -0.0013    0.0107   -0.0188    0.0151
  LRR Median           -0.0021   -0.0002    0.0117   -0.0216    0.0184
  BAF SD (het sites)    0.0636    0.0573    0.0363    0.0205    0.1823
  Heterozygosity Rate   0.2503    0.2477    0.0301    0.1534    0.3412

Stage 1 → Stage 2 Comparison

  Metric      Stage 1 Mean  Stage 2 Mean  Change
  ─────────── ───────────── ───────────── ────────
  Call Rate   0.9813        0.9813        +0.0000
  LRR SD      0.2072        0.2072        +0.0000
  BAF SD      0.0636        0.0636        +0.0000
  Het Rate    0.2503        0.2503        +0.0000

🎯 GWAS QC Best Practice Thresholds

Reference thresholds based on Anderson et al. (2010), Marees et al. (2018), and Turner et al. (2011). These defaults are used to highlight samples and variants requiring attention, and should be interpreted with study design, ancestry composition, and downstream analysis goals in mind.

  Metric               Threshold        Type        Rationale / Evidence
  ──────────────────── ──────────────── ─────────── ────────────────────────────────────────
  Sample Call Rate     ≥ 0.97           Pass/Fail   Low call rate indicates poor DNA quality or failed hybridization (Anderson 2010; Marees 2018)
  LRR SD               ≤ 0.35           Pass/Fail   High LRR SD indicates noisy intensity data unsuitable for CNV/GWAS (Turner 2011)
  BAF SD (het sites)   ≤ 0.15           Flag        Elevated BAF SD may indicate contamination or noisy heterozygous intensity clusters (Turner 2011; Marees 2018)
  Heterozygosity Rate  Within ± 3 SD    Flag        Outliers may indicate contamination (high) or inbreeding (low); interpret against cohort ancestry structure (Anderson 2010; Marees 2018)
  Variant Call Rate    ≥ 0.98           Variant QC  Poorly genotyped variants produce spurious associations (Anderson 2010; Turner 2011)
  HWE p-value          ≥ 1 × 10⁻⁶       Variant QC  Extreme HWE deviation indicates genotyping errors (Anderson 2010; Marees 2018)
  MAF                  ≥ 0.01           Variant QC  Rare variants have low power and higher error sensitivity in GWAS; PCA step uses MAF ≥ 0.05 for stable loadings (Marees 2018)
  Inbreeding F         |F| ≤ 0.05       Flag        F-statistic outliers may indicate sample quality or relatedness issues (Turner 2011)

Key references: Anderson et al. 2010 (doi:10.1038/nprot.2010.116), Marees et al. 2018 (doi:10.1002/mpr.1608), Turner et al. 2011 (doi:10.1002/0471142905.hg0119s68).
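
The sample-level rows of the threshold table can be read as a small decision rule: call rate and LRR SD decide pass or fail, while BAF SD and heterozygosity only raise flags. The sketch below is one way to express that rule. The `SampleQC` record and the `classify` function are hypothetical names introduced for illustration and are not taken from the pipeline; the heterozygosity check uses the cohort mean and sample standard deviation, which is an assumption about how "within ± 3 SD" is evaluated.

```python
from dataclasses import dataclass, field
from statistics import mean, stdev

# Thresholds copied from the table above.
MIN_CALL_RATE = 0.97   # pass/fail
MAX_LRR_SD = 0.35      # pass/fail
MAX_BAF_SD = 0.15      # flag only
HET_SD_LIMIT = 3.0     # flag only, in cohort standard deviations


@dataclass
class SampleQC:
    # Hypothetical record layout; the pipeline's own column names may differ.
    sample_id: str
    call_rate: float
    lrr_sd: float
    baf_sd: float
    het_rate: float
    failures: list = field(default_factory=list)
    flags: list = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # A sample passes when no pass/fail threshold was violated.
        return not self.failures


def classify(samples):
    """Fill in failures and flags for each sample in place (needs >= 2 samples)."""
    het_mean = mean(s.het_rate for s in samples)
    het_sd = stdev(s.het_rate for s in samples)
    for s in samples:
        s.failures = []
        s.flags = []
        if s.call_rate < MIN_CALL_RATE:
            s.failures.append("call_rate")
        if s.lrr_sd > MAX_LRR_SD:
            s.failures.append("lrr_sd")
        if s.baf_sd > MAX_BAF_SD:
            s.flags.append("baf_sd")
        if het_sd > 0 and abs(s.het_rate - het_mean) > HET_SD_LIMIT * het_sd:
            s.flags.append("het_outlier")
    return samples
```

Keeping failures and flags in separate lists mirrors the Pass/Fail versus Flag distinction in the table: a flagged sample still counts as passing and is only marked for review.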

📈 Interactive Sample QC Dashboard

📋 Per-Sample QC Details

30 of 30 samples
  Sample ID  Call Rate  LRR SD  LRR Mean  LRR Median  BAF SD  Het Rate  Sex  Status  Flags  Failure Reason

🔄 Reclustering Comparison

Comparison Report

==================================================
  Stage 1 → Stage 2 QC Comparison
==================================================

  Metric              Stage 1     Stage 2     Change
  ─────────────────── ─────────── ─────────── ──────────
  Mean Call Rate       0.9841      0.9893      +0.0052
  Mean LRR SD          0.2312      0.2045      -0.0267
  Mean BAF SD          0.0612      0.0534      -0.0078
  Mean Het Rate        0.2489      0.2478      -0.0011

  Variant Missingness  0.0134      0.0098      -0.0036
  HWE failures (1e-6)  1,245       823         -422
  Mean MAF             0.1723      0.1734      +0.0011
  Ti/Tv ratio          2.05        2.07        +0.02

  Reclustering improved call rate for 28/30 samples
  and reduced LRR SD for 26/30 samples.
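
The Change column in the comparison above is the Stage 2 value minus the Stage 1 value. The sketch below recomputes it from the four sample-level means printed in the report; `stage_changes` is a hypothetical helper name, and rounding to four decimals is an assumption chosen to match the report's display precision.

```python
# Sample-level means copied from the comparison report above.
stage1 = {
    "Mean Call Rate": 0.9841,
    "Mean LRR SD": 0.2312,
    "Mean BAF SD": 0.0612,
    "Mean Het Rate": 0.2489,
}
stage2 = {
    "Mean Call Rate": 0.9893,
    "Mean LRR SD": 0.2045,
    "Mean BAF SD": 0.0534,
    "Mean Het Rate": 0.2478,
}


def stage_changes(before, after, digits=4):
    """Return after - before for every metric, rounded for display."""
    return {metric: round(after[metric] - before[metric], digits) for metric in before}


changes = stage_changes(stage1, stage2)
for metric, delta in changes.items():
    # Signed formatting (+/-) matches the Change column of the report.
    print(f"{metric:<16}{stage1[metric]:>10.4f}{stage2[metric]:>10.4f}{delta:>+10.4f}")
```

A positive change is an improvement for call rate, while a negative change is an improvement for LRR SD and BAF SD, so the sign has to be read per metric.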

🧪 Variant QC

Variant-level QC metrics are shown for the full cohort ("All") and for each ancestry group meeting the minimum sample threshold. Ancestry-stratified HWE testing avoids inflated deviation from the Wahlund effect in mixed-ancestry samples (Anderson et al. 2010; Marees et al. 2018). Cross-ancestry pass metrics summarize whether each variant passes call rate, HWE, MAF, and all metrics combined in every ancestry subset analyzed.
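
The Wahlund effect mentioned above can be illustrated with a small calculation. When two groups that are each in Hardy-Weinberg equilibrium but have different allele frequencies are pooled, the pooled sample shows fewer heterozygotes than HWE predicts from the pooled allele frequency. The function names and the example frequencies below are illustrative assumptions, not values from this report.

```python
def hwe_het(p):
    """Expected heterozygote frequency 2p(1-p) under HWE for allele frequency p."""
    return 2 * p * (1 - p)


def pooled_heterozygosity(subgroups):
    """Return (observed, expected) heterozygote frequency for a pooled sample.

    subgroups is a list of (allele_frequency, sample_count) pairs, and each
    subgroup is assumed to be in HWE on its own.
    """
    total = sum(n for _, n in subgroups)
    pooled_p = sum(p * n for p, n in subgroups) / total
    observed = sum(hwe_het(p) * n for p, n in subgroups) / total
    expected = hwe_het(pooled_p)
    return observed, expected


# Two equally sized hypothetical groups with very different allele frequencies.
observed, expected = pooled_heterozygosity([(0.1, 100), (0.9, 100)])
print(f"observed het = {observed:.2f}, expected under pooled HWE = {expected:.2f}")
```

The gap between the two numbers is the heterozygote deficit that a pooled HWE test would read as deviation, which is why testing within each ancestry group avoids the inflation.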

Variant QC Summary (All Samples)

==================================================
  Variant QC Summary (autosomes)
==================================================

  Input variants:         650,427
  After call-rate filter: 642,185 (≥ 0.98)
  After HWE filter:       639,412 (p ≥ 1e-6)
  After MAF filter:       531,208 (≥ 0.01)

  Ti/Tv ratio:            2.07
  Mean inbreeding F:      0.0012
  Inbreeding F range:     -0.032 to 0.041

  Missingness histogram:
    0.00-0.01:  ████████████████████████  598,127
    0.01-0.02:  ████                       44,058
    0.02-0.05:  ██                          6,923
    0.05-0.10:  ▏                           1,062
    0.10-1.00:  ▏                             257

  MAF distribution:
    0.00-0.01:  ████████████████          108,977
    0.01-0.05:  ████████████████████      132,245
    0.05-0.10:  ██████████████             98,456
    0.10-0.25:  ████████████████████████  164,123
    0.25-0.50:  ████████████████████████  146,626

Cross-Ancestry Variant QC Pass Summary

Not available. This summary requires collated_variant_qc.tsv to include all_ancestries_*_pass columns from ancestry-stratified QC (or at least one qualifying ancestry group).

Ancestry-Stratified QC Summary

======================================================
  Ancestry-Stratified QC Summary
======================================================

Ancestry Groups Analyzed:
  AFR: 10 samples
  EAS: 10 samples
  EUR: 10 samples

Collated variant QC: /tmp/mock_output/ancestry_stratified_qc/collated_variant_qc.tsv

⚥ Sex Check

Interpretation guide: Females (XX) show higher median chrX LRR (near 0) and low chrY LRR (negative, reflecting no chrY). Males (XY) show lower median chrX LRR (slightly negative, hemizygous X) and higher chrY LRR (near 0 or slightly positive). Outliers may indicate sex discordance, sex chromosome aneuploidies (XXY: high X and Y; XO: very low X with absent Y signal), or sample swaps. Samples outside the expected sex-specific clusters warrant manual review.
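
The interpretation guide above amounts to a two-axis rule on median chrX and chrY LRR. The sketch below encodes that rule for illustration only. The cut-off values `x_cut` and `y_cut` are hypothetical placeholders, not thresholds used by the pipeline, and any sample landing in a "review" category would still need the manual review described above.

```python
def infer_sex(median_chrx_lrr, median_chry_lrr, x_cut=-0.2, y_cut=-1.0):
    """Classify a sample from median chrX / chrY LRR.

    x_cut and y_cut are illustrative assumptions; real cut-offs should be
    chosen from the cohort's own cluster positions.
    """
    x_high = median_chrx_lrr > x_cut       # near 0: consistent with two X copies
    y_present = median_chry_lrr > y_cut    # near 0: consistent with a Y chromosome
    if x_high and not y_present:
        return "F"
    if not x_high and y_present:
        return "M"
    if x_high and y_present:
        return "review: possible XXY"      # high X and Y
    return "review: possible XO"           # low X with absent Y signal


print(infer_sex(0.01, -2.3))   # typical XX pattern
print(infer_sex(-0.45, 0.05))  # typical XY pattern
```

A fixed cut-off is the simplest choice; clustering the two medians per cohort would be more robust to batch shifts in LRR.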

🔬 Peddy QC (Pedigree/Sex/Ancestry)

Peddy validates sex, ancestry, and relatedness from VCF genotypes. Ancestry predictions are colored by predicted population (EUR, AFR, EAS, AMR, SAS). Note: Peddy PCs are projected onto the 1000 Genomes reference panel and are in a separate coordinate space from the pipeline’s own PCA — the two are not directly comparable.

Peddy ancestry PCA projections colored by predicted population (peddy’s own coordinate space). Use “Color by peddy ancestry” in the Ancestry PCA section to recolor the pipeline’s own samples by their peddy ancestry predictions without mixing coordinate spaces.

🌍 Ancestry PCA

No clusters computed.
Toggle Scatter/Density views, switch 2D/3D, and run customizable k-means clustering for ancestry grouping. “Color by peddy ancestry” recolors the pipeline’s own samples by their peddy-predicted ancestry (no coordinate mixing). “Show peddy PCs” reveals a separate section with peddy’s own PCA plots (peddy coordinate space, not comparable to pipeline PCs).

PCA QC Summary

==================================================
  PCA QC Summary
==================================================

  Input samples:          30
  After call rate ≥ 0.98: 28
  After mind ≤ 0.02:      28

  Input variants:         650,427
  After geno ≤ 0.02:      642,185
  After HWE (1e-6):       639,412
  After MAF ≥ 0.05:       485,234
  After LD pruning:       98,412

  PCs computed: 20
  Eigenvalues: 12.34, 8.56, 5.23, 3.89, 2.67, ...

🧭 Manifest Realignment

Realignment Summary

==================================================
  Manifest Realignment Summary
==================================================

  Manifest:         GSA-24v3-0_A1
  Reference:        chm13v2.0 (T2T-CHM13v2.0)
  Aligner:          bwa 0.7.18

  Total probes:     654,027
  Mapped:           650,427 (99.45%)
  Unmapped:         2,134 (0.33%)
  Multi-mapped:     1,466 (0.22%)

  Position changes: 3,812 (0.58%)
    Same chromosome, different position:  3,701
    Different chromosome:                    111

  Strand changes:   245
  New placements:   89 (previously unmapped)
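
The counts in the realignment summary can be cross-checked with simple arithmetic: the three placement categories should add up to the total probe count, and the two position-change categories should add up to the reported total. The sketch below does that using the numbers printed above. The variable and function names are hypothetical, and using total probes as the denominator for every percentage is an inference from the reported values (it reproduces the 0.58% shown for position changes).

```python
TOTAL_PROBES = 654_027

# Counts copied from the realignment summary above.
placement = {"Mapped": 650_427, "Unmapped": 2_134, "Multi-mapped": 1_466}
position_changes = {
    "Same chromosome, different position": 3_701,
    "Different chromosome": 111,
}
POSITION_CHANGES_TOTAL = 3_812


def percent(count, total=TOTAL_PROBES):
    """Percentage of total probes, rounded to two decimals as in the report."""
    return round(100 * count / total, 2)


# Consistency checks: the categories should partition their totals.
placement_ok = sum(placement.values()) == TOTAL_PROBES
changes_ok = sum(position_changes.values()) == POSITION_CHANGES_TOTAL

for label, count in placement.items():
    print(f"{label:<14}{count:>9,} ({percent(count):.2f}%)")
print(f"Position changes {POSITION_CHANGES_TOTAL:,} ({percent(POSITION_CHANGES_TOTAL):.2f}%)")
```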

🔍 QC Diagnostics

Diagnostic Report

==================================================
  QC Diagnostic Report
==================================================

  ✓ Call rate distribution appears normal
  ✓ LRR SD distribution is within expected range
  ⚠ 2 samples flagged for BAF SD > 0.15 (possible contamination)
  ⚠ 2 samples are heterozygosity rate outliers (± 3 SD)
  ✓ Gender predictions consistent (15 M, 15 F)
  ✓ Ti/Tv ratio 2.07 is within expected range (1.8-2.2)
  ✓ No build mismatch detected between manifest and reference

  Recommendations:
  - Review BAF SD-flagged samples (SAMPLE_008, SAMPLE_019) for
    possible contamination or mosaicism
  - Review het rate outliers (SAMPLE_010, SAMPLE_021) for
    population stratification or sample swaps
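
The Ti/Tv check in the diagnostic report compares the ratio of transitions (purine to purine or pyrimidine to pyrimidine changes) to transversions against an expected range. The sketch below shows how such a ratio can be computed from reference/alternate allele pairs. It is not the pipeline's implementation; `titv_ratio` and `within_expected_range` are hypothetical names, and the 1.8 to 2.2 range is the one quoted in the report.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True when both alleles are purines or both are pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(snps):
    """Compute Ti/Tv from (ref, alt) pairs, skipping anything that is not a biallelic SNP."""
    transitions = transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue  # indels or non-variant records
        if ref not in "ACGT" or alt not in "ACGT":
            continue  # ambiguous bases
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions observed; Ti/Tv is undefined")
    return transitions / transversions


def within_expected_range(ratio, low=1.8, high=2.2):
    """Range check using the bounds quoted in the diagnostic report."""
    return low <= ratio <= high


example = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("AT", "A")]
print(titv_ratio(example))
```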

📚 GWAS Methods and Best-Practice Citations

Compiled citation summary for methods, tools, and algorithms used in this pipeline. The same content is also exported as citations_summary.tsv.

Citation: Anderson et al. 2010, Nat Protoc 5:1564-1573
  DOI: doi:10.1038/nprot.2010.116
  Cited for: Foundational GWAS sample/variant QC guidance
  Pipeline application: Supports sample call rate and heterozygosity outlier review; supports variant call rate and HWE filtering guidance.

Citation: Marees et al. 2018, Int J Methods Psychiatr Res 27:e1608
  DOI: doi:10.1002/mpr.1608
  Cited for: Modern practical GWAS QC workflow recommendations
  Pipeline application: Supports contamination/noise interpretation (BAF SD, heterozygosity), variant QC thresholds, and stricter PCA MAF filtering for stability.

Citation: Turner et al. 2011, Curr Protoc Hum Genet Unit 1.19
  DOI: doi:10.1002/0471142905.hg0119s68
  Cited for: Array-focused genotype QC and filtering practice
  Pipeline application: Supports LRR/BAF quality interpretation, variant missingness thresholds, and inbreeding/relatedness flagging context.

Citation: Danecek et al. 2021, Gigascience 10(2):giab008
  DOI: doi:10.1093/gigascience/giab008
  Cited for: bcftools framework for variant processing
  Pipeline application: Supports IDAT-to-GTC/VCF processing and downstream variant manipulations.

Citation: Li and Durbin 2009, Bioinformatics 25(14):1754-1760
  DOI: doi:10.1093/bioinformatics/btp324
  Cited for: BWA alignment algorithm
  Pipeline application: Supports manifest probe flank realignment to the selected reference genome.

Citation: Chang et al. 2015, Gigascience 4:7
  DOI: doi:10.1186/s13742-015-0047-8
  Cited for: PLINK analytical framework for large-scale genotype QC
  Pipeline application: Supports variant-level QC metrics, HWE/missingness/MAF filtering, and LD pruning in ancestry PCA preparation.

Citation: Abraham et al. 2017, Bioinformatics 33(17):2776-2778
  DOI: doi:10.1093/bioinformatics/btx299
  Cited for: flashpca2 randomized PCA implementation
  Pipeline application: Supports computationally efficient ancestry PCA for large genotype matrices.

Citation: Pedersen and Quinlan 2017, Am J Hum Genet 100(3):406-413
  DOI: doi:10.1016/j.ajhg.2017.01.017
  Cited for: peddy pedigree/sex/ancestry QC
  Pipeline application: Supports automated sex check, ancestry prediction, and relatedness validation from VCF genotypes via the peddy tool.

Citation: Genovese et al. 2024, Bioinformatics 40(2):btae038
  DOI: doi:10.1093/bioinformatics/btae038
  Cited for: bcftools/liftover assembly-coordinate conversion
  Pipeline application: Supports coordinate conversion from non-GRCh38 builds to GRCh38 for peddy site matching via bcftools +liftover.

Citation: Peterson et al. 2019, Am J Hum Genet 105(5):921-935
  DOI: doi:10.1016/j.ajhg.2019.09.022
  Cited for: Within-ancestry PCA and ancestry-stratified QC
  Pipeline application: Supports ancestry-stratified variant QC and within-ancestry PCA to resolve finer population structure and avoid Wahlund effect in HWE testing across mixed-ancestry cohorts.

📝 Methods Text (for publications)

Genotyping intensity data (IDAT files) were processed using the Illumina IDAT Processing Pipeline. Raw IDAT files were converted to GTC format using the GenCall algorithm (bcftools +idat2gtc), then to VCF format with B Allele Frequency (BAF) and Log R Ratio (LRR) intensities (bcftools +gtc2vcf). Probe coordinates were validated by realigning CSV manifest flank sequences against the CHM13 reference genome using BWA-MEM.

A two-stage genotyping approach was employed. Stage 1 genotype calls were generated using manufacturer-provided EGT cluster definitions. Per-sample QC metrics (call rate, LRR standard deviation, BAF standard deviation at heterozygous sites, and heterozygosity rate) were computed on autosomal variants. Following common GWAS QC guidance (Anderson et al., 2010; Marees et al., 2018; Turner et al., 2011), high-quality samples (call rate ≥ 0.97 and LRR SD ≤ 0.35; n = 24 of 30) were used to recompute study-specific genotype cluster definitions (EGT file) in Stage 2. All samples were then re-genotyped using the study-specific clusters, with additional BAF/LRR median adjustment (--adjust-clusters).

Sample ancestry was predicted using peddy (Pedersen and Quinlan, 2017), which projects samples onto 1000 Genomes principal components for ancestry classification. Ancestry principal components were computed using stringent variant QC (missingness < 2%, HWE p ≥ 1e-6, MAF ≥ 5%), LD pruning (window = 1000 kb, step = 1, r² < 0.1), and flashpca2. This stricter PCA MAF threshold was used to improve loading stability.

For each ancestry group with sufficient sample size (≥ 100 by default), ancestry-stratified variant QC (missingness, HWE, allele frequency) was performed to avoid inflated HWE deviation due to the Wahlund effect in mixed-ancestry samples (Anderson et al., 2010). Within-ancestry PCA was also computed to resolve finer population structure masked in multi-ancestry PCA (Peterson et al., 2019). Ancestry-specific PCs and variant QC metrics are collated alongside the full-cohort results, including cross-ancestry pass flags for call rate, HWE, MAF, and overall QC.

Variant-level QC included per-variant missingness, Hardy-Weinberg equilibrium tests (mid-p adjustment), allele frequency estimation, per-sample inbreeding coefficients (F statistic), and transition/transversion ratio computation (plink2). After processing, the mean call rate was 0.9813 and the mean LRR SD was 0.2072.
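
The per-sample inbreeding coefficient mentioned in the methods text is commonly estimated with a method-of-moments formula, F = (O_hom - E_hom) / (N - E_hom), where O_hom is the observed number of homozygous genotypes, E_hom is the number expected under HWE, and N is the number of sites with a genotype. The sketch below implements this simplified form. It is not plink2's exact estimator (which may apply additional corrections), and `inbreeding_f` and `f_flagged` are hypothetical names; the 0.05 flag limit comes from the thresholds table in this report.

```python
def inbreeding_f(genotypes, alt_freqs):
    """Method-of-moments inbreeding coefficient for one sample.

    genotypes: alternate-allele counts per site (0, 1, 2), or None when missing.
    alt_freqs: alternate-allele frequency per site, in the same order.
    """
    n_sites = 0
    observed_hom = 0
    expected_hom = 0.0
    for g, p in zip(genotypes, alt_freqs):
        if g is None or not 0.0 < p < 1.0:
            continue  # skip missing calls and monomorphic sites
        n_sites += 1
        observed_hom += 1 if g != 1 else 0
        expected_hom += 1.0 - 2.0 * p * (1.0 - p)
    if n_sites == 0 or n_sites == expected_hom:
        raise ValueError("no informative sites")
    return (observed_hom - expected_hom) / (n_sites - expected_hom)


def f_flagged(f_value, limit=0.05):
    """Flag rule |F| > limit, using the limit from the thresholds table."""
    return abs(f_value) > limit


freqs = [0.5, 0.5, 0.5, 0.5]
print(inbreeding_f([0, 1, 2, 1], freqs))  # half the sites homozygous at p = 0.5
```

Positive values indicate an excess of homozygous genotypes relative to HWE, and negative values indicate an excess of heterozygous genotypes, which is why both tails are flagged.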

🛠️ Tool Versions

  • bcftools: not installed
  • plink2: not installed
  • bwa: not installed
  • flashpca: not installed
  • peddy: /opt/hostedtoolcache/Python/3.11.15/x64/bin/python3: No module named peddy