Generated: 2026-03-30 17:49:01
| Metric | Mean | Median | SD | Min | Max |
|---|---|---|---|---|---|
| Call Rate | 0.9813 | 0.9819 | 0.0141 | 0.9388 | 0.9998 |
| LRR SD | 0.2072 | 0.1993 | 0.0838 | 0.0825 | 0.4123 |
| LRR Mean | -0.0017 | -0.0013 | 0.0107 | -0.0188 | 0.0151 |
| LRR Median | -0.0021 | -0.0002 | 0.0117 | -0.0216 | 0.0184 |
| BAF SD (het sites) | 0.0636 | 0.0573 | 0.0363 | 0.0205 | 0.1823 |
| Heterozygosity Rate | 0.2503 | 0.2477 | 0.0301 | 0.1534 | 0.3412 |
| Metric | Stage 1 Mean | Stage 2 Mean | Change |
|---|---|---|---|
| Call Rate | 0.9813 | 0.9813 | +0.0000 |
| LRR SD | 0.2072 | 0.2072 | +0.0000 |
| BAF SD | 0.0636 | 0.0636 | +0.0000 |
| Het Rate | 0.2503 | 0.2503 | +0.0000 |
Reference thresholds based on Anderson et al. (2010), Marees et al. (2018), and Turner et al. (2011). These defaults are used to highlight samples and variants requiring attention, and should be interpreted with study design, ancestry composition, and downstream analysis goals in mind.
| Metric | Threshold | Type | Rationale / Evidence |
|---|---|---|---|
| Sample Call Rate | ≥ 0.97 | Pass/Fail | Low call rate indicates poor DNA quality or failed hybridization (Anderson 2010; Marees 2018) |
| LRR SD | ≤ 0.35 | Pass/Fail | High LRR SD indicates noisy intensity data unsuitable for CNV/GWAS (Turner 2011) |
| BAF SD (het sites) | ≤ 0.15 | Flag | Elevated BAF SD may indicate contamination or noisy heterozygous intensity clusters (Turner 2011; Marees 2018) |
| Heterozygosity Rate | Within ± 3 SD | Flag | Outliers may indicate contamination (high) or inbreeding (low); interpret against cohort ancestry structure (Anderson 2010; Marees 2018) |
| Variant Call Rate | ≥ 0.98 | Variant QC | Poorly genotyped variants produce spurious associations (Anderson 2010; Turner 2011) |
| HWE p-value | ≥ 1 × 10⁻⁶ | Variant QC | Extreme HWE deviation indicates genotyping errors (Anderson 2010; Marees 2018) |
| MAF | ≥ 0.01 | Variant QC | Rare variants have low power and higher error sensitivity in GWAS; PCA step uses MAF ≥ 0.05 for stable loadings (Marees 2018) |
| Inbreeding F | \|F\| ≤ 0.05 | Flag | F-statistic outliers may indicate sample quality or relatedness issues (Turner 2011) |
Key references: Anderson et al. 2010 (doi:10.1038/nprot.2010.116), Marees et al. 2018 (doi:10.1002/mpr.1608), Turner et al. 2011 (doi:10.1002/0471142905.hg0119s68).
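As a rough illustration of how the sample-level thresholds in the table could be applied, the sketch below assigns a pass/fail status and a list of flags to each sample. The function name, dictionary keys, and output layout are hypothetical and are not taken from the pipeline; only the threshold values come from the table. Note that a ± 3 SD heterozygosity rule can only trigger in cohorts of roughly 11 or more samples, because the largest possible z-score in a sample of size n is (n - 1)/√n.

```python
from statistics import mean, stdev

# Threshold values copied from the reference table; names are illustrative.
MIN_CALL_RATE = 0.97   # Pass/Fail
MAX_LRR_SD = 0.35      # Pass/Fail
MAX_BAF_SD = 0.15      # Flag only
HET_SD_LIMIT = 3.0     # Flag only, relative to the cohort mean


def classify_samples(samples):
    """Return {sample_id: {"status", "failures", "flags"}}.

    `samples` is a list of dicts with the hypothetical keys
    id, call_rate, lrr_sd, baf_sd and het_rate.
    """
    het_rates = [s["het_rate"] for s in samples]
    het_mean = mean(het_rates)
    # stdev needs at least two values; fall back to 0 (no outlier test).
    het_sd = stdev(het_rates) if len(het_rates) > 1 else 0.0

    results = {}
    for s in samples:
        failures = []
        if s["call_rate"] < MIN_CALL_RATE:
            failures.append("call_rate")
        if s["lrr_sd"] > MAX_LRR_SD:
            failures.append("lrr_sd")

        flags = []
        if s["baf_sd"] > MAX_BAF_SD:
            flags.append("baf_sd")
        if het_sd > 0 and abs(s["het_rate"] - het_mean) > HET_SD_LIMIT * het_sd:
            flags.append("het_outlier")

        results[s["id"]] = {
            "status": "FAIL" if failures else "PASS",
            "failures": failures,
            "flags": flags,
        }
    return results
```

Flags are kept separate from failures so that a flagged sample can still pass, matching the Pass/Fail versus Flag distinction in the table.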
| Sample ID | Call Rate | LRR SD | LRR Mean | LRR Median | BAF SD | Het Rate | Sex | Status | Flags | Failure Reason |
|---|---|---|---|---|---|---|---|---|---|---|
==================================================
Stage 1 → Stage 2 QC Comparison
==================================================
Metric               Stage 1      Stage 2      Change
───────────────────  ───────────  ───────────  ──────────
Mean Call Rate       0.9841       0.9893       +0.0052
Mean LRR SD          0.2312       0.2045       -0.0267
Mean BAF SD          0.0612       0.0534       -0.0078
Mean Het Rate        0.2489       0.2478       -0.0011
Variant Missingness  0.0134       0.0098       -0.0036
HWE failures (1e-6)  1,245        823          -422
Mean MAF             0.1723       0.1734       +0.0011
Ti/Tv ratio          2.05         2.07         +0.02
Reclustering improved call rate for 28/30 samples and reduced LRR SD for 26/30 samples.
Variant-level QC metrics are shown for the full cohort ("All") and for each ancestry group meeting the minimum sample threshold. Ancestry-stratified HWE testing avoids inflated deviation from the Wahlund effect in mixed-ancestry samples (Anderson et al. 2010; Marees et al. 2018). Cross-ancestry pass metrics summarize whether each variant passes call rate, HWE, MAF, and all metrics combined in every ancestry subset analyzed.
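The Wahlund effect mentioned above can be illustrated with a small worked example: two groups that are each in Hardy-Weinberg equilibrium show a heterozygote deficit once pooled, if their allele frequencies differ. The sketch below is illustrative only; the function names and the example frequencies are assumptions, not pipeline code.

```python
def hwe_genotype_freqs(p):
    """Expected (AA, AB, BB) genotype frequencies under HWE for allele-A frequency p."""
    q = 1.0 - p
    return (p * p, 2.0 * p * q, q * q)


def pooled_heterozygosity(p1, p2, w1=0.5):
    """Observed vs. HWE-expected heterozygosity after pooling two groups.

    Each group is assumed to be in HWE on its own, with allele-A
    frequencies p1 and p2; w1 is the fraction of samples from group 1.
    Returns (observed_het, expected_het_under_pooled_hwe).
    """
    w2 = 1.0 - w1
    # Observed: sample-weighted average of the within-group heterozygosities.
    observed = w1 * hwe_genotype_freqs(p1)[1] + w2 * hwe_genotype_freqs(p2)[1]
    # Expected: what HWE predicts from the pooled allele frequency.
    p_pooled = w1 * p1 + w2 * p2
    expected = hwe_genotype_freqs(p_pooled)[1]
    return observed, expected


if __name__ == "__main__":
    obs, exp = pooled_heterozygosity(0.1, 0.9)
    print(f"observed het = {obs:.2f}, expected het = {exp:.2f}")
```

With equal group sizes and allele frequencies of 0.1 and 0.9, the observed heterozygosity is 0.18 against an expected 0.50, a deficit of 2·w1·w2·(p1 − p2)² = 0.32. An HWE test on the pooled cohort would reject such a variant even though it is well genotyped, which is the motivation for testing within ancestry groups.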
==================================================
Variant QC Summary (autosomes)
==================================================
Input variants: 650,427
After call-rate filter: 642,185 (≥ 0.98)
After HWE filter: 639,412 (p ≥ 1e-6)
After MAF filter: 531,208 (≥ 0.01)
Ti/Tv ratio: 2.07
Mean inbreeding F: 0.0012
Inbreeding F range: -0.032 to 0.041
Missingness histogram:
0.00-0.01: ████████████████████████ 598,127
0.01-0.02: ████ 44,058
0.02-0.05: ██ 6,923
0.05-0.10: █ 1,062
0.10-1.00: █ 257
MAF distribution:
0.00-0.01: ████████████████ 108,977
0.01-0.05: ████████████████████ 132,245
0.05-0.10: ██████████████ 98,456
0.10-0.25: ████████████████████████ 164,123
0.25-0.50: ████████████████████████ 146,626
======================================================
Ancestry-Stratified QC Summary
======================================================
Ancestry Groups Analyzed:
AFR: 10 samples
EAS: 10 samples
EUR: 10 samples
Collated variant QC: /tmp/mock_output/ancestry_stratified_qc/collated_variant_qc.tsv
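The cross-ancestry pass metrics described earlier could be derived roughly as follows: a variant passes a given criterion only if it passes in every ancestry subset analyzed. This is a minimal sketch under assumed names; the input layout, the function name, and the keys are hypothetical, and the default thresholds are the variant QC values from the reference table.

```python
def cross_ancestry_pass(per_group_qc, min_call_rate=0.98, min_hwe_p=1e-6, min_maf=0.01):
    """Summarize whether one variant passes QC in every ancestry group.

    `per_group_qc` maps a group label (e.g. "AFR") to a dict with the
    hypothetical keys call_rate, hwe_p and maf for that variant.
    """
    groups = list(per_group_qc.values())
    flags = {
        "call_rate": all(g["call_rate"] >= min_call_rate for g in groups),
        "hwe": all(g["hwe_p"] >= min_hwe_p for g in groups),
        "maf": all(g["maf"] >= min_maf for g in groups),
    }
    # Overall flag: the variant must pass every criterion in every group.
    flags["all"] = all(flags.values())
    return flags
```

Requiring a pass in every group is the strictest reading of "cross-ancestry pass"; a variant that is monomorphic in one group would fail the MAF flag under this rule even if it is common elsewhere.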
Peddy validates sex, ancestry, and relatedness from VCF genotypes. Ancestry predictions are colored by predicted population (EUR, AFR, EAS, AMR, SAS). Note: Peddy PCs are projected onto the 1000 Genomes reference panel and are in a separate coordinate space from the pipeline’s own PCA — the two are not directly comparable.
==================================================
PCA QC Summary
==================================================
Input samples: 30
After call rate ≥ 0.98: 28
After mind ≤ 0.02: 28
Input variants: 650,427
After geno ≤ 0.02: 642,185
After HWE (1e-6): 639,412
After MAF ≥ 0.05: 485,234
After LD pruning: 98,412
PCs computed: 20
Eigenvalues: 12.34, 8.56, 5.23, 3.89, 2.67, ...
==================================================
Manifest Realignment Summary
==================================================
Manifest: GSA-24v3-0_A1
Reference: chm13v2.0 (T2T-CHM13v2.0)
Aligner: bwa 0.7.18
Total probes: 654,027
Mapped: 650,427 (99.45%)
Unmapped: 2,134 (0.33%)
Multi-mapped: 1,466 (0.22%)
Position changes: 3,812 (0.58%)
Same chromosome, different position: 3,701
Different chromosome: 111
Strand changes: 245
New placements: 89 (previously unmapped)
==================================================
QC Diagnostic Report
==================================================
✓ Call rate distribution appears normal
✓ LRR SD distribution is within expected range
⚠ 2 samples flagged for BAF SD > 0.15 (possible contamination)
⚠ 2 samples are heterozygosity rate outliers (± 3 SD)
✓ Gender predictions consistent (15 M, 15 F)
✓ Ti/Tv ratio 2.07 is within expected range (1.8-2.2)
✓ No build mismatch detected between manifest and reference
Recommendations:
- Review BAF SD-flagged samples (SAMPLE_008, SAMPLE_019) for
possible contamination or mosaicism
- Review het rate outliers (SAMPLE_010, SAMPLE_021) for
population stratification or sample swaps
Compiled citation summary for methods, tools, and algorithms used in this
pipeline. The same content is also exported as citations_summary.tsv.
| Citation | DOI | Cited for | Pipeline application |
|---|---|---|---|
| Anderson et al. 2010, Nat Protoc 5:1564-1573 | doi:10.1038/nprot.2010.116 | Foundational GWAS sample/variant QC guidance | Supports sample call rate and heterozygosity outlier review; supports variant call rate and HWE filtering guidance. |
| Marees et al. 2018, Int J Methods Psychiatr Res 27:e1608 | doi:10.1002/mpr.1608 | Modern practical GWAS QC workflow recommendations | Supports contamination/noise interpretation (BAF SD, heterozygosity), variant QC thresholds, and stricter PCA MAF filtering for stability. |
| Turner et al. 2011, Curr Protoc Hum Genet Unit 1.19 | doi:10.1002/0471142905.hg0119s68 | Array-focused genotype QC and filtering practice | Supports LRR/BAF quality interpretation, variant missingness thresholds, and inbreeding/relatedness flagging context. |
| Danecek et al. 2021, Gigascience 10(2):giab008 | doi:10.1093/gigascience/giab008 | bcftools framework for variant processing | Supports IDAT-to-GTC/VCF processing and downstream variant manipulations. |
| Li and Durbin 2009, Bioinformatics 25(14):1754-1760 | doi:10.1093/bioinformatics/btp324 | BWA alignment algorithm | Supports manifest probe flank realignment to the selected reference genome. |
| Chang et al. 2015, Gigascience 4:7 | doi:10.1186/s13742-015-0047-8 | PLINK analytical framework for large-scale genotype QC | Supports variant-level QC metrics, HWE/missingness/MAF filtering, and LD pruning in ancestry PCA preparation. |
| Abraham et al. 2017, Bioinformatics 33(17):2776-2778 | doi:10.1093/bioinformatics/btx299 | flashpca2 randomized PCA implementation | Supports computationally efficient ancestry PCA for large genotype matrices. |
| Pedersen and Quinlan 2017, Am J Hum Genet 100(3):406-413 | doi:10.1016/j.ajhg.2017.01.017 | peddy pedigree/sex/ancestry QC | Supports automated sex check, ancestry prediction, and relatedness validation from VCF genotypes via the peddy tool. |
| Genovese et al. 2024, Bioinformatics 40(2):btae038 | doi:10.1093/bioinformatics/btae038 | bcftools/liftover assembly-coordinate conversion | Supports coordinate conversion from non-GRCh38 builds to GRCh38 for peddy site matching via bcftools +liftover. |
| Peterson et al. 2019, Am J Hum Genet 105(5):921-935 | doi:10.1016/j.ajhg.2019.09.022 | Within-ancestry PCA and ancestry-stratified QC | Supports ancestry-stratified variant QC and within-ancestry PCA to resolve finer population structure and avoid Wahlund effect in HWE testing across mixed-ancestry cohorts. |
Genotyping intensity data (IDAT files) were processed using the Illumina IDAT Processing Pipeline. Raw IDAT files were converted to GTC format using the GenCall algorithm (bcftools +idat2gtc), then to VCF format with B Allele Frequency (BAF) and Log R Ratio (LRR) intensities (bcftools +gtc2vcf). Probe coordinates were validated by realigning CSV manifest flank sequences against the CHM13 reference genome using BWA-MEM.
A two-stage genotyping approach was employed. Stage 1 genotype calls were generated using manufacturer-provided EGT cluster definitions. Per-sample QC metrics (call rate, LRR standard deviation, BAF standard deviation at heterozygous sites, and heterozygosity rate) were computed on autosomal variants. Following common GWAS QC guidance (Anderson et al., 2010; Marees et al., 2018; Turner et al., 2011), high-quality samples (call rate ≥ 0.97 and LRR SD ≤ 0.35; n = 24 of 30) were used to recompute study-specific genotype cluster definitions (EGT file) in Stage 2. All samples were then re-genotyped using the study-specific clusters, with additional BAF/LRR median adjustment (--adjust-clusters).
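The per-sample metrics named above can be sketched as follows for a single sample. This is an illustration under stated assumptions rather than the pipeline's implementation: the record layout and function name are hypothetical, the sample standard deviation is used, and heterozygosity rate is taken here as heterozygous calls divided by non-missing calls.

```python
from statistics import stdev

MISSING_GT = {"./.", "."}
HET_GT = {"0/1", "1/0"}


def sample_intensity_metrics(records):
    """Compute call rate, LRR SD, BAF SD at het sites, and het rate.

    `records` is a list of (gt, baf, lrr) tuples for one sample's
    autosomal sites, where gt is a VCF-style genotype string and
    baf/lrr are floats, or None when the value is missing.
    """
    called = [r for r in records if r[0] not in MISSING_GT]
    hets = [r for r in called if r[0] in HET_GT]
    lrr = [r[2] for r in records if r[2] is not None]
    het_baf = [r[1] for r in hets if r[1] is not None]

    nan = float("nan")
    return {
        "call_rate": len(called) / len(records) if records else nan,
        # stdev requires at least two observations.
        "lrr_sd": stdev(lrr) if len(lrr) > 1 else nan,
        "baf_sd_het": stdev(het_baf) if len(het_baf) > 1 else nan,
        "het_rate": len(hets) / len(called) if called else nan,
    }
```

Restricting BAF SD to heterozygous sites matters because homozygous sites cluster near 0 or 1 and would otherwise dominate the spread; at het sites a clean sample should cluster tightly around 0.5.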
Sample ancestry was predicted using peddy (Pedersen and Quinlan, 2017), which projects samples onto 1000 Genomes principal components for ancestry classification. Ancestry principal components were computed using stringent variant QC (missingness < 2%, HWE p ≥ 1e-6, MAF ≥ 5%), LD pruning (window = 1000 kb, step = 1, r² < 0.1), and flashpca2. This stricter PCA MAF threshold was used to improve loading stability. For each ancestry group with sufficient sample size (≥ 100 by default), ancestry-stratified variant QC (missingness, HWE, allele frequency) was performed to avoid inflated HWE deviation due to the Wahlund effect in mixed-ancestry samples (Anderson et al., 2010). Within-ancestry PCA was also computed to resolve finer population structure masked in multi-ancestry PCA (Peterson et al., 2019). Ancestry-specific PCs and variant QC metrics are collated alongside the full-cohort results, including cross-ancestry pass flags for call rate, HWE, MAF, and overall QC.
Variant-level QC included per-variant missingness, Hardy-Weinberg equilibrium tests (mid-p adjustment), allele frequency estimation, per-sample inbreeding coefficients (F statistic), and transition/transversion ratio computation (plink2). After processing, the mean call rate was 0.9813 and the mean LRR SD was 0.2072.
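For readers unfamiliar with the transition/transversion ratio, the sketch below shows how it is defined for biallelic SNPs. It is a plain-Python illustration, not the plink2 computation used by the pipeline; the function names and the input format (a list of REF/ALT pairs) are assumptions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(variants):
    """Return Ti/Tv over (ref, alt) pairs, or None if there are no transversions.

    Indels, multi-base alleles, non-ACGT symbols, and ref == alt are skipped.
    """
    ti = tv = 0
    for ref, alt in variants:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None
```

For example, `ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])` counts two transitions and one transversion. Of the 12 possible single-base substitutions, 4 are transitions and 8 are transversions, so purely random changes would give a ratio near 0.5; real variant sets sit well above that because transitions are more frequent in practice.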
bcftools: not installed
plink2: not installed
bwa: not installed
flashpca: not installed
peddy: /opt/hostedtoolcache/Python/3.11.15/x64/bin/python3: No module named peddy