38

How to call SNPs and INDELs with GATK4 HaplotypeCaller — complete pipeline

I'm trying to run a germline variant calling pipeline on whole-genome sequencing data. I have BWA-MEM2 aligned and sorted BAM files. What is the correct GATK4 HaplotypeCaller command order, and what preprocessing steps are mandatory before calling variants?
1 views asked 1 month ago by Admin
1 Answer
31
✓ Accepted Answer
Here is the complete GATK4 best-practices germline variant calling pipeline: **Step 1 — Mark duplicates** ```bash gatk MarkDuplicates -I aligned_sorted.bam -O marked_dups.bam -M dup_metrics.txt --ASSUME_SORT_ORDER coordinate ``` **Step 2 — Base Quality Score Recalibration (BQSR)** ```bash # Build recalibration table gatk BaseRecalibrator -I marked_dups.bam -R reference.fasta --known-sites dbsnp_146.hg38.vcf.gz --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz -O recal_data.table # Apply recalibration gatk ApplyBQSR -I marked_dups.bam -R reference.fasta --bqsr-recal-file recal_data.table -O recalibrated.bam ``` **Step 3 — HaplotypeCaller in GVCF mode** ```bash gatk HaplotypeCaller -I recalibrated.bam -R reference.fasta -O sample.g.vcf.gz -ERC GVCF --dbsnp dbsnp_146.hg38.vcf.gz ``` **Step 4 — Genotype GVCFs** ```bash gatk GenotypeGVCFs -R reference.fasta -V sample.g.vcf.gz -O raw_variants.vcf.gz ``` **Step 5 — VQSR filtering** (for WGS with >30 samples) ```bash # For small cohorts, use hard filtering instead: gatk VariantFiltration -V raw_variants.vcf.gz --filter-expression "QD < 2.0" --filter-name "QD2" --filter-expression "FS > 60.0" --filter-name "FS60" --filter-expression "MQ < 40.0" --filter-name "MQ40" -O filtered_variants.vcf.gz ``` All reference files (dbSNP, Mills) are available from the GATK resource bundle: `gs://gatk-best-practices/somatic-hg38/`.
answered 4 weeks ago by Admin