38
How to call SNPs and INDELs with GATK4 HaplotypeCaller — complete pipeline
I'm trying to run a germline variant calling pipeline on whole-genome sequencing data. I have BWA-MEM2 aligned and sorted BAM files. What is the correct GATK4 HaplotypeCaller command order, and what preprocessing steps are mandatory before calling variants?
3 views
1 Answer
31
✓
✓ Accepted Answer
Here is the complete GATK4 best-practices germline variant calling pipeline:
**Step 1 — Mark duplicates**
```bash
gatk MarkDuplicates
-I aligned_sorted.bam
-O marked_dups.bam
-M dup_metrics.txt
--ASSUME_SORT_ORDER coordinate
```
**Step 2 — Base Quality Score Recalibration (BQSR)**
```bash
# Build recalibration table
gatk BaseRecalibrator
-I marked_dups.bam
-R reference.fasta
--known-sites dbsnp_146.hg38.vcf.gz
--known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
-O recal_data.table
# Apply recalibration
gatk ApplyBQSR
-I marked_dups.bam
-R reference.fasta
--bqsr-recal-file recal_data.table
-O recalibrated.bam
```
**Step 3 — HaplotypeCaller in GVCF mode**
```bash
gatk HaplotypeCaller
-I recalibrated.bam
-R reference.fasta
-O sample.g.vcf.gz
-ERC GVCF
--dbsnp dbsnp_146.hg38.vcf.gz
```
**Step 4 — Genotype GVCFs**
```bash
gatk GenotypeGVCFs
-R reference.fasta
-V sample.g.vcf.gz
-O raw_variants.vcf.gz
```
**Step 5 — VQSR filtering** (for WGS with >30 samples)
```bash
# For small cohorts, use hard filtering instead:
gatk VariantFiltration
-V raw_variants.vcf.gz
--filter-expression "QD < 2.0" --filter-name "QD2"
--filter-expression "FS > 60.0" --filter-name "FS60"
--filter-expression "MQ < 40.0" --filter-name "MQ40"
-O filtered_variants.vcf.gz
```
All reference files (dbSNP, Mills) are available from the GATK resource bundle: `gs://gatk-best-practices/somatic-hg38/`.