How to filter, annotate, and extract variants from a VCF file using bcftools
I have a multi-sample VCF file from GATK with ~3 million variants. I need to: (1) filter by quality and depth, (2) select only SNPs (not INDELs), (3) subset to variants in a specific gene list, and (4) extract genotypes to a tabular format. How do I do all of this with bcftools?
1 Answer
✓ Accepted Answer
Here is a complete bcftools workflow:
```bash
# 1. Index the VCF (required for region queries)
bgzip variants.vcf
tabix -p vcf variants.vcf.gz
# 2. View basic stats
bcftools stats variants.vcf.gz | grep ^SN
# 3. Filter by quality and depth
bcftools filter \
  -e 'QUAL<30 || INFO/DP<10 || INFO/DP>1000' \
  -s LOWQUAL \
  variants.vcf.gz |
  bcftools view -f PASS -Oz -o filtered.vcf.gz
tabix filtered.vcf.gz
# 4. Keep only SNPs (exclude INDELs and MNPs)
bcftools view --types snps filtered.vcf.gz -Oz -o snps_only.vcf.gz
# 5. Filter by minor allele frequency
bcftools view -q 0.01:minor snps_only.vcf.gz -o maf_filtered.vcf.gz -Oz
# 6. Subset to specific samples
bcftools view -s sample1,sample2,sample3 maf_filtered.vcf.gz -o subset.vcf.gz -Oz
# 7. Extract to tabular format
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL[\t%GT]\n' \
  maf_filtered.vcf.gz > genotypes.tsv
# 8. Annotate with gene names (source here: a bgzipped, indexed VCF carrying INFO/GENE)
bcftools annotate \
  --annotations gene_annotations.vcf.gz \
  --columns INFO/GENE \
  maf_filtered.vcf.gz -Oz -o annotated.vcf.gz
```
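The workflow above does not cover point (3) of the question, subsetting to a gene list. Here is a minimal sketch, assuming two hypothetical inputs that are not part of the original answer: `gene_list.txt` (one gene symbol per line) and `all_genes.bed` (tab-separated chrom, start, end, gene symbol, using the same reference build and chromosome names as the VCF). The function names are illustrative too.

```bash
# Sketch: build a BED of target genes, then subset the VCF to those regions.
# gene_list.txt and all_genes.bed are hypothetical file names.

# Print the BED rows (chrom, start, end, gene) whose 4th column is in the gene list.
make_target_bed() {
    local gene_list=$1 genes_bed=$2
    awk -F'\t' 'NR == FNR { keep[$1]; next } $4 in keep' "$gene_list" "$genes_bed"
}

# Restrict an indexed VCF to the regions listed in a BED file.
subset_to_regions() {
    local bed=$1 vcf_in=$2 vcf_out=$3
    bcftools view -R "$bed" "$vcf_in" -Oz -o "$vcf_out"
}
```

Typical use would be `make_target_bed gene_list.txt all_genes.bed > target_genes.bed` followed by `subset_to_regions target_genes.bed maf_filtered.vcf.gz gene_subset.vcf.gz`. The input VCF needs an index (`tabix -p vcf maf_filtered.vcf.gz`) for `-R` to work.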
**Filter expression syntax** (`-e` = exclude, `-i` = include):
```bash
# Common filter expressions:
-e 'QUAL<30' # low quality
-e 'INFO/DP<10' # low depth
-e 'FORMAT/GQ[*] < 20' # low genotype quality (any sample)
-i 'INFO/AF > 0.05' # include common variants only
```
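To show how these fragments fit into a full command, here is a small sketch that builds the exclusion expression from thresholds and hands it to `bcftools filter`. The helper names and default thresholds are assumptions for illustration; the defaults simply mirror step 3 above.

```bash
# Sketch: compose an exclusion expression from thresholds (hypothetical helper names).

# Print an -e expression matching low-quality or abnormally covered sites.
build_exclude_expr() {
    local min_qual=${1:-30} min_dp=${2:-10} max_dp=${3:-1000}
    printf 'QUAL<%s || INFO/DP<%s || INFO/DP>%s' "$min_qual" "$min_dp" "$max_dp"
}

# Soft-filter a VCF: matching sites are tagged LOWQUAL in FILTER rather than removed.
soft_filter() {
    local vcf_in=$1 vcf_out=$2
    shift 2
    bcftools filter -e "$(build_exclude_expr "$@")" -s LOWQUAL "$vcf_in" -Oz -o "$vcf_out"
}
```

For example, `soft_filter variants.vcf.gz soft_filtered.vcf.gz 30 10 1000` should behave like the first half of step 3. Keeping the thresholds as arguments makes it easier to try stricter or looser cutoffs without editing the expression by hand.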
For functional consequence annotation, use `bcftools csq` (it needs the reference FASTA and a GFF3 gene annotation) or run Ensembl VEP.
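A minimal sketch of the `bcftools csq` route, assuming a hypothetical reference FASTA and Ensembl-style GFF3 that match the build of the VCF. GATK calls are usually unphased, so the sketch passes `--phase a` to take genotypes as they are; treat that choice, and the function name, as assumptions to check against the bcftools version you run.

```bash
# Sketch: consequence annotation with bcftools csq (file names passed in are hypothetical).
annotate_consequences() {
    local ref_fa=$1 gff3=$2 vcf_in=$3 vcf_out=$4
    # --phase a: use unphased heterozygous genotypes as-is instead of requiring phased GTs
    bcftools csq --fasta-ref "$ref_fa" --gff-annot "$gff3" --phase a \
        "$vcf_in" -Oz -o "$vcf_out"
}
```

It could be called as `annotate_consequences reference.fa genes.gff3.gz maf_filtered.vcf.gz csq.vcf.gz`; the predicted consequences are written to the `INFO/BCSQ` tag.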