How to filter, annotate, and extract variants from a VCF file using bcftools

I have a multi-sample VCF file from GATK with ~3 million variants. I need to: (1) filter by quality and depth, (2) select only SNPs (not INDELs), (3) subset to variants in a specific gene list, and (4) extract genotypes to a tabular format. How do I do all of this with bcftools?
1 views asked 2 months ago by Admin
1 Answer
✓ Accepted Answer
Here is a complete bcftools workflow:

```bash
# 1. Compress and index the VCF (required for region queries)
bgzip variants.vcf
tabix -p vcf variants.vcf.gz

# 2. View basic stats
bcftools stats variants.vcf.gz | grep ^SN

# 3. Filter by quality and depth: failing sites are tagged LOWQUAL, then only PASS sites are kept
bcftools filter -e 'QUAL<30 || INFO/DP<10 || INFO/DP>1000' -s LOWQUAL variants.vcf.gz | bcftools view -f PASS -o filtered.vcf.gz -Oz
tabix -p vcf filtered.vcf.gz

# 4. Keep only SNPs (exclude INDELs and MNPs)
bcftools view --types snps filtered.vcf.gz -o snps_only.vcf.gz -Oz

# 5. Filter by minor allele frequency
bcftools view -q 0.01:minor snps_only.vcf.gz -o maf_filtered.vcf.gz -Oz

# 6. Subset to specific samples
bcftools view -s sample1,sample2,sample3 maf_filtered.vcf.gz -o subset.vcf.gz -Oz

# 7. Extract to tabular format (one tab-separated genotype column per sample)
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL[\t%GT]\n' maf_filtered.vcf.gz > genotypes.tsv

# 8. Annotate with gene names (requires a bgzipped, indexed annotation VCF that carries INFO/GENE)
bcftools annotate --annotations gene_annotations.vcf.gz --columns INFO/GENE maf_filtered.vcf.gz -o annotated.vcf.gz -Oz
```

**Filter expression syntax** (`-e` = exclude, `-i` = include):

```bash
# Common filter expressions:
-e 'QUAL<30'             # low quality
-e 'INFO/DP<10'          # low depth
-e 'FORMAT/GQ[*] < 20'   # low genotype quality (any sample)
-i 'INFO/AF > 0.05'      # include common variants only
```

For functional consequence annotation, run `bcftools csq` (it needs a reference FASTA and a GFF3 gene annotation) or use Ensembl VEP.
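The question also asks for subsetting to a gene list, which the steps above do not cover directly. One way to do it, as a rough sketch: turn the gene list into a BED file of regions and pass that to `bcftools view -R`. The file names `demo_genes.txt`, `demo_all_genes.bed`, and `gene_regions.bed`, the gene names, and the coordinates are all made up for illustration; in practice the coordinates would come from a gene annotation that matches your reference build and chromosome naming.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical demo inputs so the sketch runs on its own.
# demo_genes.txt: one gene symbol per line (the genes you want).
printf 'GENE_A\nGENE_C\n' > demo_genes.txt
# demo_all_genes.bed: chrom, start, end, gene name (made-up coordinates).
printf 'chr2\t500\t900\tGENE_C\nchr1\t5000\t6000\tGENE_B\nchr1\t1000\t2000\tGENE_A\n' > demo_all_genes.bed

# Keep BED rows whose 4th column is in the gene list,
# then sort by chromosome and start position.
awk -F'\t' 'NR==FNR { want[$1]; next } $4 in want' \
    demo_genes.txt demo_all_genes.bed \
  | sort -k1,1 -k2,2n > gene_regions.bed

cat gene_regions.bed

# Subset the VCF to those regions. -R needs an indexed input, and the
# .bed extension makes bcftools read coordinates as 0-based, half-open.
# Guarded so the sketch still runs where bcftools or the VCF is absent.
if command -v bcftools >/dev/null 2>&1 && [ -f maf_filtered.vcf.gz ]; then
  bcftools index -f -t maf_filtered.vcf.gz
  bcftools view -R gene_regions.bed maf_filtered.vcf.gz -Oz -o gene_subset.vcf.gz
fi
```

Chromosome names in the BED file must match the VCF exactly (`chr1` vs `1`), otherwise the region query silently returns nothing. Running `bcftools query` from step 7 on `gene_subset.vcf.gz` then gives the genotype table restricted to those genes.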
answered 3 weeks ago by Admin