27

Fastest way to parse a large VCF file in Python for GWAS analysis?

I have a VCF file with ~15 million SNPs and 5000 samples (~40 GB). I need to extract allele frequencies and filter by MAF > 0.01. Using PyVCF is extremely slow. What's the fastest approach?
3 views asked 1 day ago by Admin
1 Answer
22
✓ Accepted Answer
Use **cyvcf2** — it's ~20x faster than PyVCF because it wraps htslib in C: ```python from cyvcf2 import VCF import numpy as np vcf = VCF('variants.vcf.gz') # works with bgzipped+tabix indexed maf_threshold = 0.01 results = [] for variant in vcf: if variant.var_type != 'snp': continue afs = variant.aaf # alternate allele frequency maf = min(afs, 1 - afs) if maf >= maf_threshold: results.append({ 'chrom': variant.CHROM, 'pos': variant.POS, 'ref': variant.REF, 'alt': variant.ALT[0], 'maf': maf, }) vcf.close() ``` For 40 GB VCFs, also consider: 1. **PLINK2** for the heavy lifting — it's purpose-built for GWAS and handles this in seconds 2. **hail** (PySpark-based) for truly massive datasets 3. Pre-index with `tabix` so you can query by region
answered 3 days ago by Admin