Fastest way to parse a large VCF file in Python for GWAS analysis?

Question

I have a VCF file with ~15 million SNPs and 5000 samples (~40 GB). I need to extract allele frequencies and filter by MAF > 0.01. Using PyVCF is extremely slow. What's the fastest approach?

Admin · Accepted Answer

Use **cyvcf2** — it's ~20x faster than PyVCF because it wraps htslib in C:

```python
from cyvcf2 import VCF
import numpy as np

vcf = VCF('variants.vcf.gz')  # works with bgzipped+tabix indexed

maf_threshold = 0.01
results = []

for variant in vcf:
    if variant.var_type != 'snp':
        continue
    afs = variant.aaf  # alternate allele frequency
    maf = min(afs, 1 - afs)
    if maf >= maf_threshold:
        results.append({
            'chrom': variant.CHROM,
            'pos':   variant.POS,
            'ref':   variant.REF,
            'alt':   variant.ALT[0],
            'maf':   maf,
        })

vcf.close()
```

For 40 GB VCFs, also consider:
1. **PLINK2** for the heavy lifting — it's purpose-built for GWAS and handles this in seconds
2. **hail** (PySpark-based) for truly massive datasets
3. Pre-index with `tabix` so you can query by region