27
Fastest way to parse a large VCF file in Python for GWAS analysis?
I have a VCF file with ~15 million SNPs and 5000 samples (~40 GB). I need to extract allele frequencies and filter by MAF > 0.01. Using PyVCF is extremely slow. What's the fastest approach?
3 views
1 Answer
22
✓
✓ Accepted Answer
Use **cyvcf2** — it's ~20x faster than PyVCF because it wraps htslib in C:
```python
from cyvcf2 import VCF
import numpy as np
vcf = VCF('variants.vcf.gz') # works with bgzipped+tabix indexed
maf_threshold = 0.01
results = []
for variant in vcf:
if variant.var_type != 'snp':
continue
afs = variant.aaf # alternate allele frequency
maf = min(afs, 1 - afs)
if maf >= maf_threshold:
results.append({
'chrom': variant.CHROM,
'pos': variant.POS,
'ref': variant.REF,
'alt': variant.ALT[0],
'maf': maf,
})
vcf.close()
```
For 40 GB VCFs, also consider:
1. **PLINK2** for the heavy lifting — it's purpose-built for GWAS and handles this in seconds
2. **hail** (PySpark-based) for truly massive datasets
3. Pre-index with `tabix` so you can query by region