How to estimate genome size from sequencing reads using k-mer analysis with Jellyfish and GenomeScope

Question

I have Illumina WGS reads for a non-model organism and want to estimate the genome size before assembly. I've heard about k-mer frequency analysis. How do I use Jellyfish and GenomeScope2 to estimate genome size, heterozygosity, and repeat content?

Admin · Accepted Answer

K-mer frequency analysis is the standard pre-assembly genome profiling approach. Here is the full workflow:

**Step 1 — Count k-mers with Jellyfish**
```bash
# Count canonical k-mers (k=21 is standard)
jellyfish count 
  -C -m 21 -s 5G -t 16 
  -o kmer_counts.jf 
   kmer_histo.txt
```

**Step 2 — Analyze with GenomeScope2**
```bash
# Install: pip install genomescope2
genomescope2 
  -i kmer_histo.txt 
  -o genomescope_output 
  -k 21 
  -p 2  # ploidy: 1=haploid, 2=diploid
```

**What GenomeScope2 reports:**
- **Genome size (haploid)**: total unique sequence
- **Heterozygosity**: proportion of heterozygous sites
- **Repeat content**: estimated repetitive fraction
- **k-mer coverage**: which feeds into assembly parameters

**Interpreting the histogram:**
- First small peak: sequencing errors (k-mers appearing only once)
- Main peak: heterozygous k-mers (at ~half-coverage)
- Second peak: homozygous k-mers (at ~full-coverage)

For polyploids, use `smudgeplots` alongside GenomeScope2 to determine ploidy level.

```python
# Quick Python estimate from histogram
import pandas as pd
histo = pd.read_csv('kmer_histo.txt', sep=' ', header=None, names=['freq','count'])
# Peak coverage × total distinct k-mers ≈ genome size
peak_cov = histo.loc[histo['count'].idxmax(), 'freq']
genome_size_bp = histo['count'].sum() / peak_cov
print(f'Estimated genome size: {genome_size_bp/1e6:.1f} Mb')
```