25
How to estimate genome size from sequencing reads using k-mer analysis with Jellyfish and GenomeScope
I have Illumina WGS reads for a non-model organism and want to estimate the genome size before assembly. I've heard about k-mer frequency analysis. How do I use Jellyfish and GenomeScope2 to estimate genome size, heterozygosity, and repeat content?
4 views
1 Answer
21
✓
✓ Accepted Answer
K-mer frequency analysis is the standard pre-assembly genome profiling approach. Here is the full workflow:
**Step 1 — Count k-mers with Jellyfish**
```bash
# Count canonical k-mers (k=21 is standard)
jellyfish count
-C -m 21 -s 5G -t 16
-o kmer_counts.jf
<(zcat reads_R1.fastq.gz reads_R2.fastq.gz)
# Export histogram
jellyfish histo -t 16 kmer_counts.jf > kmer_histo.txt
```
**Step 2 — Analyze with GenomeScope2**
```bash
# Install: pip install genomescope2
genomescope2
-i kmer_histo.txt
-o genomescope_output
-k 21
-p 2 # ploidy: 1=haploid, 2=diploid
```
**What GenomeScope2 reports:**
- **Genome size (haploid)**: total unique sequence
- **Heterozygosity**: proportion of heterozygous sites
- **Repeat content**: estimated repetitive fraction
- **k-mer coverage**: which feeds into assembly parameters
**Interpreting the histogram:**
- First small peak: sequencing errors (k-mers appearing only once)
- Main peak: heterozygous k-mers (at ~half-coverage)
- Second peak: homozygous k-mers (at ~full-coverage)
For polyploids, use `smudgeplots` alongside GenomeScope2 to determine ploidy level.
```python
# Quick Python estimate from histogram
import pandas as pd
histo = pd.read_csv('kmer_histo.txt', sep=' ', header=None, names=['freq','count'])
# Peak coverage × total distinct k-mers ≈ genome size
peak_cov = histo.loc[histo['count'].idxmax(), 'freq']
genome_size_bp = histo['count'].sum() / peak_cov
print(f'Estimated genome size: {genome_size_bp/1e6:.1f} Mb')
```