25

How to estimate genome size from sequencing reads using k-mer analysis with Jellyfish and GenomeScope

I have Illumina WGS reads for a non-model organism and want to estimate the genome size before assembly. I've heard about k-mer frequency analysis. How do I use Jellyfish and GenomeScope2 to estimate genome size, heterozygosity, and repeat content?
2 views asked 3 weeks ago by Admin
1 Answer
21
✓ Accepted Answer
K-mer frequency analysis is the standard pre-assembly genome profiling approach. Here is the full workflow: **Step 1 — Count k-mers with Jellyfish** ```bash # Count canonical k-mers (k=21 is standard) jellyfish count -C -m 21 -s 5G -t 16 -o kmer_counts.jf <(zcat reads_R1.fastq.gz reads_R2.fastq.gz) # Export histogram jellyfish histo -t 16 kmer_counts.jf > kmer_histo.txt ``` **Step 2 — Analyze with GenomeScope2** ```bash # Install: pip install genomescope2 genomescope2 -i kmer_histo.txt -o genomescope_output -k 21 -p 2 # ploidy: 1=haploid, 2=diploid ``` **What GenomeScope2 reports:** - **Genome size (haploid)**: total unique sequence - **Heterozygosity**: proportion of heterozygous sites - **Repeat content**: estimated repetitive fraction - **k-mer coverage**: which feeds into assembly parameters **Interpreting the histogram:** - First small peak: sequencing errors (k-mers appearing only once) - Main peak: heterozygous k-mers (at ~half-coverage) - Second peak: homozygous k-mers (at ~full-coverage) For polyploids, use `smudgeplots` alongside GenomeScope2 to determine ploidy level. ```python # Quick Python estimate from histogram import pandas as pd histo = pd.read_csv('kmer_histo.txt', sep=' ', header=None, names=['freq','count']) # Peak coverage × total distinct k-mers ≈ genome size peak_cov = histo.loc[histo['count'].idxmax(), 'freq'] genome_size_bp = histo['count'].sum() / peak_cov print(f'Estimated genome size: {genome_size_bp/1e6:.1f} Mb') ```
answered 4 weeks ago by Admin