What is the difference between TPM, FPKM, and RPKM in RNA-seq gene expression normalization?
I keep seeing TPM, FPKM, and RPKM used in RNA-seq papers and tools. What is the mathematical difference between them? Which should I use for between-sample comparisons, and which do tools like DESeq2 and edgeR actually want?
1 Answer
✓ Accepted Answer
All three units normalize for both gene length and sequencing depth. They differ in the order of those two steps, which changes the per-sample denominator:
**RPKM** (Reads Per Kilobase per Million mapped reads)
- Normalize for sequencing depth first, then gene length
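- Formula: `RPKM = (read_count × 10^9) / (total_mapped_reads × gene_length_bp)`
- Problem: the sum of RPKM values differs from sample to sample, so the same RPKM can represent a different share of the transcriptome in two samples → NOT directly comparable across samples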
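**FPKM** (Fragments Per Kilobase per Million mapped fragments) is RPKM for paired-end data: the two reads of a pair come from one fragment, so the fragment is counted once. Same formula, same problem.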
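To make the RPKM/FPKM formula concrete, here is a minimal pandas sketch. The function name `counts_to_rpkm` and the toy counts and lengths are made up for illustration; for FPKM the same arithmetic would apply with fragment counts as input.

```python
import pandas as pd

def counts_to_rpkm(counts_df, gene_lengths_bp):
    """counts_df: genes x samples; gene_lengths_bp: Series of lengths in bp, indexed by gene."""
    lengths_kb = gene_lengths_bp.reindex(counts_df.index) / 1e3
    per_million = counts_df.sum(axis=0) / 1e6      # sequencing depth first
    rpm = counts_df.div(per_million, axis=1)       # reads per million
    return rpm.div(lengths_kb, axis=0)             # then per kilobase

# Toy data, invented for this example: both samples have 1,000 reads in total
toy_counts = pd.DataFrame(
    {"sample_a": [100, 200, 700], "sample_b": [700, 200, 100]},
    index=["gene1", "gene2", "gene3"],
)
toy_lengths = pd.Series([1000, 2000, 10000], index=["gene1", "gene2", "gene3"])

toy_rpkm = counts_to_rpkm(toy_counts, toy_lengths)
rpkm_sums = toy_rpkm.sum(axis=0)
print(rpkm_sums)
```

With these toy numbers the column sums come out at roughly 270,000 for `sample_a` and 810,000 for `sample_b`, even though both samples have the same read depth. That is the per-sample total that RPKM leaves uncontrolled.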
**TPM** (Transcripts Per Million)
- Normalize for gene length FIRST, then sequencing depth
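- Formula: `TPM = (RPK / sum_of_all_RPK) × 10^6`, where `RPK = read_count / gene_length_kb` and the sum runs over all genes in the sample
- TPM values in each sample always sum to 1 million, so a given TPM is the same share of the transcript pool in every sample → proportions are comparable across samples. TPM is still a relative measure: it does not correct for differences in RNA composition between samples.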
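```python
import pandas as pd

def counts_to_tpm(counts_df, gene_lengths):
    """counts_df: genes × samples; gene_lengths: Series of lengths in bp, indexed by gene."""
    # Align lengths to the count matrix and convert bp -> kb
    lengths_kb = gene_lengths.reindex(counts_df.index) / 1000
    rpk = counts_df.div(lengths_kb, axis=0)   # reads per kilobase
    scale = rpk.sum(axis=0) / 1e6             # per-sample "per million" factor
    return rpk.div(scale, axis=1)             # each column now sums to 1e6

# raw_counts and gene_length_series are assumed to be loaded already
tpm = counts_to_tpm(raw_counts, gene_length_series)
```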
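If you only have RPKM/FPKM values, they can be rescaled to TPM, because within one sample the two units differ only by a constant factor: `TPM = RPKM / sum(RPKM) × 10^6`. A minimal NumPy sketch with made-up numbers (`rpkm_to_tpm` is a hypothetical helper name, not a library function):

```python
import numpy as np

def rpkm_to_tpm(rpkm):
    """Rescale one sample's RPKM/FPKM vector so that it sums to one million."""
    rpkm = np.asarray(rpkm, dtype=float)
    return rpkm / rpkm.sum() * 1e6

# Toy single-sample data, invented for this example
counts = np.array([100.0, 200.0, 700.0])
lengths_bp = np.array([1000.0, 2000.0, 10000.0])

# Route 1: RPKM first, then rescale
rpkm = counts * 1e9 / (counts.sum() * lengths_bp)
tpm_from_rpkm = rpkm_to_tpm(rpkm)

# Route 2: TPM directly from counts (length first, then depth)
rpk = counts / (lengths_bp / 1e3)
tpm_direct = rpk / rpk.sum() * 1e6

print(np.allclose(tpm_from_rpkm, tpm_direct))
```

Both routes give the same values, so the ranking of genes within a sample is identical in RPKM and TPM. Only the per-sample scaling differs.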
**Which to use:**
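- **DESeq2 / edgeR**: raw, un-normalized counts only; they do their own normalization internally (median-of-ratios size factors in DESeq2, TMM in edgeR)
- **Between-sample comparison**: TPM rather than FPKM/RPKM, for descriptive comparison (not for differential expression testing)
- **Publication figures**: TPM
- **Cross-study comparison**: none of the above on their own; batch effects need correcting first, e.g. with ComBat-seq or similar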
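**Rule of thumb**: Never feed TPM, FPKM, or RPKM into DESeq2 or edgeR. Their statistical models assume raw counts, so pre-normalized input invalidates the dispersion estimates and the DE results.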
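For intuition about the normalization DESeq2 does internally, here is a simplified NumPy sketch of the median-of-ratios idea behind its size factors. It is an illustration under stated assumptions, not DESeq2's actual code: the function name and toy matrix are invented, and it simply drops every gene that has a zero count in any sample.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """counts: 2-D array of raw counts, genes x samples. Returns one factor per sample."""
    counts = np.asarray(counts, dtype=float)
    expressed = (counts > 0).all(axis=1)        # keep genes with no zero count
    log_counts = np.log(counts[expressed])
    # Per-gene reference: log of the geometric mean across samples
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Size factor: median ratio of each sample to the reference
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Toy matrix, invented for this example: sample 2 is sequenced twice as deeply
toy = np.array([
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 5],      # excluded from the size-factor estimate (contains a zero)
])
size_factors = median_of_ratios_size_factors(toy)
normalized = toy / size_factors
print(size_factors)
```

Note that gene length never appears here. The size factors are estimated from the raw counts themselves, which is why these tools need counts as input rather than TPM or FPKM.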