52

What is the difference between TPM, FPKM, and RPKM in RNA-seq gene expression normalization?

I keep seeing TPM, FPKM, and RPKM used in RNA-seq papers and tools. What is the mathematical difference between them? Which should I use for between-sample comparisons, and which do tools like DESeq2 and edgeR actually want?
2 views asked 3 weeks ago by Admin
1 Answer
46
✓ Accepted Answer
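These are all length- and depth-normalized expression units, but they differ critically in the order of normalization:

**RPKM** (Reads Per Kilobase per Million mapped reads)
- Normalize for sequencing depth first, then gene length
- Formula: `RPKM = (read_count × 10^9) / (total_reads × gene_length_bp)`
- Problem: the sum of RPKM values differs between samples, so the same RPKM value can represent a different share of the library in each sample → NOT comparable across samples

**FPKM** (Fragments Per Kilobase per Million mapped fragments) is the paired-end analogue of RPKM: it counts fragments (read pairs) instead of individual reads. Same formula, same problem.

**TPM** (Transcripts Per Million)
- Normalize for gene length FIRST, then sequencing depth
- Formula: `TPM = (read_count / gene_length_kb) / sum_of_all_RPK × 10^6`, where RPK = `read_count / gene_length_kb` and the sum runs over all genes in the sample
- TPM values in each sample always sum to 1 million, so each value is a proportion of that sample's total → comparable across samples

```python
import pandas as pd
import numpy as np

def counts_to_tpm(counts_df, gene_lengths):
    """counts_df: genes × samples raw counts; gene_lengths: Series of lengths in bp, indexed by gene"""
    # Step 1: length normalization (reads per kilobase); rows align on the gene index
    rpk = counts_df.div(gene_lengths / 1000, axis=0)
    # Step 2: depth normalization; one "per million" scaling factor per sample
    scale = rpk.sum(axis=0) / 1e6
    return rpk.div(scale, axis=1)

# raw_counts: your genes × samples count matrix; gene_length_series: matching gene lengths in bp
tpm = counts_to_tpm(raw_counts, gene_length_series)
```

**Which to use:**
- **DESeq2 / edgeR**: raw integer counts only; they do their own normalization internally
- **Between-sample comparison**: TPM (not FPKM/RPKM)
- **Publication figures**: TPM
- **Cross-study comparison**: none of the above on their own; correct for batch effects with ComBat-seq or similar

**Rule of thumb**: If you feed pre-normalized values (TPM, FPKM, or RPKM) into DESeq2 or edgeR instead of raw counts, your DE results will be wrong, because both tools model raw counts and estimate their own normalization factors.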
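To make the "order of normalization" point concrete, here is a minimal sketch that applies the RPKM and TPM formulas from this answer to a made-up toy matrix. The counts, gene lengths, and the helper names `rpkm` and `tpm` are all hypothetical and chosen only for illustration; they do not come from any real dataset or library.

```python
import pandas as pd

# Hypothetical toy data: 3 genes × 2 samples of raw read counts
counts = pd.DataFrame(
    {"sample_A": [100, 200, 700], "sample_B": [100, 200, 50]},
    index=["gene1", "gene2", "gene3"],
)
# Hypothetical gene lengths in base pairs
lengths_bp = pd.Series([1000, 2000, 10000], index=counts.index)

def rpkm(counts_df, gene_lengths_bp):
    """Depth first (per million mapped reads), then length (per kilobase)."""
    per_million = counts_df.sum(axis=0) / 1e6
    return counts_df.div(per_million, axis=1).div(gene_lengths_bp / 1e3, axis=0)

def tpm(counts_df, gene_lengths_bp):
    """Length first (reads per kilobase), then depth (per million RPK)."""
    rpk = counts_df.div(gene_lengths_bp / 1e3, axis=0)
    return rpk.div(rpk.sum(axis=0) / 1e6, axis=1)

rpkm_vals = rpkm(counts, lengths_bp)
tpm_vals = tpm(counts, lengths_bp)

# Column totals: RPKM totals depend on the sample, TPM totals do not
print(rpkm_vals.sum(axis=0))
print(tpm_vals.sum(axis=0))

# TPM is RPKM rescaled so that each sample sums to one million
tpm_from_rpkm = rpkm_vals.div(rpkm_vals.sum(axis=0), axis=1) * 1e6
```

In this toy case the RPKM column totals differ between the two samples while both TPM columns total one million, which is the property the bullets above rely on. The last line also shows that TPM can be recovered from RPKM by rescaling each sample, so the two units differ only by a per-sample constant.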
answered 1 week ago by Admin