What is the best way to normalize RNA-seq count data before differential expression analysis?
I'm doing differential expression analysis with DESeq2 in R. I have raw count data from featureCounts. Should I normalize the counts before passing them to DESeq2, or does DESeq2 handle this internally?
Also, what's the difference between TMM, TPM, RPKM, and DESeq2's own normalization?
1 Answer
✓ Accepted Answer
**Do NOT pre-normalize your counts before DESeq2.** DESeq2 expects raw integer counts and does its own normalization internally using the median-of-ratios method.
```r
library(DESeq2)
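# Load raw counts (NOT TPM or RPKM); rows are genes, columns are samples
counts <- as.matrix(read.csv('raw_counts.csv', row.names = 1))
# Sample table; this example assumes four samples in this column order
coldata <- data.frame(
  condition = factor(c('ctrl', 'ctrl', 'treat', 'treat'), levels = c('ctrl', 'treat')),
  row.names = colnames(counts)
)
dds <- DESeqDataSetFromMatrix(
  countData = round(counts),  # must be integers; only matters if fractional counting was used
  colData = coldata,
  design = ~ condition
)
dds <- DESeq(dds)    # normalization (size factor estimation) happens here
res <- results(dds)  # named `res` so it does not shadow the results() function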
```
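If you want to see what DESeq2 did, the size factors and the normalized counts can be pulled out of the object. Here is a minimal sketch on a made-up toy matrix (the numbers and the `toy_*` names are invented for illustration, they are not from the question), showing that the normalized counts are just the raw counts divided by each sample's size factor:

```r
library(DESeq2)

# Toy data: 4 genes x 4 samples (made-up numbers, for illustration only)
toy_counts <- matrix(
  c( 10,  20,  12,  24,
    100, 210,  95, 190,
     50,  90,  55, 120,
      5,  12,   6,  10),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:4))
)
toy_coldata <- data.frame(
  condition = factor(c("ctrl", "ctrl", "treat", "treat")),
  row.names = colnames(toy_counts)
)

toy_dds <- DESeqDataSetFromMatrix(toy_counts, toy_coldata, design = ~ condition)
toy_dds <- estimateSizeFactors(toy_dds)  # the normalization step on its own

sf <- sizeFactors(toy_dds)                         # one scaling factor per sample
norm_counts <- counts(toy_dds, normalized = TRUE)  # raw counts / size factor

# Same thing by hand: divide each column by its size factor
manual <- sweep(counts(toy_dds), 2, sf, "/")
```

These normalized counts are useful for plotting or exploring the data. The DE test itself still runs on the raw counts, with the size factors used inside the model.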
**Key differences:**
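- **RPKM/FPKM**: Normalize for sequencing depth AND gene length. Not suitable for cross-sample comparison, because the per-sample totals of RPKM values differ between samples.
- **TPM**: Also corrects for depth and gene length, but every sample sums to one million, so it is better than RPKM for cross-sample comparison. Still not ideal for DE, since DE tools like DESeq2 model raw counts.
- **TMM** (edgeR): Trimmed mean of M-values. Computes per-sample scaling factors that account for compositional bias. Good for edgeR, where it is the default.
- **DESeq2 median-of-ratios**: The default in DESeq2. Each sample's size factor is the median ratio of its counts to a per-gene geometric mean across samples, which makes it robust to a few highly expressed or strongly changing genes and to compositional effects.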
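To make the median-of-ratios idea concrete, here is a rough base R sketch of the calculation. It is a simplified illustration, not DESeq2's actual code: the function name `median_of_ratios` and the toy matrix are made up, and genes with a zero in any sample are simply dropped.

```r
# Simplified median-of-ratios size factors (illustrative sketch, base R only)
median_of_ratios <- function(counts) {
  log_counts <- log(counts)
  # Per-gene log geometric mean across samples (the pseudo-reference sample)
  log_geo_mean <- rowMeans(log_counts)
  # Genes with a zero in any sample have a log geometric mean of -Inf; skip them
  keep <- is.finite(log_geo_mean)
  # Size factor = median over genes of (sample count / geometric mean)
  apply(log_counts[keep, , drop = FALSE], 2, function(col) {
    exp(median(col - log_geo_mean[keep]))
  })
}

# Toy example: four genes are identical in both samples, geneE is 100x higher in s2
toy <- matrix(
  c(100,  100,
    200,  200,
     50,   50,
    400,  400,
     10, 1000),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("gene", LETTERS[1:5]), c("s1", "s2"))
)

sf_toy <- median_of_ratios(toy)  # both size factors are 1
lib_sizes <- colSums(toy)        # 760 vs 1750
```

Scaling by library size alone would shrink every gene in `s2` by more than half because of the single high gene, which is the compositional bias mentioned above. The median ignores that one gene and leaves the four unchanged genes unchanged.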