41

What is the best way to normalize RNA-seq count data before differential expression analysis?

I'm doing differential expression analysis with DESeq2 in R. I have raw count data from featureCounts. Should I normalize the counts before passing them to DESeq2, or does DESeq2 handle this internally? Also, what's the difference between TMM, TPM, RPKM, and DESeq2's own normalization?
3 views asked 5 days ago by Admin
1 Answer
35
✓ Accepted Answer
**Do NOT pre-normalize your counts before DESeq2.** DESeq2 expects raw integer counts and normalizes them internally with the median-of-ratios method.

```r
library(DESeq2)

# Load raw counts (NOT TPM or RPKM): genes in rows, samples in columns
counts <- read.csv('raw_counts.csv', row.names = 1)

# Sample table; this example assumes four samples in this column order
coldata <- data.frame(
  condition = factor(c('ctrl', 'ctrl', 'treat', 'treat')),
  row.names = colnames(counts)
)

dds <- DESeqDataSetFromMatrix(
  countData = as.matrix(round(counts)),  # must be integers
  colData   = coldata,
  design    = ~ condition
)

dds <- DESeq(dds)    # normalization (size factor estimation) happens here
res <- results(dds)  # named `res` so it does not share a name with results()
```

**Key differences:**

- **RPKM/FPKM**: Normalize for sequencing depth AND gene length. Not suitable for cross-sample comparison.
- **TPM**: Better than RPKM for cross-sample comparison, but still not ideal for DE: it does not correct for compositional bias, and DESeq2 needs raw counts anyway.
- **TMM** (edgeR): Trimmed mean of M-values. Accounts for compositional bias. Good for edgeR.
- **DESeq2 median-of-ratios**: What DESeq2 applies itself, so it is the right choice for DESeq2. Robust to outliers and compositional effects.
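If it helps to see what median-of-ratios actually does, here is a minimal base-R sketch of the idea. It is an illustration, not DESeq2's own code: the toy matrix and the `median_of_ratios` helper are made up for this example. Each gene's count is divided by that gene's geometric mean across samples, and a sample's size factor is the median of those ratios (genes with a zero in any sample are skipped).

```r
# Toy count matrix (made-up numbers): genes in rows, samples in columns.
counts <- matrix(
  c(100, 200,
     50, 100,
     30,  60,
      0,  10),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("s1", "s2"))
)

# Hypothetical helper illustrating the median-of-ratios idea.
median_of_ratios <- function(counts) {
  log_counts <- log(counts)
  # Log of each gene's geometric mean across samples;
  # this is -Inf for any gene with a zero count in some sample.
  log_geo_means <- rowMeans(log_counts)
  keep <- is.finite(log_geo_means)
  # Size factor per sample: median over kept genes of count / geometric mean.
  apply(log_counts[keep, , drop = FALSE], 2, function(col) {
    exp(median(col - log_geo_means[keep]))
  })
}

size_factors <- median_of_ratios(counts)

# Divide each sample (column) by its size factor to get normalized counts.
normalized <- sweep(counts, 2, size_factors, "/")
```

In a real analysis you would not compute this by hand: after `DESeq(dds)` the size factors are available via `sizeFactors(dds)`, and `counts(dds, normalized = TRUE)` returns the normalized counts, which are useful for plotting but should not be fed back into DESeq2.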
answered 4 days ago by Admin