41
How to align RNA-seq reads using STAR aligner — genome indexing to sorted BAM
I need to align paired-end RNA-seq reads to the human GRCh38 reference genome using STAR. What are the steps to: (1) generate the genome index, (2) run the alignment with splice-aware settings, and (3) get a sorted, indexed BAM for downstream quantification with featureCounts?
3 views
1 Answer
35
✓
✓ Accepted Answer
**Step 1 — Generate STAR genome index** (run once per reference)
```bash
STAR
--runMode genomeGenerate
--genomeDir /ref/hg38_star_index
--genomeFastaFiles /ref/hg38.fa
--sjdbGTFfile /ref/gencode.v44.annotation.gtf
--sjdbOverhang 149
--runThreadN 16
# sjdbOverhang = read_length - 1 (for 150 bp reads use 149)
```
**Step 2 — Align reads**
```bash
STAR
--runThreadN 8
--genomeDir /ref/hg38_star_index
--readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz
--readFilesCommand zcat
--outSAMtype BAM SortedByCoordinate
--outSAMattributes NH HI AS NM MD
--outFilterMultimapNmax 20
--alignSJoverhangMin 8
--alignSJDBoverhangMin 1
--alignIntronMin 20
--alignIntronMax 1000000
--outFileNamePrefix results/sample/
--quantMode GeneCounts
```
**Step 3 — Index the BAM**
```bash
samtools index results/sample/Aligned.sortedByCoord.out.bam
```
**Step 4 — Count genes with featureCounts**
```bash
featureCounts
-T 8 -p -s 2
-a /ref/gencode.v44.annotation.gtf
-o results/sample_counts.txt
results/sample/Aligned.sortedByCoord.out.bam
```
**Key parameters explained:**
- `--outSAMtype BAM SortedByCoordinate`: sorted output, no need for `samtools sort`
- `--quantMode GeneCounts`: STAR also outputs raw counts per gene directly
- `-s 2` in featureCounts: reverse-stranded library (most Illumina TruSeq kits)
- Check your strandedness with `infer_experiment.py` from RSeQC