How to align RNA-seq reads using STAR aligner — genome indexing to sorted BAM

Question

I need to align paired-end RNA-seq reads to the human GRCh38 reference genome using STAR. What are the steps to: (1) generate the genome index, (2) run the alignment with splice-aware settings, and (3) get a sorted, indexed BAM for downstream quantification with featureCounts?

Admin · Accepted Answer

**Step 1 — Generate STAR genome index** (run once per reference)
```bash
STAR 
  --runMode genomeGenerate 
  --genomeDir /ref/hg38_star_index 
  --genomeFastaFiles /ref/hg38.fa 
  --sjdbGTFfile /ref/gencode.v44.annotation.gtf 
  --sjdbOverhang 149 
  --runThreadN 16
# sjdbOverhang = read_length - 1 (for 150 bp reads use 149)
```

**Step 2 — Align reads**
```bash
STAR 
  --runThreadN 8 
  --genomeDir /ref/hg38_star_index 
  --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz 
  --readFilesCommand zcat 
  --outSAMtype BAM SortedByCoordinate 
  --outSAMattributes NH HI AS NM MD 
  --outFilterMultimapNmax 20 
  --alignSJoverhangMin 8 
  --alignSJDBoverhangMin 1 
  --alignIntronMin 20 
  --alignIntronMax 1000000 
  --outFileNamePrefix results/sample/ 
  --quantMode GeneCounts
```

**Step 3 — Index the BAM**
```bash
samtools index results/sample/Aligned.sortedByCoord.out.bam
```

**Step 4 — Count genes with featureCounts**
```bash
featureCounts 
  -T 8 -p -s 2 
  -a /ref/gencode.v44.annotation.gtf 
  -o results/sample_counts.txt 
  results/sample/Aligned.sortedByCoord.out.bam
```

**Key parameters explained:**
- `--outSAMtype BAM SortedByCoordinate`: sorted output, no need for `samtools sort`
- `--quantMode GeneCounts`: STAR also outputs raw counts per gene directly
- `-s 2` in featureCounts: reverse-stranded library (most Illumina TruSeq kits)
- Check your strandedness with `infer_experiment.py` from RSeQC