44
How to build a reproducible bioinformatics pipeline with Snakemake
I want to build a RNA-seq analysis pipeline that I can share with collaborators and rerun on different datasets without manually changing paths. I've heard Snakemake is good for this. What is the basic structure of a Snakemake workflow and how do I parameterize it with a config file?
3 views
1 Answer
37
✓
✓ Accepted Answer
Snakemake is excellent for this. Here's a minimal but complete RNA-seq Snakemake setup:
**config.yaml**
```yaml
samples:
- sample1
- sample2
- sample3
ref:
genome: /data/hg38/genome.fa
gtf: /data/hg38/genes.gtf
params:
trimmomatic:
adapters: TruSeq3-PE.fa
leading: 3
trailing: 3
minlen: 36
```
**Snakefile**
```python
configfile: "config.yaml"
SAMPLES = config["samples"]
rule all:
input:
expand("results/counts/{sample}.txt", sample=SAMPLES),
"results/multiqc_report.html"
rule trim:
input:
r1 = "data/{sample}_R1.fastq.gz",
r2 = "data/{sample}_R2.fastq.gz"
output:
r1 = "trimmed/{sample}_R1.fastq.gz",
r2 = "trimmed/{sample}_R2.fastq.gz"
shell:
"""
trimmomatic PE {input.r1} {input.r2}
{output.r1} /dev/null {output.r2} /dev/null
ILLUMINACLIP:{config[params][trimmomatic][adapters]}:2:30:10
LEADING:{config[params][trimmomatic][leading]}
TRAILING:{config[params][trimmomatic][trailing]}
MINLEN:{config[params][trimmomatic][minlen]}
"""
rule star_align:
input:
r1 = "trimmed/{sample}_R1.fastq.gz",
r2 = "trimmed/{sample}_R2.fastq.gz",
genome = config["ref"]["genome"]
output: "aligned/{sample}/Aligned.sortedByCoord.out.bam"
threads: 8
shell:
"""
STAR --runThreadN {threads}
--genomeDir $(dirname {input.genome})
--readFilesIn {input.r1} {input.r2}
--readFilesCommand zcat
--outSAMtype BAM SortedByCoordinate
--outFileNamePrefix aligned/{wildcards.sample}/
"""
rule featurecounts:
input:
bam = "aligned/{sample}/Aligned.sortedByCoord.out.bam",
gtf = config["ref"]["gtf"]
output: "results/counts/{sample}.txt"
shell:
"featureCounts -T 4 -a {input.gtf} -o {output} {input.bam}"
```
**Run it:**
```bash
# Dry run (preview what will execute)
snakemake -n
# Run with 8 cores
snakemake --cores 8
# Run on SLURM cluster
snakemake --cores 8 --cluster "sbatch -c {threads} --mem 16G"
```
Key Snakemake concepts: `wildcards` match sample names automatically, `expand()` generates all target file paths, `rule all` declares the final outputs.