How to build a reproducible bioinformatics pipeline with Snakemake

Question

I want to build a RNA-seq analysis pipeline that I can share with collaborators and rerun on different datasets without manually changing paths. I've heard Snakemake is good for this. What is the basic structure of a Snakemake workflow and how do I parameterize it with a config file?

Admin · Accepted Answer

Snakemake is excellent for this. Here's a minimal but complete RNA-seq Snakemake setup:

**config.yaml**
```yaml
samples:
  - sample1
  - sample2
  - sample3
ref:
  genome: /data/hg38/genome.fa
  gtf:    /data/hg38/genes.gtf
params:
  trimmomatic:
    adapters: TruSeq3-PE.fa
    leading:  3
    trailing: 3
    minlen:   36
```

**Snakefile**
```python
configfile: "config.yaml"

SAMPLES = config["samples"]

rule all:
    input:
        expand("results/counts/{sample}.txt", sample=SAMPLES),
        "results/multiqc_report.html"

rule trim:
    input:
        r1 = "data/{sample}_R1.fastq.gz",
        r2 = "data/{sample}_R2.fastq.gz"
    output:
        r1 = "trimmed/{sample}_R1.fastq.gz",
        r2 = "trimmed/{sample}_R2.fastq.gz"
    shell:
        """
        trimmomatic PE {input.r1} {input.r2} 
          {output.r1} /dev/null {output.r2} /dev/null 
          ILLUMINACLIP:{config[params][trimmomatic][adapters]}:2:30:10 
          LEADING:{config[params][trimmomatic][leading]} 
          TRAILING:{config[params][trimmomatic][trailing]} 
          MINLEN:{config[params][trimmomatic][minlen]}
        """

rule star_align:
    input:
        r1     = "trimmed/{sample}_R1.fastq.gz",
        r2     = "trimmed/{sample}_R2.fastq.gz",
        genome = config["ref"]["genome"]
    output: "aligned/{sample}/Aligned.sortedByCoord.out.bam"
    threads: 8
    shell:
        """
        STAR --runThreadN {threads} 
             --genomeDir $(dirname {input.genome}) 
             --readFilesIn {input.r1} {input.r2} 
             --readFilesCommand zcat 
             --outSAMtype BAM SortedByCoordinate 
             --outFileNamePrefix aligned/{wildcards.sample}/
        """

rule featurecounts:
    input:
        bam = "aligned/{sample}/Aligned.sortedByCoord.out.bam",
        gtf = config["ref"]["gtf"]
    output: "results/counts/{sample}.txt"
    shell:
        "featureCounts -T 4 -a {input.gtf} -o {output} {input.bam}"
```

**Run it:**
```bash
# Dry run (preview what will execute)
snakemake -n

# Run with 8 cores
snakemake --cores 8

# Run on SLURM cluster
snakemake --cores 8 --cluster "sbatch -c {threads} --mem 16G"
```

Key Snakemake concepts: `wildcards` match sample names automatically, `expand()` generates all target file paths, `rule all` declares the final outputs.