44

How to build a reproducible bioinformatics pipeline with Snakemake

I want to build a RNA-seq analysis pipeline that I can share with collaborators and rerun on different datasets without manually changing paths. I've heard Snakemake is good for this. What is the basic structure of a Snakemake workflow and how do I parameterize it with a config file?
4 views asked 1 week ago by Admin
1 Answer
37
✓ Accepted Answer
Snakemake is excellent for this. Here's a minimal but complete RNA-seq Snakemake setup: **config.yaml** ```yaml samples: - sample1 - sample2 - sample3 ref: genome: /data/hg38/genome.fa gtf: /data/hg38/genes.gtf params: trimmomatic: adapters: TruSeq3-PE.fa leading: 3 trailing: 3 minlen: 36 ``` **Snakefile** ```python configfile: "config.yaml" SAMPLES = config["samples"] rule all: input: expand("results/counts/{sample}.txt", sample=SAMPLES), "results/multiqc_report.html" rule trim: input: r1 = "data/{sample}_R1.fastq.gz", r2 = "data/{sample}_R2.fastq.gz" output: r1 = "trimmed/{sample}_R1.fastq.gz", r2 = "trimmed/{sample}_R2.fastq.gz" shell: """ trimmomatic PE {input.r1} {input.r2} {output.r1} /dev/null {output.r2} /dev/null ILLUMINACLIP:{config[params][trimmomatic][adapters]}:2:30:10 LEADING:{config[params][trimmomatic][leading]} TRAILING:{config[params][trimmomatic][trailing]} MINLEN:{config[params][trimmomatic][minlen]} """ rule star_align: input: r1 = "trimmed/{sample}_R1.fastq.gz", r2 = "trimmed/{sample}_R2.fastq.gz", genome = config["ref"]["genome"] output: "aligned/{sample}/Aligned.sortedByCoord.out.bam" threads: 8 shell: """ STAR --runThreadN {threads} --genomeDir $(dirname {input.genome}) --readFilesIn {input.r1} {input.r2} --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outFileNamePrefix aligned/{wildcards.sample}/ """ rule featurecounts: input: bam = "aligned/{sample}/Aligned.sortedByCoord.out.bam", gtf = config["ref"]["gtf"] output: "results/counts/{sample}.txt" shell: "featureCounts -T 4 -a {input.gtf} -o {output} {input.bam}" ``` **Run it:** ```bash # Dry run (preview what will execute) snakemake -n # Run with 8 cores snakemake --cores 8 # Run on SLURM cluster snakemake --cores 8 --cluster "sbatch -c {threads} --mem 16G" ``` Key Snakemake concepts: `wildcards` match sample names automatically, `expand()` generates all target file paths, `rule all` declares the final outputs.
answered 1 week ago by Admin