What is the standard bioinformatics pipeline for metagenome-assembled genome (MAG) analysis?
I have shotgun metagenomics data from soil samples (Illumina PE 150 bp, ~20 Gb per sample). I want to assemble the metagenomes, bin the contigs into MAGs, assess bin quality, and classify the MAGs taxonomically. Which tools should I use, and in what order, for a complete MAG analysis workflow?
1 Answer
✓ Accepted Answer
Here is a widely used short-read MAG pipeline. The commands below process a single sample; run them in this order:
**1. Quality control**
```bash
# Trim adapters and low-quality reads
fastp -i R1.fastq.gz -I R2.fastq.gz \
  -o R1_clean.fastq.gz -O R2_clean.fastq.gz \
  -j fastp_report.json -h fastp_report.html \
  -w 8 --detect_adapter_for_pe
```
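**2. Assembly with MEGAHIT** (memory-efficient on large, complex metagenomes such as soil; for a co-assembly, pass comma-separated file lists to `-1` and `-2`)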
```bash
# The output directory must not exist yet; MEGAHIT creates it
megahit -1 R1_clean.fastq.gz -2 R2_clean.fastq.gz \
  -o megahit_assembly \
  --min-contig-len 1000 \
  -t 16 --k-list 21,41,61,81,99
```
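Before binning, it can be worth checking how much of the assembly is in contigs long enough to bin. The sketch below computes contig count, total length, and N50 with standard `awk` and `sort` only; the function name `assembly_stats` is made up for illustration and is not part of MEGAHIT.

```bash
# Print contig count, total length and N50 for a FASTA file.
# assembly_stats is a hypothetical helper name, not a MEGAHIT command.
assembly_stats() {
    # Pass 1: emit one length per contig (sequences may span several lines).
    awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
         { len += length($0) }
         END { if (seen) print len }' "$1" |
    sort -rn |
    # Pass 2: walk lengths from longest to shortest until half of the
    # total assembly length is covered; that contig length is the N50.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             n50 = 0
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) { n50 = lens[i]; break }
             }
             printf "contigs\t%.0f\ntotal_bp\t%.0f\nN50\t%.0f\n", NR, total, n50
         }'
}
```

Usage would look like `assembly_stats megahit_assembly/final.contigs.fa`. Soil assemblies are often fragmented, so a low N50 here is expected rather than a sign that the run failed.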
**3. Map reads back to assembly (for coverage)**
```bash
bwa-mem2 index megahit_assembly/final.contigs.fa
bwa-mem2 mem -t 16 megahit_assembly/final.contigs.fa \
  R1_clean.fastq.gz R2_clean.fastq.gz |
  samtools sort -@ 8 -o mapped.bam -
samtools index mapped.bam
```
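With several samples, mapping each sample separately to the same assembly gives the binner one coverage column per sample, which usually separates genomes better than a single BAM. The sketch below only prints the mapping commands (a dry run) so they can be inspected before running. It assumes a hypothetical naming scheme, `<sample>_R1_clean.fastq.gz` and `<sample>_R2_clean.fastq.gz`, and paths without spaces; `mapping_commands` is an illustrative name.

```bash
# Print one mapping command per sample found in a reads directory.
# Assumed (hypothetical) file naming: <sample>_R1_clean.fastq.gz / <sample>_R2_clean.fastq.gz
# Usage: mapping_commands READS_DIR ASSEMBLY_FASTA
mapping_commands() (
    reads_dir=$1
    assembly=$2
    for r1 in "$reads_dir"/*_R1_clean.fastq.gz; do
        [ -e "$r1" ] || continue                       # no matching files at all
        r2=${r1%_R1_clean.fastq.gz}_R2_clean.fastq.gz  # expected mate file
        if [ ! -e "$r2" ]; then
            echo "skipping $r1: mate file not found" >&2
            continue
        fi
        sample=$(basename "$r1" _R1_clean.fastq.gz)
        printf 'bwa-mem2 mem -t 16 %s %s %s | samtools sort -@ 8 -o %s.bam - && samtools index %s.bam\n' \
            "$assembly" "$r1" "$r2" "$sample" "$sample"
    done
)
```

For example, `mapping_commands reads megahit_assembly/final.contigs.fa > map_all.sh` followed by `sh map_all.sh` would produce one BAM per sample. `jgi_summarize_bam_contig_depths` accepts several BAM files in one call, so all of them can be passed together in the binning step.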
**4. Bin contigs with MetaBAT2**
```bash
jgi_summarize_bam_contig_depths --outputDepth depth.txt mapped.bam
metabat2 -i megahit_assembly/final.contigs.fa \
  -a depth.txt \
  -o bins/bin \
  --minContig 2000 -t 16
```
**5. Assess bin quality with CheckM2**
```bash
checkm2 predict --threads 16 \
  --input bins/ --extension .fa \
  --output-directory checkm2_output
```
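**Quality thresholds** (MIMAG standards):
- High-quality MAG: completeness >90%, contamination <5%, plus 23S, 16S, and 5S rRNA genes and at least 18 tRNAs (CheckM2 reports only completeness and contamination, so the rRNA and tRNA genes need a separate check)
- Medium-quality MAG: completeness ≥50%, contamination <10%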
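The thresholds can be applied to the CheckM2 report with a few lines of `awk`. This is a minimal sketch, assuming a tab-separated report with a header row that contains `Name`, `Completeness`, and `Contamination` columns (as in CheckM2's `quality_report.tsv`; check your version's output). MIMAG's high-quality tier also requires rRNA and tRNA genes, which this table does not cover, so the top tier is labelled `high_candidate`. The function name `mimag_tiers` is made up for illustration.

```bash
# Label each bin by completeness/contamination tier.
# Columns are located by header name; mimag_tiers is a hypothetical helper.
mimag_tiers() {
    awk -F'\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Name" in col) || !("Completeness" in col) || !("Contamination" in col)) {
                print "expected Name, Completeness and Contamination columns" > "/dev/stderr"
                exit 1
            }
            next
        }
        {
            comp = $(col["Completeness"]) + 0
            cont = $(col["Contamination"]) + 0
            if (comp > 90 && cont < 5)        tier = "high_candidate"  # rRNA/tRNA not checked here
            else if (comp >= 50 && cont < 10) tier = "medium"
            else                              tier = "low"
            printf "%s\t%s\t%s\t%s\n", $(col["Name"]), $(col["Completeness"]), $(col["Contamination"]), tier
        }
    ' "$1"
}
```

A typical call would be `mimag_tiers checkm2_output/quality_report.tsv`, with the fourth output column used to decide which bins go forward to classification.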
**6. Classify MAGs with GTDB-Tk**
```bash
gtdbtk classify_wf \
  --genome_dir bins/ --extension .fa \
  --out_dir gtdbtk_output \
  --cpus 16 --skip_ani_screen
```
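Once classification finishes, a quick per-phylum tally is a useful first look at the results. The sketch below assumes GTDB-Tk's tab-separated summary tables (typically `gtdbtk.bac120.summary.tsv` and, for archaea, `gtdbtk.ar53.summary.tsv`; file names vary between versions) with a `classification` column holding a lineage such as `d__Bacteria;p__...;c__...`. The function name `phylum_counts` is illustrative.

```bash
# Count MAGs per phylum across one or more GTDB-Tk summary tables.
# Assumes a header row with a "classification" column (semicolon-separated lineage).
phylum_counts() {
    awk -F'\t' '
        FNR == 1 {                                # header of each input file
            cls = 0
            for (i = 1; i <= NF; i++) if ($i == "classification") cls = i
            next
        }
        cls {
            n = split($cls, rank, ";")
            # Second rank is the phylum; unclassified rows have no ranks to split
            phylum = (n >= 2) ? rank[2] : $cls
            count[phylum]++
        }
        END { for (p in count) printf "%s\t%d\n", p, count[p] }
    ' "$@" | sort -t "$(printf '\t')" -k2,2nr -k1,1
}
```

For example, `phylum_counts gtdbtk_output/gtdbtk.*.summary.tsv` would list phyla from most to least represented.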
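For bin refinement, DAS Tool can be run after binning and before CheckM2. It is designed to combine bin sets from several binners (for example MetaBAT2 plus MaxBin2 or CONCOCT) into one non-redundant set, which tends to help most on multi-sample co-assemblies.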
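DAS Tool takes each bin set as a tab-separated contig-to-bin table rather than a directory of FASTA files. DAS Tool ships its own helper script for this conversion; the sketch below shows the equivalent logic in plain shell, with the hypothetical function name `contigs_to_bin`, assuming one FASTA file per bin and contig IDs that end at the first whitespace in the header.

```bash
# Write a "contig<TAB>bin" table for every FASTA file in a bin directory.
# Usage: contigs_to_bin BIN_DIR [EXTENSION]   (extension defaults to "fa")
contigs_to_bin() (
    dir=$1
    ext=${2:-fa}
    for f in "$dir"/*."$ext"; do
        [ -e "$f" ] || continue                 # directory has no matching files
        bin=$(basename "$f" ".$ext")            # e.g. bins/bin.1.fa -> bin.1
        # Contig ID = header text up to the first whitespace, without ">"
        awk -v bin="$bin" '/^>/ { sub(/^>/, "", $1); print $1 "\t" bin }' "$f"
    done
)
```

Running `contigs_to_bin bins fa > metabat2_contigs2bin.tsv` once per binner gives the tables that DAS Tool's `-i` option expects as a comma-separated list; check `DAS_Tool --help` for the remaining options in your version.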