How to perform multiple sequence alignment with MUSCLE5 and prepare it for phylogenetic analysis
I have 120 protein sequences from orthologous genes across different fungal species. I need to create a multiple sequence alignment for phylogenetic analysis. Should I use MUSCLE5, MAFFT, or Clustal? And how do I clean up the alignment afterwards to remove poorly aligned regions?
1 Answer
✓ Accepted Answer
For 120 protein sequences, MUSCLE5 and MAFFT are both excellent choices. MUSCLE5 is often more accurate; MAFFT is faster for very large datasets (>1000 sequences).
**MUSCLE5 (recommended for accuracy):**
```bash
# Basic alignment
muscle -align sequences.fasta -output aligned_muscle.fasta
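# Stratified ensemble (slower): writes several replicate alignments in EFA format
muscle -align sequences.fasta -stratified -output ensemble.efa
# Extract the replicate with the highest column confidence as a single FASTA alignment
muscle -maxcc ensemble.efa -output aligned_muscle.fasta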
```
**MAFFT (alternative, very fast):**
```bash
# Auto mode (selects algorithm based on dataset size)
mafft --auto sequences.fasta > aligned_mafft.fasta
# L-INS-i mode: highest accuracy for < 200 sequences
mafft --localpair --maxiterate 1000 sequences.fasta > aligned_mafft.fasta
```
**Trim poorly aligned regions with trimAl:**
```bash
# Automated trimming (good for phylogenomics)
trimal -in aligned_muscle.fasta -out trimmed.fasta -automated1
# Alternative, manual threshold: drop columns where more than 50% of sequences have a gap
trimal -in aligned_muscle.fasta -out trimmed.fasta -gt 0.5
```
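To make the gap threshold concrete, here is a minimal pure-Python sketch of the same idea as `-gt 0.5`: keep a column only if at most half of the sequences have a gap there. This is an illustration, not trimAl's implementation; the function name `drop_gappy_columns` and the toy alignment are made up for the example.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Return a copy of the alignment without columns whose gap fraction exceeds the cutoff.

    `alignment` is a dict mapping sequence name -> aligned sequence (hypothetical input format).
    """
    names = list(alignment)
    rows = [alignment[n] for n in names]
    if not rows:
        return {}
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all sequences must have the same aligned length")
    # Indices of columns whose gap fraction is within the allowed limit
    keep = [
        i for i in range(length)
        if sum(r[i] == "-" for r in rows) / len(rows) <= max_gap_fraction
    ]
    return {n: "".join(r[i] for i in keep) for n, r in zip(names, rows)}


# Toy alignment: columns 3 and 6 are gaps in 3 of 4 sequences
toy = {
    "sp1": "MK-LV-",
    "sp2": "MKALV-",
    "sp3": "M--LVA",
    "sp4": "MK-L--",
}
trimmed = drop_gappy_columns(toy, max_gap_fraction=0.5)
for name, seq in trimmed.items():
    print(name, seq)
```

For real data, stick with trimAl; the sketch is only meant to show what the threshold does to individual columns.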
**Quality check in Python:**
```python
from Bio import AlignIO
align = AlignIO.read('trimmed.fasta', 'fasta')
print(f'Sequences: {len(align)}')
print(f'Alignment length: {align.get_alignment_length()} columns')
# Check gap content per column
gap_per_col = [
    sum(1 for r in align if r.seq[i] == '-') / len(align)
    for i in range(align.get_alignment_length())
]
print(f'Mean gap fraction: {sum(gap_per_col)/len(gap_per_col):.2%}')
```
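The check above looks at gaps per column. It can also be worth looking per sequence, since a fragmentary sequence that is mostly gaps after trimming may distort the tree. A minimal sketch, assuming the alignment is held as a plain dict of name to aligned string (with Biopython you could build one with `{r.id: str(r.seq) for r in align}`); the function name and the 50% cutoff are arbitrary choices for illustration:

```python
def gappy_sequences(alignment, max_gap_fraction=0.5):
    """Return {name: gap_fraction} for sequences whose gap fraction exceeds the cutoff."""
    flagged = {}
    for name, seq in alignment.items():
        if not seq:
            continue  # skip empty sequences rather than divide by zero
        frac = seq.count("-") / len(seq)
        if frac > max_gap_fraction:
            flagged[name] = frac
    return flagged


# Toy trimmed alignment (made-up data)
toy_trimmed = {
    "sp1": "MKLV",
    "sp2": "M---",
    "sp3": "MK--",
}
flagged = gappy_sequences(toy_trimmed)
for name, frac in flagged.items():
    print(f"{name}: {frac:.0%} gaps")
```

Whether to drop a flagged sequence is a judgment call; it depends on how important that taxon is to your question.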
For phylogenomics with many loci, consider using OrthoFinder and/or BUSCO to identify orthologs across genomes, then a concatenation or coalescent-based species-tree approach, instead of relying on single-gene trees.
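If you go the concatenation route, the core step is joining the per-locus alignments into one supermatrix and padding taxa that are missing from a locus with gaps. A minimal sketch in pure Python, assuming each locus is a dict of taxon name to aligned sequence; the function name `concatenate_loci` and the `locusN` partition labels are hypothetical, and dedicated tools do this more robustly:

```python
def concatenate_loci(loci):
    """Concatenate per-locus alignments into a supermatrix.

    `loci` is a list of dicts (taxon -> aligned sequence), one dict per locus.
    Returns (supermatrix, partitions), where partitions holds 1-based,
    inclusive (name, start, end) coordinates for each locus.
    """
    taxa = sorted({t for locus in loci for t in locus})
    pieces = {t: [] for t in taxa}
    partitions = []
    start = 1
    for idx, locus in enumerate(loci, 1):
        lengths = {len(s) for s in locus.values()}
        if len(lengths) != 1:
            raise ValueError(f"locus {idx} is empty or not aligned to a single length")
        length = lengths.pop()
        for t in taxa:
            # Taxa absent from this locus are padded with gaps
            pieces[t].append(locus.get(t, "-" * length))
        partitions.append((f"locus{idx}", start, start + length - 1))
        start += length
    supermatrix = {t: "".join(p) for t, p in pieces.items()}
    return supermatrix, partitions


# Two toy loci with partially overlapping taxa (made-up data)
toy_loci = [
    {"A": "MK-L", "B": "MKAL"},
    {"A": "VV", "C": "V-"},
]
supermatrix, partitions = concatenate_loci(toy_loci)
for taxon, seq in supermatrix.items():
    print(taxon, seq)
print(partitions)
```

The partition coordinates are what you would hand to a tree-inference program so that each locus can get its own substitution model.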