How to perform multiple sequence alignment with MUSCLE5 and prepare it for phylogenetic analysis

Question

I have 120 protein sequences from orthologous genes across different fungal species. I need to create a multiple sequence alignment for phylogenetic analysis. Should I use MUSCLE5, MAFFT, or Clustal? And how do I clean up the alignment afterwards to remove poorly aligned regions?

Admin · Accepted Answer

For 120 protein sequences, MUSCLE5 and MAFFT are both excellent choices. MUSCLE5 is often more accurate; MAFFT is faster for very large datasets (>1000 sequences).

**MUSCLE5 (recommended for accuracy):**
```bash
# Basic alignment
muscle -align sequences.fasta -output aligned_muscle.fasta

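# Ensemble mode (slower): -stratified writes several replicate alignments in EFA format
muscle -align sequences.fasta -stratified -output ensemble.efa

# Extract the replicate with the highest column confidence as a single FASTA alignment
muscle -maxcc ensemble.efa -output aligned_muscle.fasta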
```

**MAFFT (alternative, very fast):**
```bash
# Auto mode (selects algorithm based on dataset size)
mafft --auto sequences.fasta > aligned_mafft.fasta

# L-INS-i mode: highest accuracy for < 200 sequences
mafft --localpair --maxiterate 1000 sequences.fasta > aligned_mafft.fasta
```

**Trim poorly aligned regions with trimAl:**
```bash
# Automated trimming (good for phylogenomics)
trimal -in aligned_muscle.fasta -out trimmed.fasta -automated1

# Manual: remove columns in which more than 50% of sequences have a gap
trimal -in aligned_muscle.fasta -out trimmed.fasta -gt 0.5
```
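To make the gap threshold concrete, here is a minimal pure-Python sketch of the idea behind `-gt 0.5`: keep a column only if at least half of the sequences have a residue there. This is not trimAl's implementation. The function name `drop_gappy_columns`, the `min_ungapped` parameter, and the toy sequences are all hypothetical, and trimAl's exact handling of the boundary case may differ.

```python
def drop_gappy_columns(seqs, min_ungapped=0.5):
    """Keep columns where at least `min_ungapped` of the sequences have a residue.

    `seqs` maps sequence name -> aligned sequence (all the same length).
    Illustrative only; not trimAl's actual algorithm.
    """
    names = list(seqs)
    rows = [seqs[n] for n in names]
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all sequences must have the same aligned length")

    # Indices of columns whose ungapped fraction meets the threshold
    keep = [
        i for i in range(length)
        if sum(r[i] != '-' for r in rows) / len(rows) >= min_ungapped
    ]
    return {n: ''.join(r[i] for i in keep) for n, r in zip(names, rows)}


# Toy alignment (made-up sequences)
toy = {
    "sp1": "MK-LV-A",
    "sp2": "MKTLV--",
    "sp3": "M--LVQ-",
    "sp4": "MK-LV--",
}
trimmed = drop_gappy_columns(toy, min_ungapped=0.5)
for name, seq in trimmed.items():
    print(name, seq)
```

In the toy data, columns 3, 6 and 7 each have a residue in only one of four sequences, so they are dropped and four columns remain.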

**Quality check in Python:**
```python
from Bio import AlignIO

align = AlignIO.read('trimmed.fasta', 'fasta')
print(f'Sequences: {len(align)}')
print(f'Alignment length: {align.get_alignment_length()} columns')

# Check gap content per column
gap_per_col = [sum(1 for r in align if r.seq[i] == '-') / len(align)
               for i in range(align.get_alignment_length())]
print(f'Mean gap fraction: {sum(gap_per_col)/len(gap_per_col):.2%}')
```
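Column-wise gap content can hide individual sequences that are mostly gaps after trimming, for example fragments or mis-annotated genes. A rough sketch of a per-sequence check follows; the function names, the 50% cutoff, and the toy data are assumptions for illustration rather than an established standard.

```python
def gap_fraction_per_sequence(seqs):
    """Return {name: fraction of gap characters} for aligned sequences."""
    return {
        name: (seq.count('-') / len(seq) if seq else 0.0)
        for name, seq in seqs.items()
    }


def flag_gappy_sequences(seqs, max_gap=0.5):
    """Names of sequences whose gap fraction exceeds `max_gap` (hypothetical cutoff)."""
    fractions = gap_fraction_per_sequence(seqs)
    return sorted(name for name, frac in fractions.items() if frac > max_gap)


# Toy trimmed alignment (made-up sequences)
toy_trimmed = {
    "sp1": "MKLV",
    "sp2": "MK-V",
    "sp3": "M---",
}
fractions = gap_fraction_per_sequence(toy_trimmed)
flagged = flag_gappy_sequences(toy_trimmed, max_gap=0.5)
print(fractions)
print("Flagged:", flagged)
```

With Biopython, the input dictionary can be built from the alignment above as `{r.id: str(r.seq) for r in align}`. Whether to drop flagged sequences or re-check the underlying gene models is a judgment call for your dataset.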

For phylogenomics with many loci, consider an OrthoFinder + BUSCO workflow followed by concatenation or a coalescent-based species-tree method, rather than relying on single-gene trees.
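
As a rough illustration of the concatenation step, the sketch below joins per-locus alignments into one supermatrix and records partition boundaries. It assumes each locus is already aligned and trimmed, and that a taxon missing from a locus is padded with gaps. The function name `concatenate_loci`, the `locusN` labels, and the toy data are hypothetical.

```python
def concatenate_loci(loci):
    """Concatenate per-locus alignments into a supermatrix.

    `loci` is a list of dicts, each mapping taxon -> aligned sequence.
    Returns (supermatrix, partitions), where partitions holds
    (label, start, end) tuples with 1-based inclusive coordinates.
    """
    taxa = sorted(set().union(*loci))
    pieces = {t: [] for t in taxa}
    partitions = []
    start = 1
    for idx, locus in enumerate(loci, 1):
        lengths = {len(s) for s in locus.values()}
        if len(lengths) != 1:
            raise ValueError(f"locus {idx}: sequences differ in length")
        length = lengths.pop()
        for t in taxa:
            # Taxa absent from this locus are filled with gaps (assumption)
            pieces[t].append(locus.get(t, '-' * length))
        partitions.append((f"locus{idx}", start, start + length - 1))
        start += length
    supermatrix = {t: ''.join(parts) for t, parts in pieces.items()}
    return supermatrix, partitions


# Toy data: taxon B is missing from the second locus
locus1 = {"A": "MK-L", "B": "MKTL", "C": "MKTL"}
locus2 = {"A": "GG", "C": "GA"}
supermatrix, partitions = concatenate_loci([locus1, locus2])
for taxon, seq in supermatrix.items():
    print(taxon, seq)
print(partitions)
```

The partition coordinates are useful if you later want to fit a separate substitution model to each locus in the tree-building program.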