
How to perform multiple sequence alignment with MUSCLE5 and prepare it for phylogenetic analysis

I have 120 protein sequences from orthologous genes across different fungal species. I need to create a multiple sequence alignment for phylogenetic analysis. Should I use MUSCLE5, MAFFT, or Clustal? And how do I clean up the alignment afterwards to remove poorly aligned regions?
1 views asked 1 month ago by Admin
1 Answer
✓ Accepted Answer
For 120 protein sequences, MUSCLE5 and MAFFT are both excellent choices. MUSCLE5 is often more accurate; MAFFT is faster for very large datasets (>1000 sequences).

**MUSCLE5 (recommended for accuracy):**

```bash
# Basic alignment
muscle -align sequences.fasta -output aligned_muscle.fasta

# Ensemble mode (slower): -stratified writes several replicate alignments in EFA format
muscle -align sequences.fasta -stratified -output ensemble_muscle.efa
# Extract the replicate with the highest column confidence as a single alignment
muscle -maxcc ensemble_muscle.efa -output aligned_muscle.fasta
```

**MAFFT (alternative, very fast):**

```bash
# Auto mode (selects algorithm based on dataset size)
mafft --auto sequences.fasta > aligned_mafft.fasta

# L-INS-i mode: highest accuracy for < 200 sequences
mafft --localpair --maxiterate 1000 sequences.fasta > aligned_mafft.fasta
```

**Trim poorly aligned regions with trimAl:**

```bash
# Automated trimming (heuristic tuned for maximum-likelihood tree inference)
trimal -in aligned_muscle.fasta -out trimmed.fasta -automated1

# Alternative, manual threshold: remove columns with >50% gaps
trimal -in aligned_muscle.fasta -out trimmed.fasta -gt 0.5
```

**Quality check in Python:**

```python
from Bio import AlignIO

align = AlignIO.read('trimmed.fasta', 'fasta')
print(f'Sequences: {len(align)}')
print(f'Alignment length: {align.get_alignment_length()} columns')

# Check gap content per column
gap_per_col = [
    sum(1 for r in align if r.seq[i] == '-') / len(align)
    for i in range(align.get_alignment_length())
]
print(f'Mean gap fraction: {sum(gap_per_col)/len(gap_per_col):.2%}')
```

For phylogenomics with many loci, consider an OrthoFinder + BUSCO workflow followed by concatenation or coalescent-based species-tree inference instead of single-gene trees.
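To make the gap-threshold trimming step more concrete, here is a minimal pure-Python sketch of the idea behind `trimal -gt 0.5`: compute the gap fraction of each column and keep only columns at or below a cutoff. It is an illustration, not trimAl's actual implementation. The function names `gap_fractions` and `drop_gappy_columns` and the toy sequences are hypothetical, and trimAl's handling of columns exactly at the threshold may differ.

```python
def gap_fractions(seqs):
    """Return the fraction of gap characters ('-') in each alignment column."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    return [sum(s[i] == "-" for s in seqs) / len(seqs) for i in range(length)]


def drop_gappy_columns(seqs, max_gap_fraction=0.5):
    """Keep only columns whose gap fraction is <= max_gap_fraction."""
    fractions = gap_fractions(seqs)
    keep = [i for i, f in enumerate(fractions) if f <= max_gap_fraction]
    return ["".join(s[i] for i in keep) for s in seqs]


# Toy alignment (hypothetical data): 4 sequences, 6 columns.
# Columns 3 and 6 are 75% gaps and should be removed.
toy = [
    "MK-LV-",
    "MK-LVA",
    "MKALV-",
    "M--LV-",
]
trimmed = drop_gappy_columns(toy, max_gap_fraction=0.5)
for row in trimmed:
    print(row)
```

On real data, trimAl itself is the better choice, since it also offers the automated heuristics; a sketch like this is mainly useful for sanity-checking what a given threshold would remove.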
answered 1 day ago by Admin