15
How do I calculate pairwise sequence identity from a multiple sequence alignment in BioPython?
I have a multiple sequence alignment (MSA) in FASTA format and I want to calculate pairwise percent identity for all pairs of sequences. I'm using BioPython but can't find a built-in function for this.
My alignment file has 50 sequences and I need an identity matrix.
5 views
1 Answer
12
✓
✓ Accepted Answer
BioPython doesn't have a built-in pairwise identity function for MSAs, but it's easy to write one:
```python
from Bio import AlignIO
import numpy as np
def pairwise_identity(align):
n = len(align)
ids = np.zeros((n, n))
for i in range(n):
for j in range(i, n):
seq1 = str(align[i].seq)
seq2 = str(align[j].seq)
# Count positions where both are not gaps
pairs = [(a, b) for a, b in zip(seq1, seq2)
if a != '-' and b != '-']
if not pairs:
continue
matches = sum(a == b for a, b in pairs)
identity = matches / len(pairs) * 100
ids[i][j] = ids[j][i] = identity
return ids
align = AlignIO.read('alignment.fasta', 'fasta')
identity_matrix = pairwise_identity(align)
# Print as table
names = [rec.id for rec in align]
for i, name in enumerate(names):
row = 't'.join(f'{identity_matrix[i][j]:.1f}' for j in range(len(names)))
print(f'{name}t{row}')
```
For large alignments (>200 sequences) use the `pyMSAviz` library or convert to a distance matrix with `Bio.Phylo.TreeConstruction.DistanceCalculator`.