15

How do I calculate pairwise sequence identity from a multiple sequence alignment in BioPython?

I have a multiple sequence alignment (MSA) in FASTA format and I want to calculate pairwise percent identity for all pairs of sequences. I'm using BioPython but can't find a built-in function for this. My alignment file has 50 sequences and I need an identity matrix.
2 views asked 2 days ago by Admin
1 Answer
12
✓ Accepted Answer
BioPython doesn't have a built-in pairwise identity function for MSAs, but it's easy to write one: ```python from Bio import AlignIO import numpy as np def pairwise_identity(align): n = len(align) ids = np.zeros((n, n)) for i in range(n): for j in range(i, n): seq1 = str(align[i].seq) seq2 = str(align[j].seq) # Count positions where both are not gaps pairs = [(a, b) for a, b in zip(seq1, seq2) if a != '-' and b != '-'] if not pairs: continue matches = sum(a == b for a, b in pairs) identity = matches / len(pairs) * 100 ids[i][j] = ids[j][i] = identity return ids align = AlignIO.read('alignment.fasta', 'fasta') identity_matrix = pairwise_identity(align) # Print as table names = [rec.id for rec in align] for i, name in enumerate(names): row = 't'.join(f'{identity_matrix[i][j]:.1f}' for j in range(len(names))) print(f'{name}t{row}') ``` For large alignments (>200 sequences) use the `pyMSAviz` library or convert to a distance matrix with `Bio.Phylo.TreeConstruction.DistanceCalculator`.
answered 3 days ago by Admin