33
How to parse, filter, and manipulate FASTA files using Biopython
I have a multi-sequence FASTA file with 50,000 protein sequences. I want to: (1) filter sequences by length (keep only 100–500 aa), (2) calculate amino acid composition, (3) write filtered sequences to a new FASTA. What is the most efficient way to do this with Biopython?
4 views
1 Answer
27
✓
✓ Accepted Answer
Here is a complete example using `Bio.SeqIO`:
```python
from Bio import SeqIO
from collections import Counter
import pandas as pd
def filter_and_analyze_fasta(input_fasta, output_fasta, min_len=100, max_len=500):
stats = []
filtered = []
for record in SeqIO.parse(input_fasta, 'fasta'):
seq_len = len(record.seq)
if min_len <= seq_len <= max_len:
filtered.append(record)
# Amino acid composition
aa_count = Counter(str(record.seq).upper())
stats.append({
'id': record.id,
'length': seq_len,
'gc': None, # not applicable for protein
'A_pct': aa_count['A'] / seq_len * 100,
'G_pct': aa_count['G'] / seq_len * 100,
})
# Write filtered sequences
written = SeqIO.write(filtered, output_fasta, 'fasta')
print(f'Wrote {written} sequences')
return pd.DataFrame(stats)
df = filter_and_analyze_fasta(
'all_proteins.fasta',
'filtered_proteins.fasta'
)
print(df.describe())
```
**For very large files (>1M sequences)**, use a generator approach to avoid loading everything into memory:
```python
filtered = (
rec for rec in SeqIO.parse('large.fasta', 'fasta')
if 100 <= len(rec.seq) <= 500
)
SeqIO.write(filtered, 'filtered.fasta', 'fasta')
```
**Useful Biopython tips:**
- `SeqIO.to_dict()` loads all records into a dict by ID (fast lookup, needs RAM)
- `SeqIO.index()` creates a disk-backed index (slow lookup, no RAM needed)
- For nucleotide GC content: `from Bio.SeqUtils import gc_fraction`