33

How to parse, filter, and manipulate FASTA files using Biopython

I have a multi-sequence FASTA file with 50,000 protein sequences. I want to: (1) filter sequences by length (keep only 100–500 aa), (2) calculate amino acid composition, (3) write filtered sequences to a new FASTA. What is the most efficient way to do this with Biopython?
3 views asked 1 week ago by Admin
1 Answer
27
✓ Accepted Answer
Here is a complete example using `Bio.SeqIO`: ```python from Bio import SeqIO from collections import Counter import pandas as pd def filter_and_analyze_fasta(input_fasta, output_fasta, min_len=100, max_len=500): stats = [] filtered = [] for record in SeqIO.parse(input_fasta, 'fasta'): seq_len = len(record.seq) if min_len <= seq_len <= max_len: filtered.append(record) # Amino acid composition aa_count = Counter(str(record.seq).upper()) stats.append({ 'id': record.id, 'length': seq_len, 'gc': None, # not applicable for protein 'A_pct': aa_count['A'] / seq_len * 100, 'G_pct': aa_count['G'] / seq_len * 100, }) # Write filtered sequences written = SeqIO.write(filtered, output_fasta, 'fasta') print(f'Wrote {written} sequences') return pd.DataFrame(stats) df = filter_and_analyze_fasta( 'all_proteins.fasta', 'filtered_proteins.fasta' ) print(df.describe()) ``` **For very large files (>1M sequences)**, use a generator approach to avoid loading everything into memory: ```python filtered = ( rec for rec in SeqIO.parse('large.fasta', 'fasta') if 100 <= len(rec.seq) <= 500 ) SeqIO.write(filtered, 'filtered.fasta', 'fasta') ``` **Useful Biopython tips:** - `SeqIO.to_dict()` loads all records into a dict by ID (fast lookup, needs RAM) - `SeqIO.index()` creates a disk-backed index (slow lookup, no RAM needed) - For nucleotide GC content: `from Bio.SeqUtils import gc_fraction`
answered 4 weeks ago by Admin