22
How to annotate protein domains using HMMER hmmscan against the Pfam database
I have 500 novel protein sequences predicted from a de novo genome assembly and I want to annotate them with known functional domains. How do I use HMMER hmmscan to search against the Pfam database, and how do I interpret the output to get reliable domain annotations?
10 views
1 Answer
19
✓
✓ Accepted Answer
**Step 1 — Download and press the Pfam database**
```bash
# Download Pfam-A.hmm from InterPro
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
# Press the database (creates binary index files)
hmmpress Pfam-A.hmm
```
**Step 2 — Run hmmscan**
```bash
hmmscan
--cpu 8
--domtblout domains.txt
--tblout proteins.txt
-E 1e-5
--domE 1e-5
Pfam-A.hmm
proteins.fasta
```
**Step 3 — Parse the domain table output**
```python
import pandas as pd
def parse_hmmscan_domtbl(filepath):
rows = []
with open(filepath) as f:
for line in f:
if line.startswith('#'): continue
cols = line.split()
if len(cols) < 22: continue
rows.append({
'target_name': cols[0], # Pfam domain name
'query_name': cols[3], # Your protein
'evalue': float(cols[6]),
'score': float(cols[7]),
'domain_start': int(cols[17]),
'domain_end': int(cols[18]),
'description': ' '.join(cols[22:])
})
df = pd.DataFrame(rows)
# Filter to significant hits only
return df[df['evalue'] < 1e-5]
domains = parse_hmmscan_domtbl('domains.txt')
print(domains.groupby('target_name').size().sort_values(ascending=False).head(20))
```
**Significance thresholds:**
- Use Pfam's own GA (gathering threshold) with `--cut_ga` flag for the most reliable annotations
- E-value < 1e-5 is a reasonable independent threshold
- Always check `i-Evalue` (independent E-value per domain) in column 12
For faster annotation of large proteomes, InterProScan wraps HMMER + many other databases in one command.