How to annotate protein domains using HMMER hmmscan against the Pfam database

Question

I have 500 novel protein sequences predicted from a de novo genome assembly and I want to annotate them with known functional domains. How do I use HMMER hmmscan to search against the Pfam database, and how do I interpret the output to get reliable domain annotations?

Admin · Accepted Answer

**Step 1 — Download and press the Pfam database**
```bash
# Download Pfam-A.hmm from InterPro
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz

# Press the database (creates binary index files)
hmmpress Pfam-A.hmm
```

**Step 2 — Run hmmscan**
```bash
hmmscan 
  --cpu 8 
  --domtblout domains.txt 
  --tblout proteins.txt 
  -E 1e-5 
  --domE 1e-5 
  Pfam-A.hmm 
  proteins.fasta
```

**Step 3 — Parse the domain table output**
```python
import pandas as pd

def parse_hmmscan_domtbl(filepath):
    rows = []
    with open(filepath) as f:
        for line in f:
            if line.startswith('#'): continue
            cols = line.split()
            if len(cols) < 22: continue
            rows.append({
                'target_name':   cols[0],   # Pfam domain name
                'query_name':    cols[3],   # Your protein
                'evalue':        float(cols[6]),
                'score':         float(cols[7]),
                'domain_start':  int(cols[17]),
                'domain_end':    int(cols[18]),
                'description':   ' '.join(cols[22:])
            })
    df = pd.DataFrame(rows)
    # Filter to significant hits only
    return df[df['evalue'] < 1e-5]

domains = parse_hmmscan_domtbl('domains.txt')
print(domains.groupby('target_name').size().sort_values(ascending=False).head(20))
```

**Significance thresholds:**
- Use Pfam's own GA (gathering threshold) with `--cut_ga` flag for the most reliable annotations
- E-value < 1e-5 is a reasonable independent threshold
- Always check `i-Evalue` (independent E-value per domain) in column 12

For faster annotation of large proteomes, InterProScan wraps HMMER + many other databases in one command.