22

How to annotate protein domains using HMMER hmmscan against the Pfam database

I have 500 novel protein sequences predicted from a de novo genome assembly and I want to annotate them with known functional domains. How do I use HMMER hmmscan to search against the Pfam database, and how do I interpret the output to get reliable domain annotations?
9 views asked 1 week ago by Admin
1 Answer
19
✓ Accepted Answer
**Step 1 — Download and press the Pfam database** ```bash # Download Pfam-A.hmm from InterPro wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz gunzip Pfam-A.hmm.gz # Press the database (creates binary index files) hmmpress Pfam-A.hmm ``` **Step 2 — Run hmmscan** ```bash hmmscan --cpu 8 --domtblout domains.txt --tblout proteins.txt -E 1e-5 --domE 1e-5 Pfam-A.hmm proteins.fasta ``` **Step 3 — Parse the domain table output** ```python import pandas as pd def parse_hmmscan_domtbl(filepath): rows = [] with open(filepath) as f: for line in f: if line.startswith('#'): continue cols = line.split() if len(cols) < 22: continue rows.append({ 'target_name': cols[0], # Pfam domain name 'query_name': cols[3], # Your protein 'evalue': float(cols[6]), 'score': float(cols[7]), 'domain_start': int(cols[17]), 'domain_end': int(cols[18]), 'description': ' '.join(cols[22:]) }) df = pd.DataFrame(rows) # Filter to significant hits only return df[df['evalue'] < 1e-5] domains = parse_hmmscan_domtbl('domains.txt') print(domains.groupby('target_name').size().sort_values(ascending=False).head(20)) ``` **Significance thresholds:** - Use Pfam's own GA (gathering threshold) with `--cut_ga` flag for the most reliable annotations - E-value < 1e-5 is a reasonable independent threshold - Always check `i-Evalue` (independent E-value per domain) in column 12 For faster annotation of large proteomes, InterProScan wraps HMMER + many other databases in one command.
answered 3 weeks ago by Admin