29 Questions Asked
How to handle batch effects in scRNA-seq data using Seurat?
I’m integrating scRNA-seq datasets from 3 different batches (different labs, same tissue type). After merging in Seurat, the UMAP clusters by batch rather than by…
Fastest way to parse a large VCF file in Python for GWAS analysis?
I have a VCF file with ~15 million SNPs and 5000 samples (~40 GB). I need to extract allele frequencies and filter by MAF >…
How do I calculate pairwise sequence identity from a multiple sequence alignment in BioPython?
I have a multiple sequence alignment (MSA) in FASTA format and I want to calculate pairwise percent identity for all pairs of sequences. I’m using…
How do I perform a local BLAST search against a custom protein database in Python?
I have a set of protein sequences in a FASTA file and I want to run a local BLAST search against a custom database I…
How to analyze 16S rRNA amplicon sequencing data with QIIME2 from raw reads to diversity metrics
I have paired-end 16S V4 amplicon sequencing data (Illumina MiSeq, 250 bp PE reads) from 20 gut microbiome samples. I want to identify taxa, calculate…
How to use Docker and Singularity to containerize bioinformatics tools for reproducibility
I want to make my bioinformatics analysis fully reproducible using containers. My HPC cluster doesn’t allow Docker (requires root), but Singularity is available. How do…
What is the best way to normalize RNA-seq count data before differential expression analysis?
I’m doing differential expression analysis with DESeq2 in R. I have raw count data from featureCounts. Should I normalize the counts before passing them to…
Genome assembly with Flye for long reads: what coverage depth is needed for a good assembly?
I’m assembling a bacterial genome (~4.5 Mb) using Oxford Nanopore reads with Flye. I have about 15x coverage right now. The assembly is fragmented (150+…
How to annotate protein domains using HMMER hmmscan against the Pfam database
I have 500 novel protein sequences predicted from a de novo genome assembly and I want to annotate them with known functional domains. How do…
31 Answers Given
Bioinformatics is an interdisciplinary field that develops and uses computational methods, software tools, and statistics to store, analyze, and interpret large, complex biological datasets, particularly…
If you're stuck with 15x coverage, you can try Raven or Miniasm as alternatives — they sometimes perform better at low coverage: ```bash raven --threads…
For 120 protein sequences, MUSCLE5 and MAFFT are both excellent choices. MUSCLE5 is often more accurate; MAFFT is faster for very large datasets (>1000 sequences).…
Here is the complete QIIME2 workflow for paired-end 16S data: **1. Import reads** ```bash qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path manifest.csv --input-format PairedEndFastqManifestPhred33V2 --output-path demux.qza…
You need to first create the BLAST database using `makeblastdb` before you can query it. Here's the full workflow: ```python from Bio.Blast.Applications import NcbimakeblastdbCommandline, NcbiblastpCommandline…
Use **cyvcf2**. It's ~20x faster than PyVCF because it is a thin Cython wrapper around the C library htslib: ```python from cyvcf2 import VCF import numpy as np vcf =…
BioPython doesn't have a built-in pairwise identity function for MSAs, but it's easy to write one: ```python from Bio import AlignIO import numpy as np…
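Since the answer's own code is cut off in this preview, here is a minimal, dependency-free sketch of the idea. It assumes identity is measured only over columns where neither sequence has a gap (other definitions divide by alignment length or by the shorter sequence), and the helper names and toy alignment are made up for illustration. With BioPython you would build the dict from the records returned by `AlignIO.read`.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Assumption: gapped columns are ignored entirely. Other conventions exist.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def pairwise_identity(alignment):
    """alignment: dict mapping sequence id -> aligned sequence string."""
    return {
        (id_a, id_b): percent_identity(alignment[id_a], alignment[id_b])
        for id_a, id_b in combinations(alignment, 2)
    }


# Hypothetical toy alignment, not data from the question.
toy_msa = {
    "seq1": "ACGT-ACGT",
    "seq2": "ACGTTACGA",
    "seq3": "AC-TTACGT",
}
identities = pairwise_identity(toy_msa)
for pair, value in identities.items():
    print(pair, f"{value:.1f}%")
```

Whichever denominator you pick, state it when reporting results, because the numbers can differ noticeably for gappy alignments.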
For bacterial genomes with Flye and Nanopore reads, you generally want **30–60x coverage** for a good assembly. 15x is too low and explains the fragmentation.…
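To turn that coverage range into a sequencing target, the arithmetic is simply coverage = total bases / genome size. A rough sketch using the ~4.5 Mb genome and 15x figure from the question (the function names are just for illustration):

```python
def mean_coverage(total_bases, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


def bases_needed(target_coverage, genome_size):
    """Total bases required to reach a target average depth."""
    return target_coverage * genome_size


GENOME_SIZE = 4_500_000  # ~4.5 Mb, as stated in the question

current_bases = bases_needed(15, GENOME_SIZE)  # what 15x corresponds to
low_target = bases_needed(30, GENOME_SIZE)     # lower end of the 30-60x range
high_target = bases_needed(60, GENOME_SIZE)    # upper end of the 30-60x range

print(f"current: {current_bases / 1e6:.1f} Mb")
print(f"needed:  {low_target / 1e6:.1f}-{high_target / 1e6:.1f} Mb")
print(f"extra for 30x: {(low_target - current_bases) / 1e6:.1f} Mb")
```

This is average depth only. It says nothing about read length distribution, which also matters for how contiguous a long-read assembly ends up.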
**Do NOT pre-normalize your counts before DESeq2.** DESeq2 expects raw integer counts and does its own normalization internally using the median-of-ratios method. ```r library(DESeq2) #…
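For intuition about what median-of-ratios does, here is a small Python sketch of the size-factor calculation. This is an illustrative re-implementation under my own assumptions (a genes-by-samples list of lists, and genes with a zero in any sample excluded from the reference), not DESeq2's actual code, and the toy counts are made up.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    Genes with a zero count in any sample are skipped, since their
    geometric mean across samples is zero.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has nonzero counts in every sample")

    # Log of the per-gene geometric mean across samples (the pseudo-reference).
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]

    factors = []
    for j in range(n_samples):
        # Log ratio of each gene's count in sample j to the pseudo-reference.
        log_ratios = [
            math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical toy data: sample 2 is sequenced exactly twice as deep as sample 1.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],  # skipped: zero in one sample
]
factors = size_factors(toy_counts)
print(factors)
```

Dividing each sample's counts by its size factor puts the samples on a common scale, which is why passing already-normalized values to DESeq2 would distort this step.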
Harmony is a good choice. Here's the correct workflow: ```r library(Seurat) library(harmony) # Merge your objects combined