Research

Metagenomics and metatranscriptomics data analysis

Metagenomics and metatranscriptomics are culture-independent approaches that use next-generation sequencing to profile the genomes and transcriptomes of the microbes in a given environment. Proper interpretation of the data allows us to understand the dynamics of the microbial community and the symbiosis between the community and its environment. Metagenomics and metatranscriptomics have important applications in medicine, environmental monitoring, renewable bioenergy, and other fields.

We focus on developing and improving a peptide-centric analysis paradigm that aims to improve both homolog search and de novo assembly. Homolog search identifies the set of homologous reads in a sequencing data set and is used to estimate the relative abundances of the genes or gene families of interest. We have developed the software packages GRASP, GRASPx, and HMM-GRASPx to improve homolog search. These packages implement a newly developed simultaneous alignment and assembly algorithm and achieve ~20%-40% higher sensitivity than BLAST (for single protein sequence search) and HMMER3 (for protein family search). De novo assembly aims to reconstruct complete or near-complete genomic sequences or transcripts of the microbial community. This task remains challenging in terms of both performance and computational efficiency. We plan to develop novel computational methods that further reduce its computational requirements and make the approach applicable to larger data sets.
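As a rough illustration of how homolog search results can feed abundance estimation, the sketch below converts per-read family assignments into length-normalized relative abundances. It is a minimal example under stated assumptions, not the method implemented in GRASP, GRASPx, or HMM-GRASPx: the function name `relative_abundance`, the input layout, and the choice of normalizing by a representative family length are all hypothetical.

```python
from collections import Counter


def relative_abundance(read_hits, family_lengths):
    """Estimate relative abundances of gene families from homolog search hits.

    read_hits: iterable of family names, one entry per read assigned to a family
               (hypothetical input layout).
    family_lengths: dict mapping family name -> representative length in residues.
    Returns a dict mapping family name -> fraction of the length-normalized total.
    """
    counts = Counter(read_hits)
    # Longer genes attract more reads, so divide each count by the family length.
    # A family missing from family_lengths raises KeyError.
    rates = {fam: n / family_lengths[fam] for fam, n in counts.items()}
    total = sum(rates.values())
    if total == 0:
        return {}
    # Rescale so that the abundances sum to 1.
    return {fam: rate / total for fam, rate in rates.items()}


# Toy data: 30 reads assigned to familyA and 10 reads to familyB (made-up values).
hits = ["familyA"] * 30 + ["familyB"] * 10
lengths = {"familyA": 300, "familyB": 200}
abundance = relative_abundance(hits, lengths)
```

Length normalization is only one possible design choice; a real analysis would typically also account for sequencing depth and for reads that match more than one family.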

Non-coding RNA structure

Non-coding RNAs are RNAs that do not code for proteins. They play important roles in biological systems, such as gene regulation. The structure (both secondary and tertiary) of an RNA usually indicates its biological function, and we attempt to understand non-coding RNA functions by analyzing their structures. We are especially interested in the following computational problems. First, identifying RNA structural motifs in a given 3D structure remains challenging, because the search space is huge and different cutoffs have to be applied to account for different resolutions. We have previously developed RNAMotifScan and RNAMotifScanX to search for RNA structural motifs, and an automatic clustering pipeline, RNAMSC, for de novo discovery of RNA structural motif families. We plan to develop a probabilistic representation of RNA 3D structure and to extract RNA 3D structures from such a representation. Second, the biological functions of non-coding RNAs remain poorly understood, especially for those encoded in bacterial genomes. We plan to develop a suite of computational tools to identify, annotate, and archive the non-coding RNA elements in bacterial genomes. Such an analysis pipeline will complement existing metagenomics and metatranscriptomics analyses, which currently focus only on protein-coding regions.

Cancer genomics

Cancer is a disease of the genome. We are interested in analyzing cancer-related next-generation sequencing data, including whole-exome sequencing, whole-genome sequencing, RNA-seq, and ChIP-seq data, to identify potential genetic or genomic changes that may lead to cancer. We also attempt to understand the mechanism of the disease by connecting the identified variations with their corresponding biological functions and interpreting their effects at the biological pathway level. We anticipate that the knowledge discovered will improve future cancer diagnosis and treatment.