Academic Activities





Research Projects

CAREER: Machine Learning Approaches for Genome-wide Biological Network Inference
Project Award Date: 02-22-2007 / Primary Sponsor: NSF

Because of technological limitations, molecular biology research has had to focus on individual genes and gene products. This has led to a wealth of knowledge about individual cellular components and their functions. However, knowledge of isolated cellular components is not sufficient to understand most cellular functions, which are carried out by complex networks. It is therefore imperative to employ network-based approaches to address the complexity of living systems.

Scientists in life-science research must find ways to computationally model and elucidate complex networks from high-throughput biological data sets. This research therefore focuses on developing and applying novel computational methods for reconstructing genome-wide biological networks from high-throughput data.

Researchers will develop machine learning methods for effectively integrating multiple sources of prior knowledge, learning from highly heterogeneous data, and large-scale network learning. Learning with prior knowledge and highly heterogeneous data sources is fundamental to computational biology, information theory, machine learning, data mining, and other


Protein Function Prediction
http://www.ncbi.nlm.nih.gov/pubmed/20855926, University of Kansas

While genome sequencing projects have generated tremendous amounts of protein sequence data for a vast number of genomes, substantial portions of most genomes are still unannotated. Despite the success of experimental methods for identifying protein functions, they are often lab intensive and time consuming. Thus, it is only practical to use in silico methods for the genome-wide functional annotations. In this paper, we propose new features extracted from protein sequence only and machine learning-based methods for computational function prediction. These features are derived from a position-specific scoring matrix, which has shown great potential in other bioinformatics problems. We evaluate these features using four different classifiers and yeast protein data. Our experimental results show that features derived from the position-specific scoring matrix are appropriate for automatic function annotation.


Microarray Analysis
http://www.ittc.ku.edu/chenlab/microarray/, University of Kansas

The microarray analysis system is based on an iterated Hidden Markov Model (iHMM) algorithm. The system is designed to detect novel genes involved in the same cellular function as a set of query genes, that is, a group of genes already known to share that function. The input to the algorithm consists of a gene expression microarray data set and a set of seed genes whose cellular functions are already known. Feature selection first reduces the number of experiments; the machine learning algorithm then selects, from the entire microarray, a new group of genes that resemble the seed genes in function. If multiple similar data sets are used, a final result set of genes can be selected from the individual result sets by a majority-voting scheme.
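
The majority-voting step can be illustrated with a minimal sketch. It is not the system's actual code: the function name, the "more than half" threshold, and the gene identifiers are assumptions chosen for illustration.

```python
from collections import Counter


def majority_vote(result_sets):
    """Keep genes that appear in more than half of the individual result sets.

    The strict-majority threshold is an assumption; the page does not say
    which voting rule the system uses.
    """
    # Count each gene at most once per result set.
    counts = Counter(gene for genes in result_sets for gene in set(genes))
    threshold = len(result_sets) / 2
    return sorted(gene for gene, n in counts.items() if n > threshold)


# Placeholder result sets, one per (hypothetical) microarray data set.
runs = [
    {"YAL001C", "YBR020W", "YCL030C"},
    {"YAL001C", "YBR020W", "YDR050C"},
    {"YAL001C", "YCL030C", "YEL060C"},
]
print(majority_vote(runs))
```

Genes selected in only one of the three runs are dropped, so the final set is less sensitive to noise in any single data set.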


The University of Kansas Proteomics Service (KUPS)
http://www.ittc.ku.edu/chenlab/kups/, University of Kansas

KUPS (The University of Kansas Proteomics Service) provides high-quality protein-protein interaction (PPI) data for researchers developing and evaluating computational models for predicting PPIs. It allows users to construct ready-to-use data sets of interacting protein pairs (IPPs), non-interacting protein pairs (NIPs) and associated features. Multiple filters and options allow the user to control the make-up of the IPPs and NIPs as well as the quality of the resultant data sets. Each data set is built from the overall database, which includes 185,446 IPPs and approximately 1.5 billion NIPs from five primary databases: IntAct, HPRD, MINT, UniProt and the Gene Ontology. The IPP set can be restricted to specific model organisms, interaction types and experimental evidence. The NIP set can be generated using four different strategies, which can alleviate biased estimation problems. Lastly, multiple features can be provided for all of the IPPs and NIPs. Additionally, KUPS provides two benchmark data sets to help researchers compare their algorithms to existing approaches.
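
The four NIP generation strategies are not described on this page. As a rough illustration of the general idea only, the sketch below assumes the simplest possible strategy: randomly pairing proteins and excluding known interacting pairs. The function name and protein identifiers are hypothetical, and enumerating all pairs is practical only for small protein sets.

```python
import random
from itertools import combinations


def sample_nips(proteins, ipps, k, seed=0):
    """Randomly sample up to k protein pairs that are not known to interact.

    This random-pairing rule is an assumed strategy for illustration;
    it is not taken from KUPS itself.
    """
    # Store known interactions as unordered pairs.
    known = {frozenset(pair) for pair in ipps}
    candidates = [
        pair
        for pair in combinations(sorted(proteins), 2)
        if frozenset(pair) not in known
    ]
    rng = random.Random(seed)  # fixed seed for a reproducible data set
    return rng.sample(candidates, min(k, len(candidates)))


proteins = ["P1", "P2", "P3", "P4"]          # placeholder identifiers
ipps = [("P1", "P2"), ("P4", "P3")]          # placeholder interacting pairs
print(sample_nips(proteins, ipps, k=3))
```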


Identifying Interface Residues
http://www.ittc.ku.edu/~xwchen/bindingsite/prediction.htm, University of Kansas

A predictive model for identifying protein interaction sites is proposed. Without using any structure data, the proposed method extracts a wide range of features from protein sequences. A random forest-based integrative model is developed to effectively utilize these features and to deal with the imbalanced data classification problem commonly encountered in binding site predictions. The proposed method is evaluated by using 2829 interface residues and 24,616 non-interface residues extracted from 99 polypeptide chains in the Protein Data Bank. The experimental results show that the proposed method performs significantly better than two other sequence-based predictive methods and can reliably predict residues involved in protein interaction sites. Furthermore, we apply the method to predict interaction sites and to construct three protein complexes: the DnaK molecular chaperone system, 1YUW and 1DKG, which provide new insight into the sequence-function relationship. We show that the predicted interaction sites can be valuable as a first approach for guiding experimental methods investigating protein-protein interactions and localizing the specific interface residues.
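
The class imbalance in the evaluation set (2829 interface versus 24,616 non-interface residues) can be quantified with a little arithmetic. The page does not say how the random forest-based model compensates for it; the sketch below assumes the common inverse-frequency ("balanced") weighting, and the function name is hypothetical.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    This is a widely used heuristic, assumed here for illustration only.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Residue counts quoted in the paragraph above.
counts = {"interface": 2829, "non_interface": 24616}
weights = balanced_class_weights(counts)

ratio = counts["non_interface"] / counts["interface"]
print(f"non-interface : interface = {ratio:.2f} : 1")
for label, weight in weights.items():
    print(f"{label}: weight {weight:.3f}")
```

With these weights each class contributes the same total weight, so a classifier is not rewarded for simply predicting the majority class.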


Feature Selection for Microarray data
Publication1, Publication2, University of Kansas

Microarray data classification is a typical problem of small sample size and high dimensionality. A practical example is microarray-based cancer classification, where the sample size is typically less than 100 and the number of features is several thousand or higher. We propose two novel methods: one uses a minimum reference set (MRS) generated by the nearest neighbor rule, and the other is enhanced recursive feature elimination (eRFE). The MRS is the smallest set of samples that correctly classifies all the training samples. It is related to the structural risk minimization principle and thus leads to good generalization. Whereas RFE simply removes individual 'weak' features, eRFE considers groups of 'weak' features. Evaluation shows that eRFE outperforms the original RFE in terms of classification accuracy on various data sets.
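
To make the idea of a reference set concrete, the sketch below builds a small set of samples that classifies every training sample correctly under the 1-nearest-neighbor rule. It uses Hart's condensed nearest neighbor procedure, a greedy stand-in that yields a consistent subset but not necessarily the minimum one, so it should not be read as the MRS algorithm from the publications. The function names and toy data are assumptions.

```python
import math


def nearest_label(x, store):
    """Label of the stored sample closest to x (1-nearest-neighbor rule)."""
    return min(store, key=lambda s: math.dist(x, s[0]))[1]


def condensed_subset(samples):
    """Greedy consistent subset (Hart's condensed nearest neighbor).

    samples: list of (point, label) pairs with distinct points.
    Samples misclassified by the current subset are added until a
    full pass over the training data adds nothing.
    """
    store = [samples[0]]
    changed = True
    while changed:
        changed = False
        for x, y in samples:
            if (x, y) not in store and nearest_label(x, store) != y:
                store.append((x, y))
                changed = True
    return store


# Toy two-class data (hypothetical values).
samples = [
    ((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((0.1, 0.3), "A"),
    ((1.0, 1.0), "B"), ((0.9, 1.2), "B"), ((1.1, 0.8), "B"),
]
subset = condensed_subset(samples)
print(len(subset), "of", len(samples), "samples kept")
```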


Noise whitening in Microarray data
Publication, University of Kansas

This method removes outlier features that lie outside the Bayesian optimal boundary, whose center is calculated by Fisher's discriminant function.


Optimization on RBF networks
Publication, University of Kansas

Dynamic momentum, which automatically adjusts the parameters of RBF networks, is proposed to produce an optimal RBF network model.


A new learning methodology for SVM and RBF Neural Networks
Master's thesis, Chonnam National University, Yeosu

This thesis suggests a new learning methodology for SVM (Support Vector Machine) and regularization RBF (Radial Basis Function) neural networks that trains the regularization parameter as well as the pattern weights. Among the traditional algorithms for linear discriminant functions (perceptron, relaxation, LMS (least mean squares), pseudoinverse), the thesis shows that the relaxation procedure can obtain the maximum-margin separating hyperplane of a linearly separable pattern classification problem, as the SVM classifier does. In the original SVM method, pattern weights are obtained by solving the corresponding QP (Quadratic Programming) problem. The so-called chunking, working set, and steepest descent/ascent methods are introduced to overcome the storage complexity and time requirements of solving the SVM QP. In this thesis, the KDF (Kernel Discriminant Function) is identified for the first time. As a result, all learning algorithms for linear discriminants can be used directly to train the separating hyperplane in the kernel space. A systematic approach to sequential learning of SVMs and other kernel methods is suggested, so that previous ad hoc methods such as KA (Kernel Adatron) can be analyzed within a unified framework. Finally, by establishing the relationship between regularization RBF neural networks and the bias terms, the thesis suggests that the regularization parameter, which is usually set externally by the user, can be obtained simultaneously within the learning process of the pattern weights. Experimental results show that the new methods perform as well as or better than conventional approaches.
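
The claim that linear discriminant learning algorithms carry over to the kernel space can be illustrated with a small sketch. It applies the perceptron rule, one of the traditional algorithms listed above, to a kernel discriminant with one pattern weight per training sample. It is not the thesis's relaxation procedure and does not learn the regularization parameter; the function names, the RBF kernel width, and the XOR toy data are assumptions.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel; gamma is an assumed width parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_value(x, X, y, alpha, bias, kernel=rbf_kernel):
    """Kernel discriminant: sum_j alpha_j * y_j * k(x_j, x) + bias."""
    return sum(a * yj * kernel(xj, x) for a, xj, yj in zip(alpha, X, y)) + bias


def train_kernel_perceptron(X, y, kernel=rbf_kernel, epochs=100):
    """Sequential (sample-by-sample) perceptron updates in the kernel space."""
    alpha = [0.0] * len(X)  # one pattern weight per training sample
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            if yi * decision_value(xi, X, y, alpha, bias, kernel) <= 0:
                alpha[i] += 1.0  # strengthen the misclassified pattern
                bias += yi
                mistakes += 1
        if mistakes == 0:  # every training pattern is separated
            break
    return alpha, bias


# XOR: not linearly separable in input space, separable with an RBF kernel.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [-1, 1, 1, -1]
alpha, bias = train_kernel_perceptron(X, y)
predictions = [1 if decision_value(x, X, y, alpha, bias) > 0 else -1 for x in X]
print(predictions)
```

Patterns whose weight stays at zero do not contribute to the discriminant, which is the sense in which sequential kernel learning can produce a sparse representation.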


Sparse Learning on Kernel Relaxation
Publication, Chonnam National University, Yeosu

A new learning methodology for kernel methods is suggested that yields a sparse representation of the kernel space from the training patterns for classification problems. Among the traditional algorithms for linear discriminant functions, the study shows that the relaxation procedure can obtain the maximum-margin separating hyperplane of a linearly separable pattern classification problem, as the SVM classifier does. The sufficient condition for identifying the SV patterns, the extended SVM, and the kernel discriminant function are defined.





Publications

Journals
  • Jeong, J. C., X. Lin, et al. (2010). "On Position-specific Scoring Matrix for Protein Function Prediction." IEEE/ACM Trans Comput Biol Bioinform. (link). FUNDING: National Science Foundation (IIS-0644366)

  • Chen, X. W., J. C. Jeong, et al. (2010). "KUPS: constructing datasets of interacting and non-interacting protein pairs with associated attributions." Nucleic Acids Res. (link). FUNDING: National Science Foundation (IIS-0644366)

  • Chen, X. W. and J. C. Jeong (2009). "Sequence-based prediction of protein interaction sites with an integrative method." Bioinformatics 25(5): 585-591. (link). FUNDING: National Science Foundation (IIS-0644366)

  • Yoo, J. H. and J. C. Jeong (2001). "Sparse Representation Learning of Kernel Space Using the Kernel Relaxation Procedure." Journal of Korean Fuzzy Logic and Intelligent Systems 11(9): 817-821. (link).


Conferences
  • Xue-wen Chen, Jong Cheol Jeong: Minimum reference set based feature selection for small sample classifications. ICML 2007: 153-160 (link). FUNDING: U.S. Army Research Office (W911NF-06-1-0351), National Science Foundation (IIS-0644366)

  • Xue-wen Chen, Jong Cheol Jeong: Enhanced recursive feature elimination. ICMLA 2007: 429-435 (link). FUNDING: U.S. Army Research Office (W911NF-06-1-0351)

  • Eun-Mi Kim, Jong Cheol Jeong, Ho-Young Pae, Bae-Ho Lee: A New Feature Selection Method for Improving the Precision of Diagnosing Abnormal Protein Sequences by Support Vector Machine and Vectorization Method. ICANNGA (2) 2007: 364-372 (link). FUNDING: Korea Research Foundation (BK21-NURI)

  • Eun-Mi Kim, Jong Cheol Jeong, Bae-Ho Lee: A New Approach for Finding an Optimal Solution and Regularization by Learning Dynamic Momentum. ICAISC 2006: 29-36 (link). FUNDING: RRC-HECS, CNU (R12-1998-032-08005-0) and NURI-CEIA, CNU.

  • Jae Hung Yoo, Jong Cheol Jeong (2001). Sparse Representation Learning of Kernel Space Using the Kernel Relaxation Procedure. Conference on Korean Fuzzy Logic and Intelligent Systems 2001 Fall: 60-64. (link).





Academic Service

Research Assistant
  • 2006-present, Information and Telecommunication Technology Center (ITTC), University of Kansas

    Research Assistant at the Bioinformatics and Computational Life Sciences Laboratory (BCLSL) in ITTC. Worked with Dr. Xue-wen Chen on the NSF CAREER project. Major projects include developing a new feature selection method (MRS), algorithms for predicting protein-protein interactions, tools for microarray analysis, and the University of Kansas Proteomics Service (KUPS).



  • 1999 - 2002, The Research and Development Center for Facility Automation and Information System (FAIS), Chonnam National University, Yeosu

    Developed an optimized stacking method for import/export containers (stack operation): a kind of expert system for container yard management for internal terminals. Containers are often stacked in the yard for days, awaiting the arrival of a vessel (for export containers) or a trailer truck (for import containers). A major goal of this system is to place each container in the yard so that the required container will always be on top of its stack. (Java & Visual Basic)



  • 2001 - 2002, Chonnam Regional Environmental Technology Development Center, Chonnam National University, Yeosu

    Major project includes the design of fault diagnosis systems for a telemonitoring system (TMS).



  • 2001 - 2002, Technology Development Center for Small and Medium Business, Chonnam National University, Yeosu

    Major project includes real-time impurity detection in high-density polyethylene (HDPE) pellets.





Teaching Assistant


Reviewer


