Data generated by projects such as the Human Genome Mapping Project, and soon by structural genomics projects, provide not only a rich source of information for researchers (see figures below) but also one of the biggest challenges the biological research community has faced in data management, manipulation and information dissemination. The goal of our research, in collaboration with the San Diego Supercomputer Center (SDSC) at the University of California San Diego (UCSD), is two-fold: first, to provide efficient pathways for researchers to learn about, interface with and utilize available data sources that would benefit their research; second, to conduct research that leads to the discovery of new knowledge from that data. This implies new techniques for data management, particularly for data from emerging biotechnologies such as DNA microarrays, as well as domain-specific query languages and algorithms for the comparative analysis of sequence and structure.
Courseware Development Project
The dissemination of new techniques and methods in computational biology is one of the biggest hurdles facing the experimental biology community. We have developed, and continue to develop, on-line courses, together with tools that let content providers deliver on-line courses using both traditional Web access and new forms of multimedia.
Data Access Interfaces
An impediment to data analysis that many researchers currently face is the non-uniform collection of interfaces that exist for data access and manipulation. In sequence analysis, for example, data sources range from internally generated sequences to unpublished sequences available on the Internet halfway around the globe. The tools for manipulating this data are equally varied and may run on a local Macintosh or PC or on a minicomputer. Several Internet sites also provide specialized tools for accessing their data sets (e.g. the BLAST sequence search algorithm at NCBI for interrogating the GenBank database). A goal of our research is to provide the researcher with as uniform and comprehensive an interface as possible to access and manipulate data, transparently using those resources that most efficiently perform the task.
Apoptosis Database
Tools have previously been developed to support data on specific protein families, including structure, sequence, function and relationship to disease. A subset of these tools is now used to maintain and distribute the Protein Data Bank (PDB). This project is developing a resource to support, and subsequently mine, information related to apoptosis.
Other
A complete list of other projects in collaboration with the University of California at San Diego (UCSD) and the San Diego Supercomputer Center (SDSC) can be found at
http://www.sdsc.edu/pb.
Summary
The need for comprehensive and integrated means of dealing with large data sets has never been greater, nor in greater demand by the ordinary researcher. The challenge is to deliver these tools in a timely fashion, keeping pace with the latest advances in data analysis while at the same time providing educational avenues.
Multiple alignment of c-jun 3'UTR oncogene sequences from 5 animal species permits the determination of conserved regions, and generates the consensus sequence seen at the bottom of the figure. Knowledge of conserved regions can be further used in database searching for homologous sequences, prediction of evolutionary distances between the sequences and in structure/function determination.
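As a rough illustration of how a consensus sequence and conserved positions can be read off a multiple alignment, the following minimal Python sketch applies a simple majority rule to each alignment column. The function name `consensus`, the `threshold` parameter and the toy sequences are assumptions made for illustration only; they are not the c-jun data shown in the figure, nor the method used to produce it, and real alignment tools typically use more elaborate scoring.

```python
from collections import Counter


def consensus(aligned, threshold=0.5, gap="-", ambiguous="N"):
    """Return (consensus string, indices of fully conserved columns).

    `aligned` is a list of equal-length, already aligned sequences.
    A column contributes its most common residue to the consensus if
    that residue occurs in at least `threshold` of the sequences;
    otherwise the `ambiguous` symbol is used. A column is reported as
    conserved only if every sequence carries the same residue there.
    """
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    consensus_chars = []
    conserved = []
    for i in range(length):
        column = [seq[i].upper() for seq in aligned]
        counts = Counter(c for c in column if c != gap)  # ignore gaps
        if not counts:
            consensus_chars.append(gap)  # column made up of gaps only
            continue
        residue, n = counts.most_common(1)[0]
        consensus_chars.append(residue if n / len(column) >= threshold else ambiguous)
        if n == len(column):
            conserved.append(i)
    return "".join(consensus_chars), conserved


# Hypothetical toy alignment of five short sequences (not real c-jun data).
toy_alignment = [
    "ACGT-A",
    "ACGTTA",
    "ACCTTA",
    "ACGT-A",
    "TCGTTA",
]

toy_consensus, toy_conserved = consensus(toy_alignment)
```

Runs of adjacent conserved column indices correspond to the conserved regions mentioned in the caption, and the consensus string is the kind of sequence that could then be used as a query in a database search.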
cAMP-dependent protein kinase taken from the Protein Kinase Resource showing the bound PKI inhibitor (light gray ribbon), conserved residues (white spheres), bound ATP (white sticks) and the site of phosphorylation (white arrow). The Molecular Interactive Collaborative Environment (MICE) project seeks to create a gallery of such images that can be shared and form the basis of an interactive collaborative session using only the Internet.