The SPEED database has been designed to be extensible and scalable: as more biological data is processed, the database will be able to accommodate it. Its fully relational structure allows new data types and classes to be incorporated. An example of this is the structure used to house the expression information about each gene as culled from Unigene. Tissue types as used by Unigene are not hierarchically linked; we have created a tissue hierarchy which allows tissues to be related to one another. For example, a gene which shows expression in the cerebellum might not show expression in the medulla; however, a whole-brain tissue mixture will show expression of that gene. Different experiments focus on varying levels of tissue, ranging from whole embryonic mixtures or pools of varying tissues to specific cell types within a tissue. The database allows that information to be queried in a logical way. In addition, most information about tissue of expression is currently culled from Unigene. This information is generally indicative of the presence of gene expression in a tissue, without any information about the level of expression. Similarly, in existing databases, little effort is made to track whether expression has been tested for in a tissue; the absence of expression in specific tissues or cell types is just as important as the presence of expression, and the database makes provisions for recording both the absence of expression and a lack of data collection for a tissue. The highly extensible and open nature of the database will encourage submission of data in a format which will be most useful to a broad range of researchers throughout the field.
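The tissue hierarchy and the distinction between "absent" and "not tested" can be illustrated with a minimal sketch. It is not the actual SPEED schema: the table names (tissue, expression), the column names, the gene label GENE_A, and the toy records are all assumptions made for illustration, and an in-memory SQLite database stands in for MySQL so that the example is self-contained. Here a tissue counts as "present" if the gene is detected in it or in any of its sub-tissues, "absent" if at least one record exists in that subtree and none is positive, and "not tested" if no record exists at all.

```python
import sqlite3

# Hypothetical schema and toy data; SQLite stands in for MySQL here.
tissue_db = sqlite3.connect(":memory:")
tissue_db.executescript("""
CREATE TABLE tissue (
    tissue_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE,
    parent_id INTEGER REFERENCES tissue(tissue_id)  -- NULL for a root tissue
);
CREATE TABLE expression (
    gene      TEXT NOT NULL,
    tissue_id INTEGER NOT NULL REFERENCES tissue(tissue_id),
    detected  INTEGER NOT NULL CHECK (detected IN (0, 1)),  -- 1 = present, 0 = tested and absent
    PRIMARY KEY (gene, tissue_id)
);
INSERT INTO tissue VALUES
    (1, 'whole brain', NULL),
    (2, 'cerebellum', 1),
    (3, 'medulla', 1),
    (4, 'Purkinje cell', 2);
-- GENE_A: detected in cerebellum, tested and not detected in medulla.
INSERT INTO expression VALUES ('GENE_A', 2, 1), ('GENE_A', 3, 0);
""")


def expression_status(db, gene, tissue_name):
    """Return 'present', 'absent', or 'not tested' for a gene in a tissue,
    taking records from the tissue itself and all of its sub-tissues."""
    rows = db.execute(
        """
        WITH RECURSIVE subtree(tissue_id) AS (
            SELECT tissue_id FROM tissue WHERE name = ?
            UNION ALL
            SELECT t.tissue_id
            FROM tissue t JOIN subtree s ON t.parent_id = s.tissue_id
        )
        SELECT e.detected
        FROM expression e JOIN subtree s ON e.tissue_id = s.tissue_id
        WHERE e.gene = ?
        """,
        (tissue_name, gene),
    ).fetchall()
    if not rows:
        return "not tested"  # no record anywhere in the subtree
    return "present" if any(detected for (detected,) in rows) else "absent"


for name in ("whole brain", "cerebellum", "medulla", "Purkinje cell"):
    print(name, "->", expression_status(tissue_db, "GENE_A", name))
```

With these toy records the whole-brain query inherits the positive cerebellum result, the medulla is reported as a true negative, and the Purkinje cell query is reported as untested rather than negative. A production schema would likely need a stricter rule for declaring a parent tissue "absent" when only some of its sub-tissues have been tested.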
Data in the SPEED database is stored in MySQL. However, work on flat-file output of the data is underway. We propose to use the AGAVE XML standard as the main output standard for all flat-file information. In addition, we hope to develop a website which allows for data deposition into an AGAVE XML structure prior to incorporation into the relational database. This will help keep the database extensible as well as help maintain the open nature of the architecture; MySQL is an open-source platform, and AGAVE XML, while developed by several industry leaders, has an open-source community dedicated to its development. This ensures compatibility between the SPEED DB and industry components as the pervasiveness of privately curated genomic sequence grows within the field.
We see a clear need to develop the SPEED DB to incorporate a more focused set of data relating to primates. As researchers move forward with sequencing polymorphism from various human populations, the need for outgroup information at many different levels becomes apparent. While the collection of chimpanzee gene information ensures that polymorphisms can be defined as to their ancestral status, chimpanzees are related to humans closely enough that comparisons of divergence information to polymorphism for the detection of changes in evolutionary mode (for example, utilizing a McDonald-Kreitman test (4)) are often inconclusive. In addition, it would be useful to understand the phylogenetic lineage of specific genes, as opposed to the species lineage, when a gene is suspected of playing a causal role in human disease. For this reason, the next phase of the SPEED database will involve preparing for the annotation of a large number of genes from various primate species, particularly Macaca species. Although we have already developed the database architecture necessary to incorporate any amount of data, several pieces remain to be developed. These include interfaces to expression information gathered using procedures such as the Affymetrix chip, incorporation of reference information for each gene into the DB architecture, and specialized search pages capable of performing searches geared towards the special needs of the primate comparative genomics community. The nature of the SPEED DB architecture allows this next phase of work to be completed without re-creating the database; indeed, existing tables will be used to incorporate most of the data. Importantly, we anticipate needing a physical anthropologist/primatologist who can help shape the way that the data is architected at this level.
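To make the point about inconclusive human-chimpanzee comparisons concrete, the sketch below applies a McDonald-Kreitman style 2x2 comparison (fixed versus polymorphic, nonsynonymous versus synonymous) to a set of invented counts. The counts are hypothetical and chosen only to mimic a gene with very few fixed differences; the table is assessed here with a two-sided Fisher's exact test, which is one common choice and not necessarily the statistic used in (4). The function names are illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def mk_test(fixed_nonsyn, fixed_syn, poly_nonsyn, poly_syn):
    """McDonald-Kreitman style comparison of divergence and polymorphism counts."""
    return fisher_exact_two_sided(fixed_nonsyn, fixed_syn, poly_nonsyn, poly_syn)


# Hypothetical counts: only three fixed differences between two close species.
p_value = mk_test(fixed_nonsyn=2, fixed_syn=1, poly_nonsyn=3, poly_syn=6)
print(f"p = {p_value:.3f}")
```

With so few fixed differences the test has little power, which is the motivation for adding more distant primate outgroups such as Macaca.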
We believe that the nature of biology in the next 15 years will be coupled to and shaped by the way that information is handled. The need for a biologically based system for housing and searching biological data is clear. However, information theory is a mature field which should not be ignored in this endeavor. Indeed, the nature of "data-mining" is usually not clearly understood within the new field of bioinformatics. Many researchers see data-mining as a hypothesis-free endeavor, which would move away from some of the fundamental underpinnings of biology. This is not the case. Information theory notes that hypotheses do indeed exist in the examination of most data; these hypotheses range from very rigid, strong hypotheses to weak hypotheses which might have less clear results. An example of a "strong" hypothesis from a biological standpoint would be as follows: "Genes which are expressed in fewer tissues in humans have a higher rate of Ks divergence between humans and chimpanzees." This is a strong and immediately testable hypothesis, provided the data is available and the database has been designed appropriately. This is an example of the type of question the SPEED database is meant to answer. Another, weaker hypothesis might be as follows: "The rates of evolution of genes should vary widely across the mammalian orders." This is also a testable hypothesis, though because of how the question is framed, the test is less certain. The SPEED DB could be used to gather the information necessary to address this hypothesis. On the other hand, some informatics work is question-driven but not hypothesis-driven: "List all gene orthologs with Ka < 0.5 between human and mouse." This is a simple task.
Even so, questions such as this might lead to hypotheses which need to be followed up by additional primary data collection; if most of the genes listed by the above query were found, for example, to be diverging rapidly between humans and chimpanzees, more data might be collected to see why the discrepancy in rates exists. For this reason, handling a wide range of types of questions and hypotheses is a primary concern for the DB. This type of work necessitates the inclusion of a computer scientist, specializing in information theory and database analysis, in the ongoing work of the SPEED database.
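The two kinds of request discussed above, a question-driven listing and the data needed to test the "strong" hypothesis, might be expressed against a relational store roughly as follows. This is a sketch under stated assumptions: the tables (ortholog_pair, expression), their columns, the gene labels, and every numeric value are invented for illustration and do not describe the actual SPEED schema or real measurements. An in-memory SQLite database again stands in for MySQL.

```python
import sqlite3

# Hypothetical tables and toy values, for illustration only.
ortholog_db = sqlite3.connect(":memory:")
ortholog_db.executescript("""
CREATE TABLE ortholog_pair (
    gene      TEXT NOT NULL,
    species_a TEXT NOT NULL,
    species_b TEXT NOT NULL,
    ka        REAL NOT NULL,   -- nonsynonymous divergence
    ks        REAL NOT NULL,   -- synonymous divergence
    PRIMARY KEY (gene, species_a, species_b)
);
CREATE TABLE expression (
    gene     TEXT NOT NULL,
    tissue   TEXT NOT NULL,
    detected INTEGER NOT NULL CHECK (detected IN (0, 1)),
    PRIMARY KEY (gene, tissue)
);
INSERT INTO ortholog_pair VALUES
    ('GENE_A', 'human', 'mouse',      0.10,  0.60),
    ('GENE_B', 'human', 'mouse',      0.70,  0.90),
    ('GENE_A', 'human', 'chimpanzee', 0.004, 0.010),
    ('GENE_B', 'human', 'chimpanzee', 0.009, 0.020);
INSERT INTO expression VALUES
    ('GENE_A', 'brain', 1), ('GENE_A', 'liver', 1), ('GENE_A', 'kidney', 1),
    ('GENE_B', 'testis', 1), ('GENE_B', 'liver', 0);
""")


def orthologs_below_ka(db, species_a, species_b, max_ka):
    """Question-driven query: list genes whose Ka is below a threshold."""
    rows = db.execute(
        "SELECT gene FROM ortholog_pair "
        "WHERE species_a = ? AND species_b = ? AND ka < ? ORDER BY gene",
        (species_a, species_b, max_ka),
    )
    return [gene for (gene,) in rows]


def breadth_vs_ks(db, species_a, species_b):
    """Hypothesis-driven query: pair each gene's number of expressing tissues
    with its Ks. Genes with no expression records are excluded by the inner
    join, so 'not tested' is never counted as 'zero tissues'."""
    return db.execute(
        """
        SELECT o.gene, SUM(e.detected) AS n_tissues, o.ks
        FROM ortholog_pair o JOIN expression e ON e.gene = o.gene
        WHERE o.species_a = ? AND o.species_b = ?
        GROUP BY o.gene, o.ks
        ORDER BY n_tissues
        """,
        (species_a, species_b),
    ).fetchall()


print(orthologs_below_ka(ortholog_db, "human", "mouse", 0.5))
print(breadth_vs_ks(ortholog_db, "human", "chimpanzee"))
```

The first function answers the listing question directly. The second only assembles (gene, tissue count, Ks) triples; testing the hypothesis itself would require a statistical step on top of this, such as a rank correlation between tissue count and Ks.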
1. R. Grantham, Science 185, 862-864 (1974).
2. W. Makalowski, M. S. Boguski, Proc. Nat. Acad. Sci. 95, 9407-9412 (1998).
3. GCG. (Genetics Computer Group, Madison, WI, 1999).
4. J. H. McDonald, M. Kreitman, Nature 351, 652-654 (1991).