We have developed a new relational database as a prototype for future databases to be used for disease gene discovery, gene annotation and reporting, and identifying genes for future studies in model organisms. The database is called SPEED, short for "Searchable Prototype Experimental Evolutionary Database". It incorporates five layers of information about each gene it contains: expression information (as reported in UniGene), the cytological location of the gene (if available), the ortholog of the gene in each available species within the database, divergence information between species, and functional information as reported by OMIM and the gene's Enzyme Commission (EC) reference number. Tables have also been created to record polymorphism data and functional information about specific changes within or between species, as measured by, for example, Grantham's distance (1) or model organism studies. A web interface has been designed that allows complex queries to be performed even by people with little familiarity with relational databases or SQL. The web-based front end to the database is being developed using HTML, Perl, the Perl DBI for MySQL, PHP, and, when necessary, Java. This should make the information in the database accessible across a wide range of systems and platforms.
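To make the layered relational design concrete, the following is a minimal sketch of how such layers might be laid out and queried. It uses Python's built-in sqlite3 module rather than MySQL so that it runs without a server. All table names, column names, and rows are hypothetical illustrations and are not SPEED's actual schema or data.

```python
import sqlite3

# Hypothetical schema: names are illustrative only, not SPEED's real tables.
SCHEMA = """
CREATE TABLE gene (
    gene_id   INTEGER PRIMARY KEY,
    species   TEXT NOT NULL,
    symbol    TEXT NOT NULL,
    cyto_band TEXT,               -- cytological location, if available
    ec_number TEXT                -- Enzyme Commission number, if any
);
CREATE TABLE expression (
    gene_id INTEGER REFERENCES gene(gene_id),
    tissue  TEXT NOT NULL
);
CREATE TABLE ortholog_pair (
    gene_a     INTEGER REFERENCES gene(gene_id),
    gene_b     INTEGER REFERENCES gene(gene_id),
    divergence REAL               -- some per-pair divergence measure
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database filled with made-up toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
        [
            (1, "human", "GENE_A", "1p36", None),
            (2, "mouse", "Gene_a", None, None),
            (3, "human", "GENE_B", "Xq28", None),
            (4, "mouse", "Gene_b", None, None),
        ],
    )
    conn.executemany(
        "INSERT INTO expression VALUES (?, ?)",
        [(1, "brain"), (3, "brain"), (3, "liver")],
    )
    conn.executemany(
        "INSERT INTO ortholog_pair VALUES (?, ?, ?)",
        [(1, 2, 0.05), (3, 4, 0.20)],
    )
    return conn


def diverged_genes(conn, tissue, min_divergence):
    """Human genes expressed in `tissue` whose mouse ortholog has
    diverged by at least `min_divergence`, most diverged first."""
    sql = """
        SELECT h.symbol, m.symbol, o.divergence
        FROM ortholog_pair AS o
        JOIN gene AS h ON h.gene_id = o.gene_a
        JOIN gene AS m ON m.gene_id = o.gene_b
        JOIN expression AS e ON e.gene_id = h.gene_id
        WHERE h.species = 'human'
          AND m.species = 'mouse'
          AND e.tissue = ?
          AND o.divergence >= ?
        ORDER BY o.divergence DESC
    """
    return conn.execute(sql, (tissue, min_divergence)).fetchall()


conn = build_demo_db()
result = diverged_genes(conn, "brain", 0.1)
```

A single query here combines the expression, orthology, and divergence layers, which is the kind of cross-layer search a web form could assemble on the user's behalf.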
Currently, the database houses information from diverse taxa including humans, chimpanzees, Old World monkeys, mice, rats, carnivores, ruminants, rabbits, and marsupials. The mouse-human orthology information, the backbone of the SPEED database, is taken from build 25 of the NCBI GB-MGD database synteny maps as reported by NCBI (http://www.ncbi.nlm.nih.gov/Homology). Orthology for other species is established by BLAST and hand curation, which removes duplicate gene matches and uses other information (such as gene name, gene function, and published references) to help assure orthology. Reference sequences (RefSeq, NCBI ref) of all genes in the database are used for all analyses. This approach is similar to the method of Makalowski and Boguski (2), but we believe the curation method used here helps to assure strictly orthologous comparisons, at the cost of losing information about poorly studied ortholog sets. Alignments between coding sequences of orthologous genes are performed using a global alignment algorithm and the Framealign program (3), and divergence information is generated using the NewDiverge program (3). Approximately 5000 mouse-human orthologs are housed in the database, along with approximately 1600 rat orthologs to mouse and human genes, 200 genes from Old World monkey species orthologous to human genes, 200 genes from ruminant species, 100 rabbit genes, 100 genes from carnivore species, and 60 genes from marsupial species. We hope that the format of this database serves as a prototype for future databases that incorporate existing gene annotation into a relational structure. This should aid in the creation of user-friendly databases that allow for biologically relevant searches of genetic information.
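The removal of duplicate gene matches described above is done by hand in SPEED. As a rough illustration only, the sketch below shows one way an automated pre-filter could narrow BLAST results before manual review: keep the best-scoring hit per query gene, then set aside any query whose best subject is also claimed by another query. The `Hit` type, the function name, and the use of bit score as the ranking criterion are all assumptions made for this example and do not describe the authors' procedure.

```python
from collections import Counter
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One BLAST match (hypothetical minimal record)."""
    query: str
    subject: str
    bitscore: float


def one_to_one_candidates(
    hits: Iterable[Hit],
) -> Tuple[Dict[str, str], List[str]]:
    """Return (unique, ambiguous).

    unique    maps each query to its best subject when no other query
              shares that subject.
    ambiguous lists queries whose best subject is shared, which would
              be left for hand curation.
    """
    # Keep only the highest-scoring hit for each query gene.
    best: Dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit

    # Count how many queries picked each subject as their best match.
    subject_counts = Counter(h.subject for h in best.values())

    unique = {
        q: h.subject for q, h in best.items() if subject_counts[h.subject] == 1
    }
    ambiguous = sorted(
        q for q, h in best.items() if subject_counts[h.subject] > 1
    )
    return unique, ambiguous


# Toy input with made-up identifiers and scores.
hits = [
    Hit("h1", "m1", 200.0),
    Hit("h1", "m2", 150.0),
    Hit("h2", "m3", 300.0),
    Hit("h3", "m3", 280.0),
]
unique, ambiguous = one_to_one_candidates(hits)
```

In this toy input, `h1` maps cleanly to `m1`, while `h2` and `h3` both point at `m3` and would be flagged for a curator, who could then consult gene names, function, and published references.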
Publications that use the SPEED database should cite: SPEED: a molecular-evolution-based database of mammalian orthologous groups. Eric J. Vallender, Justin E. Paschall, Christine M. Malcom, Bruce T. Lahn, Gerald J. Wyckoff. Bioinformatics 2006; doi: 10.1093/bioinformatics/btl471