Mark Levy began his biochemistry project at a computer. He wanted his high school science project to cover important new ground, so he looked for a piece of genetic material that had not been well researched. His search of public databases found matrix metalloprotease, an enzyme that degrades protein and is important for wound healing in animals, but whose function in plants is unknown.
Levy then searched the Arabidopsis thaliana database for amino-acid domains known from the literature to be present in matrix metalloprotease. Why Arabidopsis? This fast-growing member of the mustard family has the distinction of having a small genome with relatively little repeated DNA. For that reason, it is the subject of an international genome-sequencing project. In December 2000, scientists reported that Arabidopsis thaliana is the first plant to have its genome sequencing completed. Thus, Arabidopsis has become a model — a kind of Rosetta Stone — for more complex crop plants.
Levy's search of the Arabidopsis database for genes that encode matrix metalloprotease found four candidates, and he identified one that no one had written about. Finally, he entered a lab. Working with Eric Beers, associate professor of horticulture at Virginia Tech, Levy isolated and cloned the gene from Arabidopsis, and he won a grand prize in a regional science fair.
"For a high school student to spend just a few hours on a computer and in a lab and clone a targeted gene would not have been possible 15 years ago, when I began as a graduate student," says Beers.
Neither would Beers have been able to do his much more advanced work, studying the role of proteases in programmed cell death, without the Arabidopsis database and today's computational infrastructure — hardware, software, networks, databases, and algorithms.
In return, Beers' work is expanding the potential of the Arabidopsis database. He is using Arabidopsis to study xylem, the layer of tissue that carries water from the roots to the leaves. The work not only advances his understanding of proteases in the xylem but is also creating a tool of potential economic importance to the wood industry.
How can a tiny wildflower help in the study of trees?
In Beers' lab, visiting scholar Cheng Song Zhao created a xylem cDNA library (a collection of all of the genes expressed in Arabidopsis xylem). Zhao is determining partial sequences for hundreds of these cDNAs, selected at random. These sequences, known as expressed sequence tags (ESTs), often provide enough information to identify the selected gene by matching entries in available databases, such as GenBank.
"Our current EST database represents the beginning of a bioinformatics approach to defining the properties that distinguish the xylem of the model plant Arabidopsis from that of the economically important tree species, loblolly pine and poplar, for which similar EST projects have been conducted," says Beers. "This EST database will be an important contribution to our efforts to determine how useful Arabidopsis will be as a model system for studying wood formation and wood quality."
A suite of information technology tools has given the biological sciences the potential to learn more about the processes of life at the most basic level, to learn it faster, and to apply new knowledge effectively to problem solving and discovery. To realize that potential, Virginia Tech has launched the Virginia Bioinformatics Institute.
"In the last two decades, a new generation of scientists with expectations based on computing's increasing capability and speed have entered the biochemistry arenas," says Raymond Dessy, emeritus professor of chemistry at Virginia Tech who taught thousands of scientists to use computers in their labs.
"In the 1980s, DNA material was spotted -- by hand -- onto gels, which were studied under powerful microscopes. Now, robots spot nanoliters of material onto slides, which are analyzed 6,000 at a time," says Cynthia Gibas, a faculty member with the Fralin Biotechnology Center at Virginia Tech who researches the relationships between protein structure and function. This microarray technology permits scientists to examine hundreds of thousands of genes at the same time to see which genes are expressed in a single organism under experimental conditions.
In addition to using computational tools, life scientists are forming new partnerships. For example, a geneticist, a computer scientist, molecular biologists, and statisticians from Virginia Tech and North Carolina State University have teamed up to discover how to help trees resist environmental stresses, such as drought. "The focus is on developing analytical methods for microarray data with increased resolution and greater predictive power," explains Lenwood Heath, associate professor of computer science at Virginia Tech.
As with Beers' work, the researchers are searching for patterns in the data and modeling functional relationships among the genes. In this case, they are analyzing gene expression data from microarray technologies to measure variations among pine trees, investigating the genetic basis for response to plant stress, and developing microarray experiments to address hypotheses about stress response and wood formation in loblolly pine.
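As a rough illustration of what searching expression data for patterns can involve, the sketch below compares hypothetical spot intensities from control and drought-stressed samples and flags genes whose expression changes at least twofold. The gene names, intensity values, and the twofold threshold are assumptions made for illustration. The researchers' own statistical methods are not described in this article and are likely far more sophisticated.

```python
import math


def differentially_expressed(control, treated, threshold=1.0):
    """Return {gene: log2 ratio} for genes with |log2(treated/control)| >= threshold.

    `control` and `treated` map gene names to positive spot intensities.
    A threshold of 1.0 corresponds to a twofold change in either direction.
    """
    flagged = {}
    for gene, c in control.items():
        t = treated.get(gene)
        # Skip genes missing from one sample or with non-positive intensity.
        if t is None or c <= 0 or t <= 0:
            continue
        ratio = math.log2(t / c)
        if abs(ratio) >= threshold:
            flagged[gene] = ratio
    return flagged


# Hypothetical intensities for three made-up genes.
control = {"geneA": 100.0, "geneB": 200.0, "geneC": 50.0}
stressed = {"geneA": 400.0, "geneB": 210.0, "geneC": 12.5}
print(differentially_expressed(control, stressed))
```

A positive value means the gene is more active under stress, and a negative value means it is less active. Real microarray analysis also has to account for replicate measurements and experimental noise, which this sketch ignores.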
Layne Watson, professor of mathematics and computer science, defines bioinformatics as a tool for three levels of problem solving: managing data, analyzing sequences, and inferring function.
The first level, he says, is "just getting the data and having confidence in it. Presently, the experiments which decode or sequence segments of a plant or animal genome are not entirely reproducible. That is, scientists rarely get exact sequences when they process DNA-sequence gel images. Having better algorithms for evaluating gel images will provide more precise results."
Clark Tibbetts, associate director of the new Virginia Bioinformatics Institute (VBI), agrees. "Bioinformatics is most effective when the computational experts are involved in how the data are collected, as well as the design of the analysis," he says. "The VBI will operate as a partner in data creation and use."
"The second level of problem solving for bioinformatics is where much of the work is being done now," Watson says. "You have fragmentary and conflicting data and you must construct and extract the important parts of the genetic sequence. This sequence analysis requires sophisticated discrete algorithms to search for patterns."
Heath explains, "When biologists find a particular genetic sequence, they can look in the database for a match or for something similar, which might have an evolutionary relationship or a functional relationship. That gives the biologists clues that certain functions are performed by certain genes."
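One way to picture the database lookup Heath describes is a local alignment search. The sketch below scores a query against each database entry with the Smith-Waterman algorithm, a standard technique for local sequence alignment, and reports the best-scoring entry. The sequences, entry names, and scoring values are invented for illustration. Production tools such as BLAST rely on faster heuristics and statistically calibrated scores.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def best_match(query, database):
    """Return (name, score) for the database entry most similar to the query."""
    scored = ((name, local_alignment_score(query, seq))
              for name, seq in database.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical database entries and query.
database = {
    "entry1": "ATGGCGTACGTTAGC",
    "entry2": "TTTTTTTTTT",
    "entry3": "ATGGCCTACGTAAGC",
}
print(best_match("GCGTACGTT", database))
```

A high score suggests the query and the entry share a similar stretch of sequence, which is the kind of clue that can point to a shared function or evolutionary origin.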
Plant scientist M.A. Saghai Maroof used DNA similarity between two disease resistance genes from tobacco and Arabidopsis to develop a general technique for the identification and isolation of new disease resistance genes from soybean and other crops. His discoveries in molecular marker technology have advanced plant science, and he and colleagues at Virginia Tech have developed improved varieties of corn and soybeans.
"We want to be able to explore questions regarding the specific functions of certain genes," says Heath. For example, Malcolm Potts, professor of biochemistry, and Richard Helm, professor of wood science, are working on the genomes of several cyanobacteria. These organisms are capable of surviving under extreme environmental stress. The team is identifying the genes central to stabilization in order to engineer this trait into other organisms' tissue and cell types.
Another second-level bioinformatics task is piecing information together. "Chromosomes are so large that experimental work is done by chopping them into pieces small enough to determine DNA sequences," Heath says. "Computational work is required to put the pieces back together -- to join pieces of a few hundred genetic sequences into DNA code made up of millions of sequences."
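The reassembly step Heath describes can be sketched as a greedy overlap merge: repeatedly join the two fragments whose ends overlap the most. This is a toy illustration under strong assumptions (error-free reads, all from the same strand, no repeated DNA, and no fragment contained entirely within another). The fragments are made up, and real assemblers must cope with sequencing errors and repeats.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Merge fragments by largest suffix/prefix overlap until none remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps; leave the pieces separate
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical overlapping fragments of one short sequence.
print(greedy_assemble(["TACGTTAGC", "ATGGCGT", "GCGTACG"]))
```

The greedy strategy is easy to follow but can go wrong when the same stretch of DNA appears more than once, which is one reason a genome with little repeated DNA is easier to sequence.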
"The third level of bioinformatics problem solving is to try to infer function," says Watson. "At this stage the algorithms and mathematics are the most sophisticated, requiring both discrete and continuous algorithms to make the connection between the genetic sequence and the biological function, and to create models that describe and predict interactions at the level of the cell cycle.
"The difference between the second and third level is that at the second level a scientist may observe that disease occurs when a specific gene is missing or flawed. At the third level, we have an explanation of how the missing or flawed gene results in a physiological phenomenon.
"Treatment is still possible at the second level. We may know that a certain substance prevents or eases the symptoms of Parkinson's disease without knowing why, for instance. But a lot more can be done once we know how substances work -- how and why genes are responding," Watson says.
"There's a flow of information at every stage," says John Tyson, University Distinguished Professor of Biology at Virginia Tech who is a pioneer in the mathematical modeling of the behavior of living cells. "In particular, we need computational tools to predict cell physiology from the ‘wiring diagrams' of underlying molecular control systems. What are the equations that describe how interacting proteins control the behavior and misbehavior of cells?”
Pedro Mendes, computational biochemist with the Virginia Bioinformatics Institute, is involved in providing just those computational tools to predict the behavior of living systems. Mendes' program, Gepasi, is being used by many scientists to simulate the dynamics and control of complex metabolic pathways. He is now preparing a new version of the program that will be able to simulate more complete models that join biochemical and gene expression knowledge. Such a tool will be needed to integrate experimental observations into predictive computer models.
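To give a feel for what simulating pathway dynamics means, the sketch below integrates a hypothetical two-step pathway, S to I to P, using simple mass-action kinetics and fixed-step Euler integration. It does not represent Gepasi's method or interface. The species names, rate constants, and step size are assumptions made for illustration.

```python
def simulate_pathway(s0=1.0, k1=1.0, k2=0.5, t_end=20.0, dt=0.01):
    """Euler-integrate S -> I -> P with mass-action rate constants k1 and k2.

    Returns a list of (time, S, I, P) tuples, one per time step.
    """
    s, i, p = s0, 0.0, 0.0
    steps = int(round(t_end / dt))
    trajectory = [(0.0, s, i, p)]
    for n in range(1, steps + 1):
        v1 = k1 * s  # rate of the first reaction, S -> I
        v2 = k2 * i  # rate of the second reaction, I -> P
        s += -v1 * dt
        i += (v1 - v2) * dt
        p += v2 * dt
        trajectory.append((n * dt, s, i, p))
    return trajectory


trajectory = simulate_pathway()
t, s, i, p = trajectory[-1]
print(f"t={t:.1f}  S={s:.4f}  I={i:.4f}  P={p:.4f}")
```

In this model the intermediate I builds up while S is plentiful and then drains away as P accumulates. Dedicated simulators use more accurate numerical solvers and richer enzyme kinetics than this fixed-step approach.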
"Now that whole genomes are completely sequenced, a new wave of biochemistry experiments needs to be done to identify the functions of all the genes," says Mendes. “This will happen at high-throughput and it will require development of a whole new area of bioinformatics applied to biochemical pathways."
For the massive amounts of computer power required, Virginia Tech and the VBI have purchased a Beowulf parallel supercomputer.
"But it won't necessarily require expensive supercomputers for all computational tasks," says Gibas, who came to Virginia Tech in 1999 from the National Center for Super Computing Applications. "A lot of bioinformatics is not just having brute force computing power. It's having the intelligence to design a system for looking at your data in a way that is of interest to a biologist." Networked clusters of mid-sized computers can accomplish a lot, she says, if common data standards and transfer protocols are established so scientists in different research centers can build upon each others' progress.
"The Virginia Bioinformatics Institute's close association with the university is important," says Tibbetts. "Bioinformatics needs to be used not only to support the expansion of knowledge, but to enable new and unanticipated discovery. It's a philosophy that is easy to lose sight of in such an applied science.
"The advantage of Virginia Tech is we embrace the generation of information rich data streams as well as management and analysis. We apply computation to biology, and we allow biological systems to inspire computation," he says. "What we are learning about biological systems turns out to have significant advantages in approaching complex computational problems.
"Biologically-inspired models can be applied to the computational process," Tibbetts explains. "Results might be artificial neural networks, genetic algorithms, or analog optical pattern recognition."
Tyson sees universities creating the knowledge that the private sector applies. "Economic development has always followed scientific advancement," he says. "You never know, when you're out there on the fringes of science, where the next great discovery is going to be. But in the long run, it has always paid off more than you have put into it."