I participated in an interesting conference in Japan last week, "The Fusion Between Policy Science and Information and Communication Technology," which brought together social scientists and computer scientists. I particularly enjoyed a talk by John Zysman and the panel discussion on whether and how advanced computing can help policy (more on those topics later, perhaps).
I was asked to speak on "Scientific Impact of Grid Computing." I enclose my talk below.
Scientific Impact of Grid Computing
The subject of my remarks today is the impact of grid computing on science. I will first provide some context, reviewing how the nature of science is changing as a result of (among other things) technological developments. I will then explain the relevance of Grid technologies to these developments and review experiences to date with the application of those technologies. Finally, I will talk about how the advent of service-oriented approaches promises (in my view) to transform many aspects of scientific research.
First, context. We are talking today about grid and science for a reason, and that reason is the sustained exponential change in technology that has over the past 50 years been producing ever more data, enabling ever more computing, and connecting us all ever more closely.
The consequence of these developments is not only quantitative but also qualitative changes in how we tackle some of the most challenging and urgent scientific problems of our age, from climate change to disease. Increasingly, research involves the analysis of large quantities of data, large-scale numerical simulation, and intensive and interdisciplinary collaboration. The technologies that we used previously to store, transmit, process, and communicate data—workstations, DAT tapes, Fedex, even scientific journals, some would argue—are no longer as effective as they were.
We also see the emergence of new research methodologies and organizational structures, as evidenced by this image of the collaboration that is building the Large Hadron Collider at CERN. In this project, which is not so different in broad strokes from those that sequenced the human genome or that managed the response to SARS, we have different overlapping groups of varying sizes, some sharing data, some competing, all ultimately contributing to the solution of the problem at hand.
These developments have many profound implications for research methodologies, education, resource allocations, and so forth. In particular, they demand information technology infrastructures, and Grid is part of this emerging new technology landscape.
I should also note, just as an aside—but an important aside—that as science becomes more information intensive, so the importance of computer science increases. Astronomer George Djorgovski goes so far as to claim that, “applied computer science is now playing the role that mathematics did from the 17th through the 20th centuries: providing an orderly, formal framework and exploratory apparatus for other sciences.” As a computer scientist, that message appeals to me!
In this context, then, Grid has come to play a valuable role as a unifying concept and technology for applications that require the federation of resources (computers, storage, data, people, etc.). Why the name “grid”? Having bought a new rice cooker, we simply plug it in: the power grid obviates the need to also buy and install a new electrical generator. By analogy, information technologists refer to “the grid” when talking about on-demand computing.
Like its namesake, a grid is a mix of technology, infrastructure, and standards. The technology is software that allows resource providers to federate computers, storage, data, networks, and other resources, and resource consumers to harness those federated resources when needed. We can categorize this software as “system-level” (software that implements common management interfaces to underlying resources, such as the open source Globus software that I have been involved in developing) and “user-level” (such as the Ninf software from Japan). Together, these software components bridge the gap from applications to resources.
The infrastructure comprises the physical resources and services that must be maintained and operated for this resource federation and access to occur. Examples of services include registries and certificate authorities.
Finally, the standards codify the messages that must be exchanged, and the policies that must be followed, to achieve those goals.
Together, this technology, infrastructure, and standards allow us to bridge the otherwise substantial gap between applications and the physical computers, storage systems, networks, and other devices that those applications need to operate.
Let us move on now to how Grid technology is being used in science. Today, this use is primarily directed at enabling, as I have indicated, on-demand access to computing, storage, and other devices. For example, the U.S. Network for Earthquake Engineering Simulation (NEES) implements service interfaces that allow for remote access to, and monitoring and control of, experimental apparatus for earthquake engineering, as well as simulation codes and data archives. NEES has been used to conduct distributed hybrid experiments, in which components of a large structure are tested via a mix of numerical simulation and physical simulation at different sites, a technique pioneered in Japan.
The Earth System Grid provides access to large climate model datasets such as those produced for the Intergovernmental Panel on Climate Change assessment. The substantial impact of this service on the climate research community is indicated by its large number of users, the volume of data downloaded, and the number of resulting research articles.
The TeraGrid is the premier example of “cyberinfrastructure,” to use a term popular in the U.S. TeraGrid links supercomputers and storage systems at eight sites with an extremely fast network, and deploys standard Grid software across these resources so that scientists can obtain large amounts of computation and storage when required to support their science.
By thus standardizing on interfaces and policies, TeraGrid seeks to transform its diverse sites and computers into interchangeable providers of computing power. An application (for example, a medical data analysis application) can then acquire needed computing, storage, and network capacity to achieve its scientific objectives.
Increasingly, TeraGrid is being viewed as a system that does not simply provide computing resources for individual scientists, but also hosts services for communities. This emerging new role is significant, and I believe it will result in a considerably greater impact on the scientific community.
For example, PUMA is an information system that provides access to data products computed by integrating genomic and proteomic data. To its several thousand users, it is simply a Web site. However, behind the scenes, PUMA is making extensive use of TeraGrid and other Grid infrastructures to perform its data integration. Indeed, PUMA code routinely runs on 1000 processors when integrating new data.
As these examples show, Grid as a technology for on-demand access to computing is already widely deployed, and is having a significant impact on science in numerous fields. Nevertheless, I believe that these successes are only a first step towards a far greater impact on science. This leads me to the third part of this talk, in which I discuss what I see as the next major thrust for grid computing and for science as a whole.
In traditional approaches to research, communication among researchers occurs primarily via publication in peer-reviewed journals. Information technology may play a role as a tool during the research process, but does not change the nature of this communication process.
What I call service-oriented science adds a new modality of communication, namely the creation of computational services—that is, network-accessible programs that implement a convenient interface and that provide access to data and/or computational capabilities.
Such services allow for new research methodologies, as follows. Someone publishes a service: for example, PUMA, which, as I described earlier, provides access to derived data products—or, perhaps, to an enhanced PUMA that allows its clients to supply their own genomic data to be integrated with that maintained by PUMA.
Another researcher discovers that service and uses it in their research. In a first instance, they may simply query PUMA from their Web browser. However, as they get more ambitious, they may also compose calls to PUMA with calls to other services (for example, a service for computing metabolic pathways) in what we call a workflow. In this way, they can scale up dramatically the number of questions that they can ask and get answered. This automation of data analysis tasks is an important consequence of service-oriented science.
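To make the composition idea concrete, here is a minimal sketch of such a workflow in Python. The two service functions are invented stand-ins, not the real PUMA or pathway interfaces; the point is only the pattern of calling one service and feeding its results to another.

```python
def query_puma(gene_id):
    """Stand-in for a PUMA query: return annotations for a gene.
    The gene IDs and annotations here are invented for illustration."""
    annotations = {"geneA": ["kinase"], "geneB": ["transporter"]}
    return annotations.get(gene_id, [])

def query_pathways(annotation):
    """Stand-in for a hypothetical metabolic-pathway service."""
    pathways = {"kinase": ["signal transduction"],
                "transporter": ["ion transport"]}
    return pathways.get(annotation, [])

def workflow(gene_ids):
    """Compose the two services: gene -> annotations -> pathways."""
    results = {}
    for gene in gene_ids:
        results[gene] = [p for a in query_puma(gene)
                           for p in query_pathways(a)]
    return results

print(workflow(["geneA", "geneB"]))
```

In a real deployment, each function body would be replaced by a network call to the published service; the workflow logic itself, which is what gets published as a new service, would be unchanged.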
Even more interesting is what can happen next. The researcher may decide that this workflow that they have developed captures a broadly useful analysis technique, and decide to publish that workflow as a new service that may itself be discovered and called by others. Thus we may achieve a virtuous circle of innovation.
The astronomers have been pioneers in the adoption of service-oriented science techniques. If you are not familiar with what they are doing, I encourage you to study it: it is very impressive.
So-called virtual observatories are providing on-line access to digital sky surveys at different wavelengths, thus allowing astronomers to ask sophisticated questions from the comfort of their desks: for example, what objects are visible in the infrared but not the optical? (The answer to this query can identify candidate brown dwarfs, a class of substellar object identified only recently.) What makes this sort of question possible is that different archives in different countries support the same service interfaces and furthermore publish information about their content into standardized registries.
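Logically, that infrared-but-not-optical query is a set difference over the catalogs returned by two archives. The toy sketch below uses invented object IDs and in-memory sets in place of the remote archives; a real virtual observatory would issue the same logical query against registries and archives that share standard service interfaces.

```python
# Stand-ins for the object catalogs returned by two survey archives;
# the IDs are invented for illustration.
infrared_catalog = {"obj1", "obj2", "obj3", "obj4"}
optical_catalog = {"obj1", "obj3"}

# Candidate brown dwarfs: objects detected in the infrared survey only.
candidates = infrared_catalog - optical_catalog
print(sorted(candidates))
```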
I need to emphasize that while service-oriented science has tremendous potential, there are obstacles to achieving the virtuous circle of innovation that I mentioned earlier. These obstacles include not only technical concerns (how do we create, publish, register, and discover services?) but also methodological and policy issues. I mention three such issues here; I am sure that you can think of others.
First, by reducing barriers to accessing and using data and computational procedures, we can significantly accelerate the research process, which in turn can allow researchers to ask more questions and thus, we may hope, be more innovative. This is not in itself a problem, but does require new ways of thinking about research.
Second, as data and procedures are made available as services, they become “results” in a similar manner to data published in scientific journals: that is, scientific conclusions based on data and assumptions, and on which others may build further research. But how can those others know whether to trust the data or procedures on which they build? How do they document their assumptions? We need mechanisms for evaluating quality and documenting provenance. Otherwise we will just construct a house of cards.
Third, there is the question of how we motivate people to contribute and run services. Reward systems need to change so that researchers who do a good job of constructing services get recognized and promoted. We also need to train people to create services: arguably, we need a new class of “data scientists” expert in these issues. Finally—and here is where we get back to Grid—we need substantial new infrastructure to host services. Let me explain why.
Here is a somewhat simplistic view of a virtual observatory. Let us assume that we have configured what is now a rather small digital sky survey, the Sloan, some 10 terabytes in size, to run on our small local server. Initially, we and our users are delighted: astronomers around the world can use their Web browsers to retrieve data about individual astronomical objects. However, we soon find that astronomers are writing programs that ask more complex questions, involving perhaps tens of thousands of objects. And then the number of people asking questions increases. Suddenly we need many, many computers to meet demand, and that is not something that our small group is set up to handle.
Such issues point to a new role for the traditional supercomputer center, as a hoster of services. I will illustrate how this can work by describing a service we have constructed at Chicago, in collaboration with some astronomers. The problem we have addressed is that of stacking images from different areas of the sky, something one does to improve signal to noise ratios when looking at, for example, quasars. One may want to access tens of thousands of cutouts from different areas in the sky, which is both a data-intensive and a computation-intensive task.
We have built a service to perform this function that runs on the TeraGrid. This service is constructed to acquire and release resources dynamically as load varies, thus allowing it to provide good response times regardless of load. To give an idea of the revolutionary impact such technologies can have, we are able to perform in 3 minutes a stacking that previously took a postdoc 3 months. The need for such services is going to explode in the coming years, as data volumes increase, the analyses performed on that data become more sophisticated, and users become more comfortable with service-oriented approaches to science.
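The essence of acquiring and releasing resources as load varies can be sketched as a simple scaling policy: grow the worker pool when the request queue lengthens, shrink it when demand falls. The thresholds, doubling policy, and pool limits below are invented for illustration and are not how our TeraGrid service is actually configured.

```python
def adjust_workers(queue_length, workers, min_workers=1, max_workers=64):
    """Return the new worker count for an observed request-queue length."""
    if queue_length > 2 * workers and workers < max_workers:
        return min(workers * 2, max_workers)   # acquire more resources
    if queue_length < workers // 2 and workers > min_workers:
        return max(workers // 2, min_workers)  # release idle resources
    return workers                             # hold steady

# Simulate a burst of demand followed by a lull.
workers = 4
for queue_length in [20, 40, 10, 1]:
    workers = adjust_workers(queue_length, workers)
    print(queue_length, workers)
```

A production service would layer this kind of policy over a resource manager that actually provisions and returns nodes, but the control loop is the same: observe load, compare against capacity, adjust.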
To summarize, I have addressed three issues in my talk. First, the broader context, which is the impact of technological exponentials on scientific methodologies and organizations, and the consequent need for new information technology. Second, the important role that Grid plays as a unifying concept and technology for applications that require the federation of distributed resources, and the successes that have been achieved in using Grid technologies to enable on-demand access to computers, storage, data, and other resources. Third, the significance of the transition that we are currently seeing to service-oriented science, which I think has profound implications for what it means to be creative, to communicate scientific results, and to build infrastructure for science.