I participated on Thursday in a panel at the HPC Users Conference, run by the U.S. "Council on Competitiveness." I spoke on how the U.S. national laboratories can partner with companies in a mutually beneficial way. The panel reinforced for me some important points that I think need to be more broadly appreciated:
The two industrial participants in the panel were interesting. Terry Talley talked about how Acxiom is using Grid computing internally. The CTO from DreamWorks talked about the amount of computing involved in modern animated features: 10,000,000 CPU hours for Shrek 1, 15,000,000 CPU hours for Shrek 2, and so on. He also talked about how they are using DOE supercomputers in an exploration of interactive (instead of overnight) rendering. So even the most advanced users can imagine using computers in far more powerful ways.
Charlie Catlett writes about plans to deploy attribute-based authorization on TeraGrid. It is neat to see people working to make national-scale authentication and authorization work.
I participated in an interesting conference in Japan last week, "The Fusion Between Policy Science and Information and Communication Technology," that brought together social scientists and computer scientists. I particularly enjoyed a talk by John Zysman and the panel discussion on whether and how advanced computing can help policy (more on those topics later, perhaps).
I was asked to speak on "Scientific Impact of Grid Computing." I enclose my talk below.
Scientific Impact of Grid Computing
The subject of my remarks today is the impact of grid computing on science. I will first provide some context, reviewing how the nature of science is changing as a result of (among other things) technological developments. I will then explain the relevance of Grid technologies to these developments and review experiences to date with the application of those technologies. Finally, I will talk about how the advent of service-oriented approaches promises (in my view) to transform many aspects of scientific research in the future.
First, context. We are talking today about grid and science for a reason, and that reason is the sustained exponential change in technology that has over the past 50 years been producing ever more data, enabling ever more computing, and connecting us all ever more closely.
The consequence of these developments is not only quantitative but also qualitative changes in how we tackle some of the most challenging and urgent scientific problems of our age, from climate change to disease. Increasingly, research involves the analysis of large quantities of data, large-scale numerical simulation, and intensive and interdisciplinary collaboration. The technologies that we used previously to store, transmit, process, and communicate data—workstations, DAT tapes, Fedex, even scientific journals, some would argue—are no longer as effective as they were.
We also see the emergence of new research methodologies and organizational structures, as evidenced by this image of the collaboration that is building the Large Hadron Collider at CERN. In this project, which is not so different in broad strokes from those that sequenced the human genome or that managed the response to SARS, we have different overlapping groups of varying sizes, some sharing data, some competing, all ultimately contributing to the solution of the problem at hand.
These developments have many profound implications for research methodologies, education, resource allocations, and so forth. In particular, they demand information technology infrastructures, and Grid is part of this emerging new technology landscape.
I should also note, just as an aside—but an important aside—that as science becomes more information intensive, so the importance of computer science increases. Astronomer George Djorgovski goes so far as to claim that, “applied computer science is now playing the role that mathematics did from the 17th through the 20th centuries: providing an orderly, formal framework and exploratory apparatus for other sciences.” As a computer scientist, that message appeals to me!
In this context, then, Grid has come to play a valuable role as a unifying concept and technology for applications that require the federation of resources (computers, storage, data, people, etc.). Why the name “grid”? Having bought a new rice cooker, we simply plug it in: the power grid obviates the need to also buy and install a new electrical generator. By analogy, information technologists refer to “the grid” when talking about on-demand computing.
Like its namesake, a grid is a mix of technology, infrastructure, and standards. The technology is software that allows resource providers to federate computers, storage, data, networks, and other resources, and resource consumers to harness those federated resources when needed. We can categorize this software as “system-level” (software that implements common management interfaces to underlying resources, such as the open source Globus software that I have been involved in developing) and “user-level” (such as the Ninf software from Japan). Together, this software bridges the gap from applications to resources.
The infrastructure comprises the physical resources and services that must be maintained and operated for this resource federation and access to occur. Examples of services include registries and certificate authorities.
Finally, the standards codify the messages that must be exchanged, and the policies that must be followed, to achieve those goals.
Together, this technology, infrastructure, and standards allow us to bridge the otherwise substantial gap between applications and the physical computers, storage systems, networks, and other devices that those applications need to operate.
Let us move on now to how Grid technology is being used in science. Today, this use is primarily directed at enabling, as I have indicated, on-demand access to computing, storage, and other devices. For example, the U.S. Network for Earthquake Engineering Simulation (NEES) implements service interfaces that allow for remote access to, and monitoring and control of, experimental apparatus for earthquake engineering as well as simulation codes and data archives. NEES has been used to conduct distributed hybrid experiments, in which components of a large structure are tested via a mix of numerical simulation and physical simulation at different sites. This is a technique pioneered in
The Earth System Grid provides access to large climate model datasets such as those produced by the Intergovernmental Panel on Climate Change assessment. The substantial impact of this service on the climate research community is indicated by the large number of users, the volume of data downloaded, and the number of resulting research articles.
The TeraGrid is the premier example of U.S. “cyberinfrastructure,” to use a term popular there. TeraGrid links supercomputers and storage systems at eight sites with an extremely fast network, and deploys standard Grid software across these resources so that scientists can obtain large amounts of computation and storage when required to support their science.
By thus standardizing on interfaces and policies, TeraGrid seeks to transform its diverse sites and computers into interchangeable providers of computing power. An application (for example, a medical data analysis application) can then acquire needed computing, storage, and network capacity to achieve its scientific objectives.
Increasingly, TeraGrid is being viewed as a system that does not simply provide computing resources for individual scientists, but also hosts services for communities. This emerging new role is significant and I believe will result in a considerably greater impact on the scientific community.
For example, PUMA is an information system that provides access to data computed by integrating genomic and proteomic data. To its several thousand users, it is simply a Web site. However, behind the scenes, PUMA is making extensive use of TeraGrid and other Grid infrastructures to perform its data integration. Indeed, PUMA code routinely runs on 1000 processors when integrating new data.
As these examples show, Grid as a technology for on-demand access to computing is already widely deployed, and is having a significant impact on science in numerous fields. Nevertheless, I believe that these successes are only a first step towards a far greater impact on science. This leads me to the third part of this talk, in which I discuss what I see as the next major thrust for grid computing and for science as a whole.
In traditional approaches to research, communication among researchers occurs primarily via publication in peer-reviewed journals. Information technology may play a role as a tool during the research process, but does not change the nature of this communication process.
What I call service-oriented science adds a new modality of communication, namely the creation of computational services—that is, network-accessible programs that implement a convenient interface and that provide access to data and/or computational capabilities.
Such services allow for new research methodologies, as follows. Someone publishes a service: for example, PUMA, which, as I described earlier, provides access to derived data products—or, perhaps, to an enhanced PUMA that allows its clients to supply their own genomic data to be integrated with that maintained by PUMA.
Another researcher discovers that service and uses it in their research. In the first instance, they may simply query PUMA from their Web browser. However, as they get more ambitious, they may also compose calls to PUMA with calls to other services (for example, a service for computing metabolic pathways) in what we call a workflow. In this way, they can scale up dramatically the number of questions that they can ask and get answered. This automation of data analysis tasks is an important consequence of service-oriented science.
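To make the idea concrete, the sketch below simulates such a two-step workflow in Python. The functions `query_puma` and `compute_pathways`, and all the data they return, are invented stand-ins for the real PUMA and pathway services, which in practice would be invoked over the network:

```python
# Sketch of a two-step workflow composing hypothetical services.
# In a real deployment each function would be a network call to a
# remote service; here they are stubs so the control flow is visible.

def query_puma(gene_id):
    """Stand-in for a call to PUMA: return annotations for a gene."""
    return {"gene": gene_id, "enzymes": ["EC 2.7.1.1", "EC 5.3.1.9"]}

def compute_pathways(enzymes):
    """Stand-in for a pathway service: map enzymes to pathways."""
    known = {"EC 2.7.1.1": "glycolysis", "EC 5.3.1.9": "glycolysis"}
    return sorted({known[e] for e in enzymes if e in known})

def workflow(gene_ids):
    """Compose the two services over many genes: the scaling-up step."""
    results = {}
    for gene in gene_ids:
        annotations = query_puma(gene)
        results[gene] = compute_pathways(annotations["enzymes"])
    return results

print(workflow(["geneA", "geneB"]))
```

The point is not the toy biology but the structure: once both services expose convenient interfaces, the loop over thousands of genes is a few lines of script rather than months of manual queries.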
Even more interesting is what can happen next. The researcher may decide that this workflow that they have developed captures a broadly useful analysis technique, and decide to publish that workflow as a new service that may itself be discovered and called by others. Thus we may achieve a virtuous circle of innovation.
The astronomers have been pioneers in the adoption of service-oriented science techniques. If you are not familiar with what they are doing, I encourage you to study it: it is very impressive.
So-called virtual observatories are providing on-line access to digital sky surveys at different wavelengths, thus allowing astronomers to ask sophisticated questions from the comfort of their desks: for example, what objects are visible in the infrared but not the optical? (The answer to this query can identify candidate brown dwarfs, a class of star identified only recently.) What makes this sort of question possible is that different archives in different countries support the same service interfaces and furthermore publish information about their content into standardized registries.
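A toy sketch of such a query, with invented object identifiers standing in for real survey catalogs; an actual virtual observatory answers the question through standard service interfaces and registries rather than in-memory sets:

```python
# Sketch of the "infrared but not optical" query, assuming each survey
# exposes its detections as a set of object identifiers. The catalogs
# below are invented toy data for illustration only.

infrared_survey = {"obj-101", "obj-102", "obj-207", "obj-305"}
optical_survey = {"obj-101", "obj-305", "obj-412"}

# Objects detected in the infrared with no optical counterpart:
# candidate brown dwarfs (or other cool, faint objects).
candidates = sorted(infrared_survey - optical_survey)
print(candidates)  # ['obj-102', 'obj-207']
```

What makes the real version work is exactly what the set difference hides: the two archives agree on service interfaces and on how objects are cross-identified, so a program can pose the question without knowing how each archive stores its data.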
I need to emphasize that while service-oriented science has tremendous potential, there are obstacles to achieving the virtuous circle of innovation that I mentioned earlier. These obstacles include not only technical concerns (how do we create, publish, register, discover services) but also methodological and policy issues. I mention three such issues here; I am sure that you can think of others.
First, by reducing barriers to accessing and using data and computational procedures, we can significantly accelerate the research process, which in turn can allow researchers to ask more questions and thus, we may hope, be more innovative. This is not in itself a problem, but does require new ways of thinking about research.
Second, as data and procedures are made available as services, they become “results” in a similar manner to data published in scientific journals: that is, scientific conclusions based on data and assumptions, and on which others may build further research. But how can those others know whether to trust the data or procedures on which they build? How do they document their assumptions? We need mechanisms for evaluating quality and documenting provenance. Otherwise we will just construct a house of cards.
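One possible shape for such a provenance record, sketched in Python; the field names are invented for illustration and do not follow any particular standard:

```python
# Sketch of a minimal provenance record attached to a derived data
# product, so that downstream users can judge whether to trust it
# before building further research on top of it.

from datetime import date

def make_provenance(inputs, procedure, version, assumptions):
    return {
        "inputs": list(inputs),            # datasets this result was built from
        "procedure": procedure,            # service/workflow that produced it
        "version": version,                # exact code version, for reproducibility
        "assumptions": list(assumptions),  # assumptions others should check
        "created": date.today().isoformat(),
    }

record = make_provenance(
    inputs=["survey-release-3"],
    procedure="stacking-service",
    version="1.4.2",
    assumptions=["sources are point-like", "background is uniform"],
)
print(record["procedure"], record["version"])
```

However the details are standardized, the essential content is the same: what went in, what transformed it, and what was assumed along the way.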
Third, there is the question of how we motivate people to contribute and run services. Reward systems need to change so that researchers who do a good job of constructing services get recognized and promoted. We also need to train people to create services: arguably, we need a new class of “data scientists” expert in these issues. Finally—and here is where we get back to Grid—we need substantial new infrastructure to host services. Let me explain why.
Here is a somewhat simplistic view of a virtual observatory. Let us assume that we have configured what is now a rather small digital sky survey, the Sloan, some 10 terabytes in size, to run on our small local server. Initially, we and our users are delighted: astronomers around the world can use their Web browsers to retrieve data about individual astronomical objects. However, we soon find that astronomers are writing programs that ask more complex questions, involving perhaps tens of thousands of objects. And then the number of people asking questions increases. Suddenly we need many, many computers to meet demand, and that is not something that our small group is set up to handle.
Such issues point to a new role for the traditional supercomputer center, as a hoster of services. I will illustrate how this can work by describing a service we have constructed at Chicago, in collaboration with some astronomers. The problem we have addressed is that of stacking images from different areas of the sky, something one does to improve signal to noise ratios when looking at, for example, quasars. One may want to access tens of thousands of cutouts from different areas in the sky, which is both a data-intensive and a computation-intensive task.
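The essence of stacking can be sketched in a few lines: averaging N aligned cutouts of the same sky position suppresses uncorrelated noise by roughly the square root of N, which is why it helps when individual sources are too faint to see. The tiny 2x2 "images" below are invented; a real stacking service operates on large pixel arrays drawn from the survey archive:

```python
# Sketch of image stacking: pixel-wise mean of equally sized cutouts.
# Averaging aligned exposures of the same position preserves the signal
# while averaging down the uncorrelated noise.

def stack(cutouts):
    """Pixel-wise mean of equally sized cutouts (lists of pixel rows)."""
    n = len(cutouts)
    rows, cols = len(cutouts[0]), len(cutouts[0][0])
    return [[sum(c[r][col] for c in cutouts) / n for col in range(cols)]
            for r in range(rows)]

cutouts = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 2.0], [1.0, 4.0]],
]
print(stack(cutouts))  # [[2.0, 2.0], [2.0, 4.0]]
```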
We have built a service to perform this function that runs on the TeraGrid. This service is constructed to acquire and release resources dynamically as load varies, thus allowing it to provide good response times regardless of load. To give an idea of the revolutionary impact such technologies can have, we are able to perform in 3 minutes a stacking that previously took a postdoc 3 months. The need for such services is going to explode in the coming years, as data volumes increase, the analyses performed on that data become more sophisticated, and users become more comfortable with service-oriented approaches to science.
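A minimal sketch of the acquire-and-release policy just described; the threshold and the notion of a worker "pool" are assumptions for illustration, not the actual TeraGrid mechanism:

```python
# Sketch of dynamic resource acquisition: size the worker pool to the
# current queue depth, growing under load and shrinking when idle.
# The per-worker capacity and minimum pool size are invented numbers.

def rescale(queued_requests, per_worker=10, min_workers=1):
    """Return the worker count needed for the current queue depth."""
    needed = -(-queued_requests // per_worker)  # ceiling division
    return max(min_workers, needed)

pool = 1
for load in [5, 120, 400, 30, 0]:  # simulated request bursts
    pool = rescale(load)
    print(f"queue={load:3d} -> workers={pool}")
```

A real service layers much more on top (provisioning delays, job scheduling, data staging), but the core behavior is this feedback loop between observed load and resources held.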
To summarize, I have addressed three issues in my talk. First, the broader context, which is the impact of technological exponentials on scientific methodologies and organizations, and the consequent need for new information technology. Second, the important role that Grid plays as a unifying concept and technology for applications that require the federation of distributed resources, and the successes that have been achieved in using Grid technologies to enable on-demand access to computers, storage, data, and other resources. Third, the significance of the transition that we are currently seeing to service-oriented science, which I think has profound implications for what it means to be creative, to communicate scientific results, and to build infrastructure for science.
Some technology news that will concern only Web Services enthusiasts. But good news, nonetheless. In short: we may be nearing the end of the odyssey that started back in 2001 when we released the Web Services specification for managing state called Open Grid Services Infrastructure (OGSI).
Like Ulysses, we didn't plan on an odyssey: our ambitions with OGSI were to define basic mechanisms as a first step towards more interesting work. However, first some people didn't like our aggressive use of WSDL 2.0 features (in retrospect a mistake, as WSDL 2.0 still isn't widely supported), which spurred the definition of the WS Resource Framework (WSRF). Then industry politics led to the competing WS-Transfer specifications.
But finally sanity seems to have prevailed. Microsoft, IBM, and HP just released the new WS-ResourceTransfer (WS-RT) specification, bringing WSRF's WS-ResourceProperties functionality into the WS-Transfer universe. This specification seems to provide all of the WS-ResourceProperties functionality used in Globus Toolkit version 4 (GT4): in particular, GetResourceProperty, GetMultipleResourceProperties, and QueryResourceProperties functionality. It also seems straightforward to integrate notification, which will be done in a future WS-EventNotification spec. There is even a Create operation, included in OGSI but not in WSRF.
In summary, WS-RT seems to provide what we need in Globus, and in a manner consistent with WSRF/WSRP. Assuming WS-EventNotification does the obvious things, then going from WSRF to these new specifications should be fairly straightforward.
I don't imagine that the Globus community will rush to adopt these specifications, but I imagine that we will want to implement them in the not too distant future, so that people who want to work with them can do so.
I just read "The Report of the American Council of Learned Societies Commission on Cyberinfrastructure for Humanities and Social Sciences." (Quite a mouthful.) As the report says:
Science and engineering have made great strides in using information technology to understand and shape the world around us. This report is focused on how these same technologies could help advance the study and interpretation of the vastly more messy and idiosyncratic realm of human experience.
This is a fascinating and compelling ambition and vision. However, while I enjoyed reading the report, I thought it could have said much more about how to achieve that goal.
One new insight (probably obvious to most others) that I gained from the report was the extent to which, in contrast to most science and engineering (species diversity is perhaps an exception, as is astronomy, with its large amateur community), the humanities need cyberinfrastructure not simply to enable innovative research approaches, but also to preserve, and provide access to, the human cultural record.
Much of the report is concerned with the latter topic. It makes a strong case for investment in the creation and maintenance of collections, and for openness in access and standards. It is hard to disagree with these conclusions. On the other hand, there is little consideration given to how to prioritize such work given scarce resources--a question that presumably should depend in part on what are viewed as research priorities.
The Commission's charge included these questions:
What are the "grand challenge" problems for the humanities and social sciences in the coming decade? Are they tractable to computation?
The answers to these questions seem critical to the future not only of the humanities and social sciences but also (if we believe that the humanities and social sciences are relevant to society) of humanity. Unfortunately, we do not find these answers in this report. Nor do we learn which aspects of cyberinfrastructure, and which investigative approaches, are most likely to be useful.
The report does make some interesting remarks on the wide variety of methods that may be applicable:
The activity of discovering and interpreting patterns in large collections of digital information is often called data-mining (or sometimes, when it is confined to text, text-mining), but data-mining is only one investigative method, or class of methods, that will become more useful in the humanities and the social sciences as we bring greater computing power to bear on larger and larger collections, and more complex research questions, often with outcomes in areas other than that for which the data was originally collected. Beyond data mining, there are many other ways of animating and exploring the integrated cultural record. They include simulations that reverse-engineer historical events to understand what caused them and how things might have turned out differently; game-play that allows us to tinker with the creation and reception of works of art; role-playing in social situations with autonomous agents, or using virtual worlds to understand behavior in the real world.
A broad and exciting list. But in the absence of defined research priorities for the humanities and social sciences, and an understanding of where those prioritized research tasks can benefit from computation, we can't even start to discuss which of these techniques are most important to pursue.