I participated on Thursday in a panel at the HPC Users Conference, run by the U.S. "Council on Competitiveness." I spoke on how the U.S. national laboratories can partner with companies in a mutually beneficial way. The panel reinforced for me some important points that I think need to be more broadly appreciated:
High-performance computing (HPC) is increasingly central to competitiveness, not just in traditional areas like aerospace and automotive, but also in new areas like corporate data mining and consumer product design. (Amusing success story: Procter & Gamble used a supercomputer to study the airflow over its Pringles potato chips to help stop them from fluttering off the company's assembly lines.)
Successful lab-industry partnerships can be about far more than access to supercomputers--they can involve codevelopment of advanced software systems. For example, Terry Talley spoke about how Acxiom (they have your credit card data, if you live in the U.S.) had worked for four years with the PVFS team at Argonne.
The two industrial participants in the panel were interesting. Terry Talley talked about how Acxiom is using Grid computing internally. The CTO from DreamWorks talked about the amount of computing involved in modern animated features: 10,000,000 CPU hours for Shrek 1, 15,000,000 CPU hours for Shrek 2, and so on. He also talked about how they are using DOE supercomputers in an exploration of interactive (instead of overnight) rendering. So even the most advanced users can imagine using computers in far more powerful ways.
I was asked to speak on "Scientific Impact of Grid Computing." I enclose my talk below.
Scientific Impact of Grid Computing
The subject of my remarks today is the impact of grid
computing on science. I will first provide some context, reviewing how the
nature of science is changing as a result of (among other things) technological
developments. I will then explain the relevance of Grid technologies to these
developments and review experiences to date with the application of those
technologies, and finally I will talk about how the advent of service-oriented
approaches promises (in my view) to transform many aspects of scientific
research in the future.
First, context. We are talking today about grid and
science for a reason, and that reason is the sustained exponential change in
technology that has over the past 50 years been producing ever more data,
enabling ever more computing, and connecting us all ever more closely.
The consequence of these developments is not only
quantitative but also qualitative changes in how we tackle some of the most
challenging and urgent scientific problems of our age, from climate change to
disease. Increasingly, research involves the analysis of large quantities of
data, large-scale numerical simulation, and intensive and interdisciplinary
collaboration. The technologies that we used previously to store, transmit, process,
and communicate data—workstations, DAT tapes, FedEx, even scientific journals,
some would argue—are no longer as effective as they were.
We also see the emergence of new research methodologies
and organizational structures, as evidenced by this image of the collaboration
that is building the Large Hadron Collider at CERN. In this project, which is
not so different in broad strokes from those that sequenced the human genome or
that managed the response to SARS, we have different overlapping groups of
varying sizes, some sharing data, some competing, all ultimately contributing
to the solution of the problem at hand.
These developments have many profound implications for
research methodologies, education, resource allocations, and so forth. In
particular, they demand information technology infrastructures, and Grid is
part of this emerging new technology landscape.
In this context, then, Grid has come to play a valuable
role as a unifying concept and technology for applications that require the
federation of resources (computers, storage, data, people, etc.). Why the name
“grid”? Having bought a new rice cooker, we simply plug it in: the power grid
obviates the need to also buy and install a new electrical generator. By
analogy, information technologists refer to “the grid” when talking about an infrastructure that delivers computing and data on demand, without each user having to install and operate their own resources.
Like its namesake, a grid is a mix of technology,
infrastructure, and standards. The technology is software that allows
resource providers to federate computers, storage, data, networks, and other
resources, and for resource consumers to harness those federated resources when
needed. We can categorize this software as “system-level” (software that
implements common management interfaces to underlying resources, such as the
open source Globus software that I have been involved in developing) and
user-level (such as the Ninf software from Japan). Together, these software layers bridge the gap between applications and resources.
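To make this layering concrete, here is a deliberately simplified Python sketch. It is not the actual Globus or Ninf API; the class and function names are invented for illustration. The system-level layer exposes a common management interface at each resource provider; the user-level layer hides site selection from the application.

```python
# Hypothetical illustration of the system-level / user-level split described above.
# Nothing here is the real Globus or Ninf interface.

class SystemLevelResource:
    """Stand-in for a provider's common management interface (e.g., job submission)."""
    def __init__(self, name, free_cpus):
        self.name = name
        self.free_cpus = free_cpus

    def submit(self, executable, cpus):
        # A real system-level layer would also authenticate the user,
        # stage input files, and hand the job to a local scheduler.
        print(f"[{self.name}] running {executable} on {cpus} CPUs")
        return f"{self.name}-job-001"

def run_anywhere(sites, executable, cpus):
    """User-level layer: picks a federated site with enough capacity."""
    for site in sites:
        if site.free_cpus >= cpus:
            return site.submit(executable, cpus)
    raise RuntimeError("no federated site has enough free CPUs")

sites = [SystemLevelResource("site-A", 64), SystemLevelResource("site-B", 512)]
job_id = run_anywhere(sites, "climate_model", cpus=256)
```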
The infrastructure comprises the physical resources
and services that must be maintained and operated for this resource federation
and access to occur. Examples of such services include registries and certificate authorities.
Finally, the standards codify the messages that
must be exchanged, and the policies that must be followed, to achieve those ends.
Together, this technology, infrastructure, and standards
allow us to bridge the otherwise substantial gap between applications and the
physical computers, storage systems, networks, and other devices that those
applications need to operate.
Let us move on now to how Grid technology is being used
in science. Today, this use is primarily directed at enabling, as I have
indicated, on-demand access to computing, storage, and other devices. For
example, the U.S. Network for Earthquake Engineering Simulation (NEES) implements
service interfaces that allow for remote access to, and monitoring and control
of, experimental apparatus for earthquake engineering as well as simulation
codes and data archives. NEES has been used to conduct distributed hybrid
experiments, in which components of a large structure are tested via a mix of
numerical simulation and physical simulation at different sites. This is a
technique pioneered in Japan,
by the way. NEES is transforming the nature of earthquake engineering research
in the U.S.
The Earth System Grid provides access to large climate
model datasets such as those produced for the Intergovernmental Panel on Climate Change (IPCC) assessment. The substantial impact of this service on the climate research community is indicated by its large number of users, the volume of data downloaded, and the number of resulting research articles.
The TeraGrid is the premier U.S. “cyberinfrastructure,” to
use a term popular in the U.S. TeraGrid links supercomputers and storage
systems at eight sites with an extremely fast network, and deploys standard
Grid software across these resources so that scientists can obtain large
amounts of computation and storage when required to support their science.
By thus standardizing on interfaces and policies, TeraGrid
seeks to transform its diverse sites and computers into interchangeable
providers of computing power. An application (for example, a medical data
analysis application) can then acquire needed computing, storage, and network
capacity to achieve its scientific objectives.
Increasingly, TeraGrid is being viewed as a system that does
not simply provide computing resources for individual scientists, but also hosts
services for communities. This emerging new role is significant and I believe
will result in a considerably greater impact on the scientific community.
For example, PUMA is an information system that provides
access to data computed by integrating genomic and proteomic data. To its
several thousand users, it is simply a Web site. However, behind the scenes,
PUMA is making extensive use of TeraGrid and other Grid infrastructures to
perform its data integration. Indeed, PUMA code routinely runs on 1000 processors when integrating new data.
As these examples show, Grid as a technology for on-demand
access to computing is already widely deployed, and is having a significant
impact on science in numerous fields. Nevertheless, I believe that these
successes are only a first step towards a far greater impact on science. This
leads me to the third part of this talk, in which I discuss what I see as the
next major thrust for grid computing and for science as a whole.
In traditional approaches to research, communication among
researchers occurs primarily via publication in peer-reviewed journals. Information
technology may play a role as a tool during the research process, but does not
change the nature of this communication process.
What I call service-oriented science adds a new modality
of communication, namely the creation of computational services—that is,
network-accessible programs that implement a convenient interface and that
provide access to data and/or computational capabilities.
Such services allow for new research methodologies, as
follows. Someone publishes a service: for example, PUMA, which, as I described
earlier, provides access to derived data products—or, perhaps, to an enhanced
PUMA that allows its clients to supply their own genomic data to be integrated
with that maintained by PUMA.
Another researcher discovers that service and uses it in
their research. In the first instance, they may simply query PUMA from their Web
browser. However, as they get more ambitious, they may also compose calls to
PUMA with calls to other services (for example, a service for computing
metabolic pathways) in what we call a workflow. In this way, they can scale up
dramatically the number of questions that they can ask and get answered. This
automation of data analysis tasks is an important consequence of service-oriented science.
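As a concrete, if hypothetical, illustration of such a workflow: the endpoints and message formats below are invented, not PUMA's actual interface. The point is only how calls to one service can be composed with calls to another so that many questions can be asked and answered automatically.

```python
# Sketch of a two-step workflow composing calls to two services.
# The URLs and JSON shapes are hypothetical placeholders.
import json
from urllib import request

PUMA_URL = "https://example.org/puma/annotate"        # hypothetical endpoint
PATHWAY_URL = "https://example.org/pathways/lookup"   # hypothetical endpoint

def call_service(url, payload):
    """POST a JSON request to a service and return its JSON response."""
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

def annotate_and_map(gene_ids):
    """Workflow: annotate each gene, then map its proteins onto metabolic pathways."""
    results = {}
    for gene in gene_ids:
        annotation = call_service(PUMA_URL, {"gene": gene})
        pathways = call_service(PATHWAY_URL, {"proteins": annotation.get("proteins", [])})
        results[gene] = pathways
    return results
```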
Even more interesting is what can happen next. The
researcher may decide that this workflow that they have developed captures a
broadly useful analysis technique, and decide to publish that workflow as a new
service that may itself be discovered and called by others. Thus we may achieve
a virtuous circle of innovation.
The astronomers have been pioneers in the adoption of
service-oriented science techniques. If you are not familiar with what they are
doing, I encourage you to study it: it is very impressive.
So-called virtual observatories are providing on-line
access to digital sky surveys at different wavelengths, thus allowing
astronomers to ask sophisticated questions from the comfort of their desks: for
example, what objects are visible in the infrared but not the optical? (The answer to this query can identify candidate brown dwarfs, a class of object identified only recently.) What makes this sort of question possible is that
different archives in different countries support the same service interfaces
and furthermore publish information about their content into standardized registries.
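To illustrate the kind of query this enables, here is a toy Python sketch. The catalogs are invented lists of (ra, dec) positions and the box match is crude; a real virtual-observatory query would go through the archives' standard service interfaces and proper positional cross-matching.

```python
# Toy version of the "infrared but not optical" query described above.

def cross_match(source, catalog, tolerance_deg=0.001):
    """True if `source` lies within `tolerance_deg` of any catalog entry (crude box match)."""
    ra, dec = source
    return any(abs(ra - r) <= tolerance_deg and abs(dec - d) <= tolerance_deg
               for r, d in catalog)

# Hypothetical positions (degrees) returned by two survey services.
infrared_sources = [(150.001, 2.200), (150.500, 2.310), (151.020, 1.950)]
optical_sources  = [(150.001, 2.200), (151.020, 1.950)]

# Candidate brown dwarfs: detected in the infrared catalog but not the optical one.
candidates = [s for s in infrared_sources if not cross_match(s, optical_sources)]
print(candidates)   # [(150.5, 2.31)]
```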
I need to emphasize that while service-oriented science
has tremendous potential, there are obstacles to achieving the virtuous circle
of innovation that I mentioned earlier. These obstacles include not only
technical concerns (how do we create, publish, register, and discover services?) but
also methodological and policy issues. I mention three such issues here; I am
sure that you can think of others.
First, by reducing barriers to accessing and using data
and computational procedures, we can significantly accelerate the research
process, which in turn can allow researchers to ask more questions and thus, we
may hope, be more innovative. This is not in itself a problem, but does require
new ways of thinking about research.
Second, as data and procedures are made available as
services, they become “results” in a similar manner to data published in
scientific journals: that is, scientific conclusions based on data and
assumptions, and on which others may build further research. But how can those
others know whether to trust the data or procedures on which they build? How do
they document their assumptions? We need mechanisms for evaluating quality and
documenting provenance. Otherwise we will just construct a house of cards.
Third, there is the question of how we motivate people to
contribute and run services. Reward systems need to change so that researchers
who do a good job of constructing services get recognized and promoted. We also
need to train people to create services: arguably, we need a new class of “data
scientists” expert in these issues. Finally—and here is where we get back to
Grid—we need substantial new infrastructure to host services. Let me explain.
Here is a somewhat simplistic view of a virtual
observatory. Let us assume that we have configured what is now a rather small digital sky survey, the Sloan, some 10 terabytes in size, to be served from our small local server. Initially, we and our users are delighted: astronomers around the
world can use their Web browsers to retrieve data about individual astronomical
objects. However, we soon find that astronomers are writing programs that ask
more complex questions, involving perhaps tens of thousands of objects. And
then the number of people asking questions increases. Suddenly we need many, many computers to meet demand, and that is not something that our small group
is set up to handle.
Such issues point to a new role for the traditional
supercomputer center, as a host for services. I will illustrate how this can
work by describing a service we have constructed at Chicago, in collaboration with some
astronomers. The problem we have addressed is that of stacking images from
different areas of the sky, something one does to improve signal to noise
ratios when looking at, for example, quasars. One may want to access tens of
thousands of cutouts from different areas in the sky, which is both a
data-intensive and a computation-intensive task.
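The core idea, illustrated with synthetic data below, is that co-adding N aligned cutouts suppresses the noise by roughly the square root of N, so a source far too faint to see in any single image emerges in the stack. This is a sketch only, not the service's actual code.

```python
# Stacking demonstration with synthetic cutouts.
import numpy as np

rng = np.random.default_rng(0)
n_cutouts, size = 10_000, 16

# A faint source at the centre of every cutout, well below the per-image noise.
signal = np.zeros((size, size))
signal[8, 8] = 0.05

cutouts = signal + rng.normal(0.0, 1.0, size=(n_cutouts, size, size))
stacked = cutouts.mean(axis=0)           # noise std shrinks ~ 1/sqrt(n_cutouts)

background = np.delete(stacked.ravel(), 8 * size + 8)   # stacked pixels excluding the source
print("per-image peak SNR:", signal[8, 8] / 1.0)                 # ~0.05: invisible
print("stacked   peak SNR:", stacked[8, 8] / background.std())   # ~5: clearly detected
```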
We have built a service to perform this function that runs
on the TeraGrid. This service is constructed to acquire and release resources
dynamically as load varies, thus allowing it to provide good response times
regardless of load. To give an idea of the revolutionary impact such
technologies can have, we are able to perform in 3 minutes a stacking that
previously took a postdoc 3 months. The need for such services is going to
explode in the coming years, as data volumes increase, the analyses performed
on that data become more sophisticated, and users become more comfortable with
service-oriented approaches to science.
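The details of resource acquisition are specific to the hosting platform, but the control loop is simple in outline. The sketch below is a hedged illustration of that "acquire and release as load varies" logic, not the actual TeraGrid service code; `provision` and `release` stand in for whatever mechanism the hosting infrastructure offers.

```python
# Illustrative elastic-scaling control loop for a hosted analysis service.

def plan_workers(queued_requests, requests_per_worker=50, min_workers=1, max_workers=128):
    """How many workers the service should run for the current queue length."""
    needed = max(min_workers, -(-queued_requests // requests_per_worker))  # ceiling division
    return min(max_workers, needed)

def rebalance(queued_requests, current_workers, provision, release):
    """Acquire or release workers so capacity tracks the current load."""
    target = plan_workers(queued_requests)
    if target > current_workers:
        provision(target - current_workers)
    elif target < current_workers:
        release(current_workers - target)
    return target

# Example: a burst of 5,000 stacking requests arrives while only 4 workers are running.
workers = rebalance(
    5000, 4,
    provision=lambda n: print(f"acquire {n} compute nodes"),
    release=lambda n: print(f"release {n} compute nodes"),
)
print("workers now:", workers)
```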
To summarize, I have addressed three issues in my talk.
First, the broader context, which is the impact of technological exponentials
on scientific methodologies and organizations, and the consequent need for new
information technology. Second, the important role that Grid plays as a unifying concept and technology for applications that require the federation of
distributed resources, and the successes that have been achieved in using Grid
technologies to enable on-demand access to computers, storage, data, and other
resources. Third, the significance of the transition that we are currently
seeing to service-oriented science, which I think has profound implications for
what it means to be creative, to communicate scientific results, and to build
infrastructure for science.
Some technology news that will concern only Web Services enthusiasts. But good news, nonetheless. In short: we may be nearing the end of the odyssey that started back in 2001 when we released the Web Services specification for managing state called Open Grid Services Infrastructure (OGSI).
Like Ulysses, we didn't plan on an odyssey: our ambitions with OGSI were to define basic mechanisms as a first step towards more interesting work. However, first some people didn't like our aggressive use of WSDL 2.0 features (in retrospect a mistake, as WSDL 2.0 still isn't widely supported), which spurred the definition of the WS-Resource Framework (WSRF). Then industry politics led to the competing WS-Transfer family of specifications.
But finally sanity seems to have prevailed. Microsoft, IBM, and HP just released the new WS-ResourceTransfer (WS-RT) specification, bringing WSRF's WS-ResourceProperties functionality into the WS-Transfer universe. This specification seems to provide all of the WS-ResourceProperties functionality used in Globus Toolkit version 4 (GT4): in particular, GetResourceProperty, GetMultipleResourceProperties, and QueryResourceProperties. It also seems straightforward to integrate notification, which will be done in a future WS-EventNotification spec. There is even a Create operation, which was included in OGSI but not in WSRF.
In summary, WS-RT seems to provide what we need in Globus, and in a manner consistent with WSRF and WS-ResourceProperties. Assuming WS-EventNotification does the obvious things, then moving from WSRF to these new specifications should be fairly straightforward.
I don't imagine that the Globus community will rush to adopt these specifications, but I imagine that we will want to implement them in the not too distant future, so that people who want to work with them can do so.
Science and engineering have made great strides in using information technology to understand and shape the world around us. This report is focused on how these same technologies could help advance the study and interpretation of the vastly more messy and idiosyncratic realm of human experience.
This is a fascinating and compelling ambition and vision. However, while I enjoyed reading the report, I thought it could have said much more about how to achieve that goal.
One new insight (probably obvious to most others) that I gained from the report was the extent to which the humanities, in contrast to most science and engineering (species diversity may be an exception, as may astronomy, given its large amateur community), need cyberinfrastructure not simply to enable innovative research approaches, but also to preserve, and provide access to, the human cultural record.
Much of the report is concerned with the latter topic. It makes a strong case for investment in the creation and maintenance of collections, and for openness in access and standards. It is hard to disagree with these conclusions. On the other hand, there is little consideration given to how to prioritize such work given scarce resources--a question that presumably should depend in part on what are viewed as research priorities.
The Commission's charge included these questions:
What are the "grand challenge" problems for the humanities and social sciences in the coming decade? Are they tractable to computation?
The answers to these questions seem critical to the future not only of the humanities and social sciences but also (if we believe that the humanities and social sciences are relevant to society) of humanity. Unfortunately, we do not find these answers in this report. Nor do we learn which aspects of cyberinfrastructure, and which investigative approaches, are most likely to be useful.
The report does make some interesting remarks on the wide variety of methods that may be applicable:
The activity of discovering and interpreting patterns in large collections of digital information is often called data-mining (or sometimes, when it is confined to text, text-mining), but data-mining is only one investigative method, or class of methods, that will become more useful in the humanities and the social sciences as we bring greater computing power to bear on larger and larger collections, and more complex research questions, often with outcomes in areas other than that for which the data was originally collected. Beyond data mining, there are many other ways of animating and exploring the integrated cultural record. They include simulations that reverse-engineer historical events to understand what caused them and how things might have turned out differently; game-play that allows us to tinker with the creation and reception of works of art; role-playing in social situations with autonomous agents, or using virtual worlds to understand behavior in the real world.
A broad and exciting list. But in the absence of defined research priorities for the humanities and social sciences, and an understanding of where those prioritized research tasks can benefit from computation, we can't even start to discuss which of these techniques are most important to pursue.