September 30, 2006

Mapping in Time and Place

I had an interesting conversation today with Michael Buckland about the importance of mapping historical cultural data to time and place. Most documents refer to place names, which may be ambiguous (e.g., country names come and go, town names change or are reused), and refer to time in similarly ambiguous ways (e.g., "last year", "during the summer", "when I was 10", "after the war"). If such references can be disambiguated, then it becomes possible to see connections that might not otherwise be visible.

Michael Buckland directs the Electronic Cultural Atlas Initiative (ECAI), an international project to develop and distribute digital data on historical and archaeological resources. To this end, they are working to "create digital maps that display a wide range of cultural material by using place and time as a common element."

Apparently current Geographical Information System (GIS) tools just don't deal with time in an adequate way. One exception is the University of Sydney's TimeMap system, which ECAI uses.

I've always loved maps, and we are seeing from recent innovations such as Google Maps just how powerful it can be to enable easy mapping of diverse data to geographical space. But I had never thought about the temporal dimension.

September 29, 2006

New Zealand Gets Wired

Having grown up in New Zealand, I am delighted that the country finally has a high-speed research and education network, the Kiwi Advanced Research and Education Network (KAREN). Officially launched on August 31, this network links all of the major research institutions via a 10 Gbit/sec backbone.

The creation of a decent research infrastructure for New Zealand has taken a while. It's always going to be a challenge linking a country in which just 4 million people are spread over a fairly large area. However, while New Zealand has long had a high penetration of Internet technologies, things have been made worse by a lack of investment in research over the past 20 years, and by policies that have encouraged competition rather than cooperation among research universities and laboratories. Fortunately, these policies seem to be changing.

I've been thinking about these things since 2004, when I visited New Zealand and gave a series of talks to people involved in planning research infrastructure. I quoted Woody Allen: "80% of success is showing up", and pointed out that while the world is shrinking rapidly, it is not doing so uniformly. I noted that in 2004, I could send 1 terabyte (1 trillion bytes) to Geneva from Chicago in 20 minutes, but it took me four hours to download 1 megabyte (1 million bytes) from Chicago to Wellington. This difference reflects what we might call the dirty underside of exponentials: if network speeds are doubling every nine months, then a mere 10-year lag in network deployment means you are roughly 10,000x slower than the competition. And in a world where one's ability to compete depends on access to information and colleagues, that difference can be fatal. Thus it's exciting to see that New Zealand has caught up--at least for a while.
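For anyone who wants to check the 10,000x figure, here is a minimal back-of-the-envelope sketch. It assumes only what is stated above (speeds double every nine months); the function name `lag_factor` is purely illustrative.

```python
def lag_factor(lag_years: float, doubling_months: float = 9.0) -> float:
    """Return how many times faster the leader is after `lag_years` of lag,
    assuming network speed doubles every `doubling_months` months."""
    doublings = lag_years * 12.0 / doubling_months
    return 2.0 ** doublings


factor = lag_factor(10)
# Ten years is about 13.3 doublings, i.e. a factor of roughly 10,000.
print(f"A 10-year lag means about {factor:,.0f}x slower")
```

The same function shows how sensitive the result is to the doubling period: with a doubling time of 18 months instead of 9, the exponent is halved, so the gap shrinks from about 10,000x to about 100x.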

I also spoke during that visit of the limiting effect of what I termed "PC Science," i.e., science scaled to fit on one's personal computer. Such limited approaches constrain the questions asked and the answers obtained. They can also (I fear) limit one's ability to enlist the best students, who are looking for things that are exciting and cutting edge. Fortunately, once you have high-speed networks, it becomes far more feasible to link users with clusters, supercomputers, databases, and collections of PCs to provide access to powerful computational capabilities. Thus I am also pleased to see my alma mater, the University of Canterbury, acquire a powerful supercomputer.

September 26, 2006

The Many Faces of IT as Service

While trying to define Grid may well be a hopeless task, it is certainly useful and feasible to talk about the different elements of the service-oriented ecosystem. That is what Steve Tuecke and I did in a recent article, "Describing the Elephant: The Different Faces of IT as Service."

The introduction to this article explains what it is about:

In a well-known story, a group of blind men are asked to describe an elephant. Each encounters a different part of the animal, and not surprisingly provides a different description.

We see a similar degree of confusion in the IT industry today, as terms like service-oriented architecture, Grid, utility computing, on-demand, adaptive enterprise, data center automation, and virtualization are bandied about. As when listening to the blind men, it can be hard to know what reality lies behind the words, whether and how the different pieces fit together, and what we should be doing about the animal(s) that are being described. (Of course, in the case of the blind men, we did not also have marketing departments in the mix!)

Our goal in this article is to shed some light on these matters and provide, in effect, a description of the elephant. More specifically, we describe what we see as a major technology trend that is driving many related efforts, namely the transformation from vertically integrated silos to horizontally integrated, service-oriented systems. We explain how various popular terms relate to this overarching trend, and describe the technology required to realize this transformation.

As does the summary:

We have argued that SOA, grid, on-demand, utility computing, software as service, and other related terms all represent different perspectives on the same overall goal—namely, the restructuring of enterprise IT as a horizontally integrated, service-oriented architecture. If successfully realized, that goal will see in-house, third-party, and outsourced applications all operating in a uniform environment, with on-demand provisioning of both in-house and outsourced hardware resources—and also, of course, high degrees of security, monitoring, auditing, and management.

This Holy Grail of open, standards-based, autonomically managed software and dynamically provisioned hardware has certainly not yet been achieved. That does not mean, however, that enterprises cannot start today to create horizontally integrated, service-oriented infrastructures. Solid Web services products allow for the creation of service-oriented applications. Mature commercial and open source virtualization and workload management products and open source grid infrastructure software provide what is needed to create horizontally integrated infrastructure to sit behind those applications. Integration remains more of an exercise for the customers (or their services vendors) than is desirable, but that situation should change as independent software vendors start to grid-enable their products. Meanwhile, progress on further standards is accelerating as experience is gained with deployments and pressure builds from end users for interoperable solutions.

September 24, 2006

$60M per Year for Scientific Discovery through Advanced Computing

The U.S. Department of Energy's Office of Science announced on September 7th its awards for the next phase of its "Scientific Discovery through Advanced Computing" (SciDAC) program. This is the major DOE program that funds research in computational science and tools, and is by several measures the most significant program in the world focused on high end computing for science.

This new program will spend $60M per year over the next three to five years on "projects aimed at accelerating research in designing new materials, developing future energy sources, studying global climate change, improving environmental cleanup methods and understanding physics from the tiniest particles to the massive explosions of supernovae." These projects will make use of amazing new computational facilities at Argonne, Oak Ridge, and Lawrence Berkeley National Laboratories, capable of computational rates of 100s of teraflop/s. The scientific goals of these projects are truly remarkable in their ambitions and implications: it's well worth browsing the list to see what they are up to. It's also interesting to see where SciDAC researchers are located (see figure).

[Figure: locations of SciDAC researchers]

SciDAC emphasizes numerical simulation and supercomputers, but there is clearly also a growing recognition of the importance of linking both supercomputers and experimental facilities with the communities of scientists that must ultimately make sense of the petabytes of data produced by simulations and experiments. Thus, SciDAC-2 includes three projects focused on distributed data:

My colleagues and I in the Computation Institute at Argonne National Laboratory and the University of Chicago are involved in all three of these projects.

It's sobering to see that DOE funded only 30 out of 240 proposals. Given the exceptional quality of the people and ideas in many of the 210 proposals that were not funded, one is left keenly aware of the tremendous potential that remains untapped. Let's hope those ideas can be supported by other programs.

September 23, 2006

What is the Grid, anyway?

I was recently asked to provide a definition of "Grid" for the layman. I wrote a piece a while back on "what is the grid." I still like that definition--although I've also decided that trying to define such things is a hopeless task. But here goes another attempt.

Having bought a new toaster, we simply plug it in: the electric power grid obviates the need to also buy and install a new generator. By analogy, information technologists refer to "the grid" when talking about on-demand computing.

Like its namesake, a grid is a mix of technology, infrastructure, and standards. The technology is software that allows resource providers (whether individuals or institutions) to federate computers, storage, data, networks, and other resources, and for resource consumers to harness those federated resources when needed. The infrastructure comprises the physical hardware and services that must be maintained and operated for this resource federation and access to occur. Finally, standards codify the messages that must be exchanged, and the policies that must be followed, to achieve those goals.

There is a subtle but important distinction between "a grid" and "the Grid." Any system that allows for resource federation and on-demand access is arguably a "grid", whether general-purpose or application-specific, small or large. "The Grid", like "the Internet", denotes the global set of computers that speak the same protocols. In that sense, "the Grid" is a work in progress, as relevant standards continue to be codified and adopted.

September 21, 2006

Data Analysis Challenges

An important trend with broad implications is the extent to which data analysis tasks are becoming computationally demanding. The problem is that data volumes are growing exponentially, driven by Moore's law; meanwhile, many interesting analyses depend on the intercomparison of data items, and thus have a cost that grows faster than linearly with the amount of data. Thus even exponentially improving processors can't keep up. The fact that storage costs are currently decreasing faster than computing costs makes things worse.
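To make the argument concrete, here is a small sketch under two simplifying assumptions of my own: the number of data items and the processor speed both double each period, and the analysis compares every item with every other item. The starting size `n0` and the function names are illustrative only.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n data items."""
    return n * (n - 1) // 2


def relative_runtime(periods: int, n0: int = 1000) -> float:
    """Runtime of an all-pairs analysis after `periods` doublings,
    relative to the runtime at period 0.

    Assumes (hypothetically) that data volume and processor speed
    both double every period.
    """
    n = n0 * 2 ** periods       # data items after `periods` doublings
    speed = 2 ** periods        # processor speed-up over the same time
    return (pairwise_comparisons(n) / speed) / pairwise_comparisons(n0)


for p in range(6):
    print(f"period {p}: runtime x{relative_runtime(p):.1f}")
```

The work grows about fourfold per period while the hardware only doubles, so the runtime itself roughly doubles each period, even though the processors are improving exponentially.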

We see the impact of these issues in the attached figure, from a nice article by Folker Meyer of the Argonne/Chicago Computation Institute in CTWatch, which contrasts the number of genetic sequences obtained with the number of annotations generated. The issues here are not solely computational, as many annotations are generated manually. But nevertheless, it is striking to see how fast we're falling behind.

As always, a solution to this problem will need to combine improvements in hardware, software, and algorithms:

  • Hardware: Because individual devices aren't getting faster particularly rapidly, we will see increasing parallelism in storage, computers, and networks. We hear about these trends a great deal at places like Google, but they are becoming widespread.
  • Software: As the number of devices and the amount of work to do both increase, software needs to get smarter. We need to orchestrate massively parallel computations across many devices and manage the flow of data into and out of (and among) those computations--and, wherever possible, avoid performing computations by caching and other techniques.
  • Algorithms: Neither hardware nor software improvements can overcome the basic exponentials. Thus we need better algorithms. Probabilistic algorithms that perform sampling to extract "good enough" knowledge will become important. So will the ability to evaluate how "good" a particular conclusion really is.
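As an illustration of the sampling idea in the last bullet, here is a minimal sketch that estimates what fraction of a large collection satisfies some property by examining only a random sample, and reports a standard error as a rough measure of how "good" the answer is. The data set, predicate, and function name are all made up for the example.

```python
import math
import random


def estimate_fraction(items, predicate, sample_size, seed=0):
    """Estimate the fraction of `items` satisfying `predicate` from a
    simple random sample; return (estimate, standard error)."""
    rng = random.Random(seed)                 # fixed seed for repeatability
    sample = rng.sample(items, sample_size)   # sample without replacement
    hits = sum(1 for x in sample if predicate(x))
    p = hits / sample_size
    stderr = math.sqrt(p * (1 - p) / sample_size)
    return p, stderr


# Hypothetical data: one million items, of which exactly 10% "match".
data = list(range(1_000_000))
p, err = estimate_fraction(data, lambda x: x % 10 == 0, sample_size=10_000)
print(f"estimate {p:.3f} +/- {1.96 * err:.3f} (true value 0.100)")
```

Here we look at only 1% of the items, yet the estimate typically lands within a fraction of a percentage point of the true value, and the error bar tells us how far to trust it.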

September 20, 2006

Earth System Grid

I'm at the kickoff meeting for the next phase of the Globus-based Earth System Grid (ESG), a U.S. Department of Energy project developing technology to manage and provide access to large quantities of climate simulation data. The two ESG portals provide access to more than 100 terabytes of output from U.S. and international climate models. The 4000 registered users have so far downloaded more than 130 terabytes of data as they ask questions such as "why are hurricane intensities increasing?" Just last year, these users produced more than 300 scientific papers based on ESG data.

In the next phase, we face big challenges as the quantity of data increases (new petaflop/s computers will generate 10-100 times more data), data becomes more distributed (it can't all be moved to a central location, as at present), the user population becomes larger and more diverse (including, e.g., policy analysts as well as climate scientists), and the sophistication of the data analyses to be performed increases.

One important trend will be increased focus on server-side analysis: as data volumes increase, users must be able to request that data be processed at the data location rather than downloaded to their local system. They need access to data analysis services as well as data download functions, so that they can ask "compare the power spectrum of sea surface temperature in the Nino-3 region from these 10 models" rather than "download ocean temperature data for those models for a 100-year simulation period." Needless to say, server-side analysis of petabytes of data is not easy. We'll be working in the coming months to add such capabilities to ESG.
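The difference between the two styles of request can be sketched in a few lines. This is a toy illustration, not ESG's interface: the in-memory `ARCHIVE`, the model names, and both functions are hypothetical, and a simple mean stands in for a real analysis such as a power spectrum.

```python
from statistics import mean

# Hypothetical archive: model name -> 100 years of monthly values for one region.
ARCHIVE = {
    f"model_{i}": [20.0 + i * 0.1 + (m % 12) * 0.5 for m in range(1200)]
    for i in range(10)
}


def download(models):
    """Client-side style: ship every raw value to the user."""
    return {name: list(ARCHIVE[name]) for name in models}


def regional_mean_on_server(models):
    """Server-side style: reduce the data where it lives and
    return a single number per model."""
    return {name: mean(ARCHIVE[name]) for name in models}


models = sorted(ARCHIVE)
raw = download(models)
summary = regional_mean_on_server(models)

values_downloaded = sum(len(series) for series in raw.values())
values_returned = len(summary)
print(f"download: {values_downloaded} values; server-side: {values_returned} values")
```

Both routes give the user the same answer, but in this toy case the server-side request moves 10 numbers instead of 12,000. At petabyte scale that ratio is what makes the analysis feasible at all.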

If you want to learn more, here is a fairly recent article on ESG architecture and implementation. Globus technology is used for data access, authentication and authorization, distributed system monitoring, and other purposes.

I see ESG as a premier example of service-oriented science--and also a success story for Grid technology.

September 14, 2006

Service-Oriented Science

I mentioned in a recent post an article on "service-oriented science." Here is the abstract. I am certainly not the first to think or express these ideas, but I hope I can spur some discussion on their significance.

New information architectures enable new approaches to publishing and accessing valuable data and programs. So-called service-oriented architectures define standard interfaces and protocols that allow developers to encapsulate information tools as services that clients can access without knowledge of, or control over, their internal workings. Thus, tools formerly accessible only to the specialist can be made available to all; previously manual data-processing and analysis tasks can be automated by having services access services. Such service-oriented approaches to science are already being applied successfully, in some cases at substantial scales, but much more effort is required before these approaches are applied routinely across many disciplines. Grid technologies can accelerate the development and adoption of service-oriented science by enabling a separation of concerns between discipline-specific content and domain-independent software and hardware infrastructure.

September 13, 2006

Free Books

While writing a book on parallel programming in 1993, I saw an early demonstration of Mosaic, and immediately realized that the book should be published online. After some inspired hacking of latex2html by my colleague Brian Toonen, "Designing and Building Parallel Programs" (DBPP) was published simultaneously by Addison-Wesley and at www.mcs.anl.gov/dbpp in early 1994. This must have been one of the first books published on the Web. For a while, it accounted for a third or more of Argonne National Laboratory's web traffic.

What got me thinking about this ancient history is the following text:

The National Academies Press has for some time now been distributing the content of its monographs free on the web, and (thanks in part to a carefully thought-out strategy for doing that) it has seen its sales of print increase dramatically.

I've always thought that making DBPP available online must have increased sales. At least that is what I convinced my editor at Addison-Wesley would happen. However, I've never seen any relevant data. While the National Academies Press doesn't provide data (or explain their "carefully thought-out strategy", which sounds a bit like a cunning plan), this surely counts as anecdotal evidence.

September 12, 2006

Globus turns 10: Time for Celebration and Reflection

The following is the text of an article that I wrote for GridToday on Globus' 10th birthday, which we celebrated yesterday in Washington DC.


Globus Turns 10: Time for Celebration and Reflection

The GlobusWORLD conference being held (jointly with GridWorld and the Open Grid Forum) this week in Washington, D.C., is a significant milestone for those involved in the development and use of the Globus open source Grid software. The reason is that it was 10 years ago (to be precise, on Aug. 21, 1996) that Carl Kesselman and I received our first funding for work on Globus, from DARPA. Gary Minden and Mike St. Johns were our enlightened program managers, followed by Gary Koob. I must also recognize the support of Bob Aiken, Tom Kitchens and, especially, Mary Anne Scott, then all at DoE.

Given this milestone, I will spend some time here recapping history and reflecting on where we have come and what we have learned.

A Little History

10 years is a long time: What on earth have we been doing over that period? Let's revisit some of the highlights.

The emergence of high-speed networks in the 1990s led to an awareness that the Internet could allow for more interesting applications than e-mail and file transfer. (Len Kleinrock had envisioned this possibility back in 1969, but it took a while to get there!) Efforts like the U.S. Gigabit testbed project, led by Bob Kahn, and the Supercomputing'95 I-WAY effort, led by Tom DeFanti and Rick Stevens, helped build awareness of these opportunities. This era also saw pioneering efforts such as the NSF Metacenter, led by Charlie Catlett and Larry Smarr, and Legion, led by Andrew Grimshaw. However, for the most part, every application was constructed from scratch.

We (in particular, myself, Carl and Steve Tuecke) studied this situation and saw a need for standards and software (middleware) to bridge the gap between applications and the complexities of a distributed resource environment. Thus, we started a research project aimed at defining this middleware. Believing strongly that we did not necessarily know the real problems, we started an iterative process of examining the requirements of collaborative communities, prototyping solutions to their problems and feeding back the resulting experiences into a next cycle of research and development. We called this project Globus because it built on earlier technology called "Nexus" and had global goals.

Back in 1996, our ambitions and the needs of our users were far greater than our resources -- a situation that persists today! -- and so it was challenging to develop software that was sufficiently stable and functional to allow for meaningful experiments. Fortunately, we found wonderful application partners -- people like Ed Seidel, Paul Messina and their colleagues, and later members of the high energy physics community -- who were prepared to work with often imperfect software and provide invaluable feedback.

Along the way, we achieved milestones that helped persuade ourselves and others that we had something useful. For example, 1998 saw Sharon Brunett, Karl Czajkowski and others achieve a record-setting military simulation involving 100,298 vehicles distributed over 13 supercomputers at nine sites. Gregor von Laszewski and others demonstrated real-time analysis of data from the Advanced Photon Source. At the SC'98 conference, we demonstrated the "Globus Ubiquitous Supercomputing Testbed Organization" (GUSTO) that spanned some 50 sites worldwide. NASA launched its Information Power Grid project, under the leadership of Bill Johnston.

By 2001, the year in which the TeraGrid was founded, we had software we felt was ready to operate in production environments, if only we could find friendly sites prepared to perform the needed integration, and application scientists ready to develop the necessary application software. In practice, we weren't as ready as we thought we were, but nevertheless we entered a stage -- of learning via experience about the mechanisms and policies required for operational use -- that to some extent continues today. We also received some nice recognition at this time: Globus Toolkit version 2 (GT2) played a key role in a Gordon Bell prize awarded at SC'01 to an astrophysics application that used Cactus, MPICH-G2 and Globus. The following year, R&D Magazine recognized GT2 with an R&D 100 award and named it the "most promising new technology" of the year.

In late 2001, IBM followed up its dramatic open source Linux strategy announcement with a similar announcement about the importance of Grid technologies. We were thrilled when IBM elected to work with us to develop the OGSI Web Services specification and the corresponding Globus implementation, which was released in 2003 as GT3. While this first Web services release provided only modest quality, it spurred much innovative work, such as the video distribution system developed by the Belfast eScience Center for the BBC (to give an idea of the scale of effort underway by this time, BeSC applications alone totaled 1.5 million lines of GT3 code, later adapted for GT4).

2005 saw the release of Globus Toolkit version 4 (GT4), which, thanks to the efforts of talented developers and the able leadership of Lisa Childers, exceeded all previous releases in terms of quality and rigor of both software and documentation. GT4 supports the construction of stateful and secure Web services in Java, C and Python; provides job submission, file transfer, credential management, registry and database access services; incorporates a powerful integrated security system; and provides many other features besides. 2005 and 2006 also saw significant new funding in support of the Globus science community, from the U.S. National Science Foundation's NSF Middleware Initiative (under Kevin Thompson), UK eScience program (for work on OGSA-DAI) and, most recently, from the U.S. Department of Energy's SciDAC program.

Where We Are Today

Someone once dismissed Grid as a "funding concept" -- a witty but irritating turn of phrase. I have not heard that expression lately: Grid is mainstream in both science and industry, and so many people are using Grid technology to solve real problems that it is hard to argue that it is not successful and useful. Indeed, we can make a strong case that Grid has had a significant impact on how people conceptualize and solve problems in many domains.

It is particularly pleasing to see the diversity of Globus application communities, which span, for example, astronomy (e.g., the LIGO gravitational wave observatory, the Caltech Montage service), bioinformatics (e.g., Natalia Maltsev's PUMA system), cancer biology (e.g., the National Institutes of Health's caBIG cancer bioinformatics Grid), data mining (e.g., work by Domenico Talia) and environmental science (e.g., C3grid in Germany and Earth System Grid in the United States). And that is just the first five letters of the alphabet.

I am also delighted with the geographical diversity of Globus deployments. We see substantial Globus deployments and applications in every continent except Antarctica, and just about every day I get e-mail from someone somewhere describing a new deployment of which I was not previously aware. Again, we can walk through the alphabet: Australia, Belgium, China (and Canada and Chile), Denmark, England, France, Germany, Hungary, Ireland, Japan, Korea, Luxembourg, Mexico, the Netherlands, ....

Another area in which we continue to see wonderful progress is in the range of "solutions" that leverage Globus software. Globus middleware does not address end-user requirements directly, but a wide range of Globus-based tools now exist for building portals (e.g., OGCE, GridPort, Jason Novotny and Michael Russell's GridSphere); executing workflows (e.g., Ewa Deelman and Mike Wilde's VDS, David Abramson's Nimrod, Miron Livny's Condor, BPEL); running parallel programs (e.g., Nick Karonis' MPICH-G2); delivering data (e.g., Ann Chervenak's DRS, Reagan Moore's SRB); operating instruments (e.g., Rick McMullen's Common Instrument Middleware Architecture project, GridCC in Europe); remote service invocation (e.g., Ninf in Japan); and so on. Lee Liming has done a nice job documenting these and other "solutions."

It is also pleasing to see the progress being made in industry. Steve Tuecke left Argonne in 2004 to form Univa Corp., which provides commercial support for Globus software and is building new products using Globus (disclaimer: I am also a Univa founder and advisor). They are discovering that the concerns of industry are increasingly similar to those of science, as the need to accelerate innovation processes leads to a need for dynamic resource sharing between organizational units.

I should also mention the progress made with standards. Globus contributors, notably Von Welch, played major roles in the Grid Security Infrastructure standard, which has been widely adopted. The same is true for GridFTP, under the leadership of Bill Allcock. The Job Submission Description Language (JSDL) and Basic Execution Service (BES) specifications, which seem likely to see wide adoption, build heavily on GRAM. Globus project members, notably Frank Siebenlist, have also contributed heavily to the increasingly important WS-Security, SAML2 and XACML specifications.

It is a nice coincidence, given our anniversary, that August saw the release of the WS-ResourceTransfer specification by HP, IBM, Intel and Microsoft -- perhaps signaling the end of a standards odyssey that began in 2001 when Steve Tuecke and others defined the Open Grid Services Infrastructure (OGSI). The goal was to codify Web services mechanisms for representing and accessing state, a requirement that appeared in many different contexts. Like Ulysses, we did not know we were embarking on an Odyssey when we began. However, the release of WS-ResourceTransfer -- remarkably similar to OGSI! -- suggests that we may soon reach this journey's end.

Also worthy of celebration is the tremendous growth in the size of the Globus developer community. In the beginning, there were just three of us, plus a few partners such as Craig Lee at the Aerospace Corp. The team grew over time, as talented researchers and developers joined us at Argonne, the University of Chicago and USC Information Sciences Institute, and then other organizations partnered with us, notably the National Center for Supercomputing Applications (Jim Basney, Von Welch and others), the University of Edinburgh (Malcolm Atkinson, Neil Chue Hong, Mark Parsons and others) and PDC in Sweden (Olle Mulmo and others). Most recently, the new dev.globus development process (modeled after that of Apache Jakarta) has partitioned Globus into dozens of independent projects, each with its own developers, and opened the way for new projects to join. The response has been enthusiastic: under the leadership of Jennifer Schopf, our new incubator process already has 11 incubator projects up and running.

Reflections

We have learned a tremendous amount in the past 10 years. It is hard to know where to start in terms of summarizing lessons learned, but here are a few thoughts.

We were clearly correct in identifying large-scale collaboration as an important problem, and in choosing science as a good place to start identifying requirements and experimenting with solutions. We have seen the need to federate data and computing, orchestrate the allocation of resources to different purposes and manage the policies that govern these activities become increasingly important, first across science and now in industry too. Indeed, these questions are arguably now central to the critical question of how innovation occurs within and across organizations.

Along the way, we have learned (and I am sure must continue to relearn) the need to evolve the software and to reinvent ourselves as both user requirements and the external technology environment evolve. For example, we adopted public key security technology early: a successful step, although the configuration tools needed for convenient use have taken time to emerge. We adopted LDAP as a directory service technology: less successful, and later abandoned. In 2002, we started a major shift to Web services technology: also a positive development overall, although we were arguably premature, given the maturity of Web services technologies at the time. In the future, we will need to respond to the emergence of commercial Web services, like Amazon's S3 and EC2 services, and to other developments that we have yet to recognize.

Our decision to pursue an open source approach and a non-viral license was also clearly correct. It was not necessarily the obvious choice back in 1996, and required a lot of hard work to define the necessary licenses and get the required approvals. (I realized just how much work when a lawyer asked Steve Tuecke, who handled much of the early work on licenses, if he had considered law school!) However, this choice has allowed us to scale the development team and user community in ways that would not have been possible with a proprietary solution. Our recent move to a pure Apache license is, I hope, the final culmination of this approach.

We have struggled with numerous issues over the years relating to the fact that any large-scale collaboration (and thus a grid) is a system and, as such, involves a great diversity of software, hardware, institutions and, above all, different people: users, tool developers, application developers, operations staff, security staff and others. The result is considerable complexity in terms of requirements and also significant challenges in how requirements and capabilities are communicated to different groups.

One inevitable consequence of this complexity is that Grid and Globus are not easily characterized, and thus we have struggled to overcome various misconceptions over the years. One is that Grid is somehow an alternative to high-end computing -- rather than an essential adjunct to high-end computing, enabling remote access and the distribution of the resulting data products. Another is that a Grid is about "free computing." A third is that Globus is a turnkey solution to Grid problems. We have been careful to emphasize that Globus is middleware, not application software, but we still hear complaints that "I installed Globus, but it didn't solve my problem."

I'd also say that we didn't internalize sufficiently at the beginning the extent to which Grid was a policy and operations problem. Fortunately, we've seen some wonderful people get involved with these issues, with the result that we have become increasingly good at creating and operating grids that work. Projects like EGEE, Open Science Grid and TeraGrid have taught us a lot.

In a different space, I remain concerned by the amount of redundancy and lack of interoperability that we see across the Grid community. Given the natural human enthusiasm for novelty (often encouraged by funding agencies and commercial pressures), this diversity is not a surprise. However, I expect that convergence will occur, as people come to understand the high cost of redundant effort, and the tremendous advantages of mature, robust, open source software.

Overall, though, the current situation and future prospects are incredibly encouraging and positive. The requirements that we set out to address with Globus 10 years ago have proved to be quasi-universal. It is no longer eccentric scientists and niche communities who use Grid technology, but mainstream science communities and (increasingly) commercial users. We have a set of technologies that, while certainly not a complete solution, address key requirements. We also see convergence on standards and increasingly broad adoption of those standards in both open source and proprietary software. Finally, and most important, we have a vibrant, sometimes contentious but always enthusiastic, international community of developers and users who are committed to moving the technology forward. We should all look forward to the 20th anniversary of Globus -- by which time, if the Internet is any guide, Grid technology will be ubiquitous.

In writing this document, I have tried to acknowledge some of the many contributors to Globus software, deployments and applications. I, of course, have omitted many more names than I have included. I hope that those omitted will forgive me, and that other readers will feel inspired to learn more about individual projects and those that made them happen.

Happy 10th birthday Globus!

September 11, 2006

dev.globus explained ...

An interview with Jennifer Schopf explains the "dev.globus" community development process that we established this year for Globus software.

September 09, 2006

HPC and Competitiveness

I participated on Thursday in a panel at the HPC Users Conference, run by the U.S. "Council on Competitiveness." I spoke on how the U.S. national laboratories can partner with companies in a mutually beneficial way. The panel reinforced for me some important points that I think need to be more broadly appreciated:

  • High-performance computing (HPC) is increasingly central to competitiveness, not just in traditional areas like aerospace and automotive, but also in new areas like corporate data mining and consumer product design. (Amusing success story: Procter & Gamble used a supercomputer to study the airflow over its Pringles potato chips to help stop them from fluttering off the company's assembly lines.)
  • Successful lab-industry partnerships can be about far more than access to supercomputers--they can involve codevelopment of advanced software systems. For example, Terry Talley spoke about how Acxiom (they have your credit card data, if you live in the U.S.) had worked for four years with the PVFS team at Argonne.

The two industrial participants in the panel were interesting. Terry Talley talked about how Acxiom is using Grid computing internally. The CTO from DreamWorks talked about the amount of computing involved in modern animated features: 10,000,000 CPU hours for Shrek1, 15,000,000 CPU hours for Shrek2, and so on. He also talked about how they are using DOE supercomputers in an exploration of interactive (instead of overnight) rendering. So even the most advanced users can imagine using computers in far more powerful ways.

September 08, 2006

Attribute-Based Authorization on TeraGrid

Charlie Catlett writes about plans to deploy attribute-based authorization on TeraGrid. It is neat to see people working to make national-scale authentication and authorization work.

September 06, 2006

SOS in Japan

I participated in an interesting conference in Japan last week, "The Fusion Between Policy Science and Information and Communication Technology," that brought together social scientists and computer scientists. I particularly enjoyed a talk by John Zysman and the panel discussion on whether and how advanced computing can help policy (more on those topics later, perhaps).

I was asked to speak on "Scientific Impact of Grid Computing." I enclose my talk below.

Scientific Impact of Grid Computing

The subject of my remarks today is the impact of grid computing on science. I will first provide some context, reviewing how the nature of science is changing as a result of (among other things) technological developments. I will then explain the relevance of Grid technologies to these developments and review experiences to date with the application of those technologies, and finally I will talk about how the advent of service oriented approaches promises (in my view) to transform many aspects of scientific research in the future.

First, context. We are talking today about grid and science for a reason, and that reason is the sustained exponential change in technology that has over the past 50 years been producing ever more data, enabling ever more computing, and connecting us all ever more closely.

The consequence of these developments is not only quantitative but also qualitative changes in how we tackle some of the most challenging and urgent scientific problems of our age, from climate change to disease. Increasingly, research involves the analysis of large quantities of data, large-scale numerical simulation, and intensive and interdisciplinary collaboration. The technologies that we used previously to store, transmit, process, and communicate data—workstations, DAT tapes, Fedex, even scientific journals, some would argue—are no longer as effective as they were.

We also see the emergence of new research methodologies and organizational structures, as evidenced by this image of the collaboration that is building the Large Hadron Collider at CERN. In this project, which is not so different in broad strokes from those that sequenced the human genome or that managed the response to SARS, we have different overlapping groups of varying sizes, some sharing data, some competing, all ultimately contributing to the solution of the problem at hand.

These developments have many profound implications for research methodologies, education, resource allocations, and so forth. In particular, they demand information technology infrastructures, and Grid is part of this emerging new technology landscape.

I should also note, just as an aside—but an important aside—that as science becomes more information intensive, so the importance of computer science increases. Astronomer George Djorgovski goes so far as to claim that, “applied computer science is now playing the role that mathematics did from the 17th through the 20th centuries: providing an orderly, formal framework and exploratory apparatus for other sciences.” As a computer scientist, that message appeals to me!

In this context, then, Grid has come to play a valuable role as a unifying concept and technology for applications that require the federation of resources (computers, storage, data, people, etc.). Why the name “grid”? Having bought a new rice cooker, we simply plug it in: the power grid obviates the need to also buy and install a new electrical generator. By analogy, information technologists refer to “the grid” when talking about on-demand computing.

Like its namesake, a grid is a mix of technology, infrastructure, and standards. The technology is software that allows resource providers to federate computers, storage, data, networks, and other resources, and allows resource consumers to harness those federated resources when needed. We can categorize this software as “system-level” (software that implements common management interfaces to underlying resources, such as the open source Globus software that I have been involved in developing) and “user-level” (such as the Ninf software from Japan). Together, these software layers bridge the gap from applications to resources.

The infrastructure comprises the physical resources and services that must be maintained and operated for this resource federation and access to occur. Examples of services include registries and certificate authorities.

Finally, the standards codify the messages that must be exchanged, and the policies that must be followed, to achieve those goals.

Together, these technologies, infrastructure, and standards allow us to bridge the otherwise substantial gap between applications and the physical computers, storage systems, networks, and other devices that those applications need to operate.

Let us move on now to how Grid technology is being used in science. Today, this use is primarily directed at enabling, as I have indicated, on-demand access to computing, storage, and other devices. For example, the U.S. Network for Earthquake Engineering Simulation (NEES) implements service interfaces that allow for remote access to, and monitoring and control of, experimental apparatus for earthquake engineering as well as simulation codes and data archives. NEES has been used to conduct distributed hybrid experiments, in which components of a large structure are tested via a mix of numerical simulation and physical simulation at different sites. This is a technique pioneered in Japan, by the way. NEES is transforming the nature of earthquake engineering research in the U.S.

The Earth System Grid provides access to large climate model datasets such as those produced by the Intergovernmental Panel on Climate Change assessment. The substantial impact of this service on the climate research community is indicated by the large number of users, the volume of data downloaded, and the number of resulting research articles.

The TeraGrid is the premier U.S. “cyberinfrastructure,” to use a term popular in the U.S. TeraGrid links supercomputers and storage systems at eight sites with an extremely fast network, and deploys standard Grid software across these resources so that scientists can obtain large amounts of computation and storage when required to support their science.

By thus standardizing on interfaces and policies, TeraGrid seeks to transform its diverse sites and computers into interchangeable providers of computing power. An application (for example, a medical data analysis application) can then acquire needed computing, storage, and network capacity to achieve its scientific objectives.

Increasingly, TeraGrid is being viewed as a system that does not simply provide computing resources for individual scientists, but also hosts services for communities. This emerging new role is significant and I believe will result in a considerably greater impact on the scientific community.

For example, PUMA is an information system that provides access to data computed by integrating genomic and proteomic data. To its several thousand users, it is simply a Web site. However, behind the scenes, PUMA is making extensive use of TeraGrid and other Grid infrastructures to perform its data integration. Indeed, PUMA code routinely runs on 1000 processors when integrating new data.

As these examples show, Grid as a technology for on-demand access to computing is already widely deployed, and is having a significant impact on science in numerous fields. Nevertheless, I believe that these successes are only a first step towards a far greater impact on science. This leads me to the third part of this talk, in which I discuss what I see as the next major thrust for grid computing and for science as a whole.

In traditional approaches to research, communication among researchers occurs primarily via publication in peer-reviewed journals. Information technology may play a role as a tool during the research process, but does not change the nature of this communication process.

What I call service-oriented science adds a new modality of communication, namely the creation of computational services—that is, network-accessible programs that implement a convenient interface and that provide access to data and/or computational capabilities.

Such services allow for new research methodologies, as follows. Someone publishes a service: for example, PUMA, which, as I described earlier, provides access to derived data products—or, perhaps, to an enhanced PUMA that allows its clients to supply their own genomic data to be integrated with that maintained by PUMA.

Another researcher discovers that service and uses it in their research. In a first instance, they may simply query PUMA from their Web browser. However, as they get more ambitious, they may also compose calls to PUMA with calls to other services (for example, a service for computing metabolic pathways) in what we call a workflow. In this way, they can scale up dramatically the number of questions that they can ask and get answered. This automation of data analysis tasks is an important consequence of service-oriented science.

Even more interesting is what can happen next. The researcher may decide that this workflow that they have developed captures a broadly useful analysis technique, and decide to publish that workflow as a new service that may itself be discovered and called by others. Thus we may achieve a virtuous circle of innovation.
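The compose-then-republish pattern described above can be sketched in a few lines of Python. Everything here is hypothetical: `integrated_annotations` stands in for a PUMA-like query, `pathways_for` for a metabolic pathway service, and the in-memory `registry` for a real service registry. None of these names or data values comes from PUMA or Globus; real services would be network endpoints rather than local functions.

```python
from typing import Callable, Dict, List

# A "service" here is just a callable taking one string argument.
Service = Callable[[str], List[str]]

# Hypothetical in-memory registry standing in for a real service registry.
registry: Dict[str, Service] = {}


def publish(name: str, service: Service) -> None:
    """Register a service under a name so that others can discover it."""
    registry[name] = service


def integrated_annotations(genome_id: str) -> List[str]:
    """Toy stand-in for a PUMA-like query: genome id -> enzyme labels."""
    data = {"g1": ["enzymeA", "enzymeB"], "g2": ["enzymeC"]}
    return data.get(genome_id, [])


def pathways_for(enzyme: str) -> List[str]:
    """Toy stand-in for a metabolic pathway service: enzyme -> pathways."""
    data = {
        "enzymeA": ["glycolysis"],
        "enzymeB": ["glycolysis", "TCA cycle"],
        "enzymeC": ["TCA cycle"],
    }
    return data.get(enzyme, [])


publish("annotations", integrated_annotations)
publish("pathways", pathways_for)


def genome_pathways(genome_id: str) -> List[str]:
    """Workflow: chain the two services and drop duplicate pathways."""
    found: List[str] = []
    for enzyme in registry["annotations"](genome_id):
        for pathway in registry["pathways"](enzyme):
            if pathway not in found:
                found.append(pathway)
    return found


# The workflow is itself published, so others can discover and call it.
publish("genome_pathways", genome_pathways)
```

A caller who looks up `registry["genome_pathways"]` does not need to know that two other services sit behind it, which is the point of the virtuous circle.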

The astronomers have been pioneers in the adoption of service-oriented science techniques. If you are not familiar with what they are doing, I encourage you to study it: it is very impressive.

So-called virtual observatories are providing on-line access to digital sky surveys at different wavelengths, thus allowing astronomers to ask sophisticated questions from the comfort of their desks: for example, what objects are visible in the infrared but not the optical? (The answer to this query can identify candidate brown dwarfs, a class of star identified only recently.) What makes this sort of question possible is that different archives in different countries support the same service interfaces and furthermore publish information about their content into standardized registries. 
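As a rough illustration of the "infrared but not optical" question, here is a minimal Python sketch. The catalogs and object names are invented, and the sketch assumes that both surveys already share object identifiers; real archives would first have to cross-match detections by sky position.

```python
from typing import Dict, Set

# Hypothetical survey catalogs: object identifier -> detected flag.
# Assumption: identifiers are already shared between the two surveys.
infrared_survey: Dict[str, bool] = {"obj1": True, "obj2": True, "obj3": False}
optical_survey: Dict[str, bool] = {"obj1": True, "obj2": False, "obj3": True}


def detected(catalog: Dict[str, bool]) -> Set[str]:
    """Return the identifiers of objects the survey actually detected."""
    return {name for name, seen in catalog.items() if seen}


def infrared_only(infrared: Dict[str, bool], optical: Dict[str, bool]) -> Set[str]:
    """Objects seen in the infrared survey but not in the optical one."""
    return detected(infrared) - detected(optical)


candidates = infrared_only(infrared_survey, optical_survey)
```

The query reduces to a set difference only because each archive answers the same kind of request in the same form, which is what the shared service interfaces provide.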

I need to emphasize that while service-oriented science has tremendous potential, there are obstacles to achieving the virtuous circle of innovation that I mentioned earlier. These obstacles include not only technical concerns (how do we create, publish, register, discover services) but also methodological and policy issues. I mention three such issues here; I am sure that you can think of others.

First, by reducing barriers to accessing and using data and computational procedures, we can significantly accelerate the research process, which in turn can allow researchers to ask more questions and thus, we may hope, be more innovative. This is not in itself a problem, but does require new ways of thinking about research.

Second, as data and procedures are made available as services, they become “results” in a similar manner to data published in scientific journals: that is, scientific conclusions based on data and assumptions, and on which others may build further research. But how can those others know whether to trust the data or procedures on which they build? How do they document their assumptions? We need mechanisms for evaluating quality and documenting provenance. Otherwise we will just construct a house of cards.

Third, there is the question of how we motivate people to contribute and run services. Reward systems need to change so that researchers who do a good job of constructing services get recognized and promoted. We also need to train people to create services: arguably, we need a new class of “data scientists” expert in these issues. Finally—and here is where we get back to Grid—we need substantial new infrastructure to host services. Let me explain why.

Here is a somewhat simplistic view of a virtual observatory. Let us assume that we have configured what is now a rather small digital sky survey, the Sloan, some 10 terabytes in size, to run on our small local server. Initially, we and our users are delighted: astronomers around the world can use their Web browsers to retrieve data about individual astronomical objects. However, we soon find that astronomers are writing programs that ask more complex questions, involving perhaps tens of thousands of objects. And then the number of people asking questions increases. Suddenly we need many many computers to meet demand, and that is not something that our small group is set up to handle.

Such issues point to a new role for the traditional supercomputer center, as a hoster of services. I will illustrate how this can work by describing a service we have constructed at Chicago, in collaboration with some astronomers. The problem we have addressed is that of stacking images from different areas of the sky, something one does to improve signal to noise ratios when looking at, for example, quasars. One may want to access tens of thousands of cutouts from different areas in the sky, which is both a data-intensive and a computation-intensive task.
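To make the stacking idea concrete, here is a minimal Python sketch that averages equally sized cutouts pixel by pixel. It is not the service built at Chicago; the `stack_cutouts` name and the list-of-lists image representation are assumptions chosen for illustration. For independent noise, averaging N cutouts reduces the noise by roughly a factor of the square root of N, which is why stacking improves signal to noise.

```python
from typing import List

Image = List[List[float]]  # a cutout as rows of pixel values


def stack_cutouts(cutouts: List[Image]) -> Image:
    """Average equally sized cutouts pixel by pixel."""
    if not cutouts:
        raise ValueError("need at least one cutout")
    rows, cols = len(cutouts[0]), len(cutouts[0][0])
    for image in cutouts:
        if len(image) != rows or any(len(row) != cols for row in image):
            raise ValueError("all cutouts must have the same shape")
    count = len(cutouts)
    # Each output pixel is the mean of that pixel across all cutouts.
    return [
        [sum(image[r][c] for image in cutouts) / count for c in range(cols)]
        for r in range(rows)
    ]
```

With tens of thousands of cutouts, the cost lies less in this arithmetic than in locating and reading the cutouts, which is what makes the task data-intensive as well as computation-intensive.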

We have built a service to perform this function that runs on the TeraGrid. This service is constructed to acquire and release resources dynamically as load varies, thus allowing it to provide good response times regardless of load. To give an idea of the revolutionary impact such technologies can have, we are able to perform in 3 minutes a stacking that previously took a postdoc 3 months. The need for such services is going to explode in the coming years, as data volumes increase, the analyses performed on that data become more sophisticated, and users become more comfortable with service-oriented approaches to science.
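One simple way to picture acquiring and releasing resources as load varies is a sizing rule that maps the current backlog to a worker count. The Python sketch below is an assumption for illustration only: the `workers_needed` function, its parameters, and its default limits are hypothetical and do not describe how the TeraGrid service actually makes this decision.

```python
import math


def workers_needed(
    queued_requests: int,
    requests_per_worker: int = 10,
    min_workers: int = 1,
    max_workers: int = 64,
) -> int:
    """Return how many workers to hold for the current backlog.

    The count grows with the queue and shrinks when it drains,
    but always stays within [min_workers, max_workers].
    """
    if queued_requests < 0 or requests_per_worker <= 0:
        raise ValueError("queue length must be >= 0 and capacity > 0")
    wanted = math.ceil(queued_requests / requests_per_worker)
    return max(min_workers, min(max_workers, wanted))
```

A service would re-evaluate such a rule periodically, requesting more workers when the result rises and releasing them when it falls.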

To summarize, I have addressed three issues in my talk. First, the broader context, which is the impact of technological exponentials on scientific methodologies and organizations, and the consequent need for new information technology. Second, the important role that Grid plays as a unifying concept and technology for applications that require the federation of distributed resources, and the successes that have been achieved in using Grid technologies to enable on-demand access to computers, storage, data, and other resources. Third, the significance of the transition that we are currently seeing to service-oriented science, which I think has profound implications for what it means to be creative, to communicate scientific results, and to build infrastructure for science.

September 05, 2006

WS-ResourceTransfer Specification Released

Some technology news that will concern only Web Services enthusiasts. But good news, nonetheless. In short: we may be nearing the end of the odyssey that started back in 2001 when we released the Web Services specification for managing state called Open Grid Services Infrastructure (OGSI).

Like Ulysses, we didn't plan on an Odyssey: our ambitions with OGSI were to define basic mechanisms as a first step towards more interesting work. However, first some people didn't like our aggressive use of WSDL 2.0 features (in retrospect a mistake, as WSDL 2.0 still isn't widely supported), which spurred the definition of the WS-Resource Framework (WSRF). Then industry politics led to the competing WS-Transfer specifications.

But finally sanity seems to have prevailed. Microsoft, IBM, and HP just released the new WS-ResourceTransfer (WS-RT) specification, bringing WSRF WS-ResourceProperties functionality into the WS-Transfer universe. This specification seems to provide all of the WS-ResourceProperties functionality used in Globus Toolkit version 4 (GT4): in particular, GetResourceProperty, GetMultipleResourceProperties, and QueryResourceProperties functionality. It also seems straightforward to integrate notification, which will be done in a future WS-EventNotification spec. There is even a Create operation, included in OGSI but not in WSRF.
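For readers who have not met these operations, the following toy Python model shows roughly what the three named operations do to a stateful resource. It is only an in-memory analogy: the `ToyResource` class and its method names are hypothetical, the real specifications exchange XML messages rather than Python calls, and the predicate argument is a stand-in for a proper query expression.

```python
from typing import Any, Callable, Dict, List


class ToyResource:
    """In-memory analogy of a stateful resource with named properties."""

    def __init__(self, properties: Dict[str, Any]) -> None:
        self._properties = dict(properties)

    def get_resource_property(self, name: str) -> Any:
        """Analogy for GetResourceProperty: fetch one property by name."""
        return self._properties[name]

    def get_multiple_resource_properties(self, names: List[str]) -> Dict[str, Any]:
        """Analogy for GetMultipleResourceProperties: fetch several at once."""
        return {name: self._properties[name] for name in names}

    def query_resource_properties(
        self, predicate: Callable[[str, Any], bool]
    ) -> Dict[str, Any]:
        """Analogy for QueryResourceProperties: select properties by a test."""
        return {n: v for n, v in self._properties.items() if predicate(n, v)}


# Hypothetical resource representing a running job.
job = ToyResource({"state": "running", "cpus": 16, "owner": "alice"})
```

The point of standardizing such operations is that a client can inspect any conforming resource the same way, whatever the resource represents.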

In summary, WS-RT seems to provide what we need in Globus, and in a manner consistent with WSRF/WSRP. Assuming WS-EventNotification does the obvious things, then going from WSRF to these new specifications should be fairly straightforward.

I don't imagine that the Globus community will rush to adopt these specifications, but I imagine that we will want to implement them in the not too distant future, so that people who want to work with them can do so.

September 04, 2006

Cyberinfrastructure for Humanities and Social Sciences

I just read "The Report of the American Council of Learned Societies Commission on Cyberinfrastructure for Humanities and Social Sciences." (Quite a mouthful.) As the report says:

Science and engineering have made great strides in using information technology to understand and shape the world around us. This report is focused on how these same technologies could help advance the study and interpretation of the vastly more messy and idiosyncratic realm of human experience.

This is a fascinating and compelling ambition and vision. However, while I enjoyed reading the report, I thought it could have said much more about how to achieve that goal.

One new insight (probably obvious to most others) that I gained from the report was the extent to which, in contrast to at least most science and engineering (maybe species diversity is an exception, and astronomy due to the large amateur astronomy community), the humanities need cyberinfrastructure not simply to enable innovative research approaches, but also for purposes of preservation and access (in their case, of/to the human cultural record).

Much of the report is concerned with the latter topic. It makes a strong case for investment in the creation and maintenance of collections, and for openness in access and standards. It is hard to disagree with these conclusions. On the other hand, there is little consideration given to how to prioritize such work given scarce resources--a question that presumably should depend in part on what are viewed as research priorities.

The Commission's charge included these questions:

What are the "grand challenge" problems for the humanities and social sciences in the coming decade? Are they tractable to computation?

The answers to these questions seem critical to the future of not only the humanities and social sciences but also (if we believe that the humanities and social sciences are relevant to society) to humanity. Unfortunately, we do not find these answers in this report. Nor do we learn which aspects of cyberinfrastructure, and investigative approaches, are most likely to be useful.

The report does make some interesting remarks on the wide variety of methods that may be applicable:

The activity of discovering and interpreting patterns in large collections of digital information is often called data-mining (or sometimes, when it is confined to text, text-mining), but data-mining is only one investigative method, or class of methods, that will become more useful in the humanities and the social sciences as we bring greater computing power to bear on larger and larger collections, and more complex research questions, often with outcomes in areas other than that for which the data was originally collected. Beyond data mining, there are many other ways of animating and exploring the integrated cultural record. They include simulations that reverse-engineer historical events to understand what caused them and how things might have turned out differently; game-play that allows us to tinker with the creation and reception of works of art; role-playing in social situations with autonomous agents, or using virtual worlds to understand behavior in the real world.

A broad and exciting list. But in the absence of defined research priorities for the humanities and social sciences, and an understanding of where those prioritized research tasks can benefit from computation, we can't even start to discuss which of these techniques are most important to pursue.