June 19, 2008

Argonne is Number One ...

It is worth noting that Argonne's IBM Blue Gene system is officially the fastest supercomputer worldwide for open science, according to the June 2008 Top 500 rankings. It is also third fastest overall, after systems at LANL and LLNL used for nuclear weapons work. Many kudos to Ray Bair, Pete Beckman, and the rest of the team at the Argonne Leadership Computing Facility. (And to IBM too ...)

Some of my colleagues, notably Ioan Raicu, Zhao Zhang, and Mike Wilde, have been running large "many task" applications in biology and economics on this system, using Falkon and Swift. So far we're up to 32K of the total 160K cores, and it's a delight to work with.

August 23, 2007

reCAPTCHA: stop spam read books

I'm back from a longish vacation on Lake Huron, North Manitou Island, and other pleasant places. I think I'll try starting up my blog again.

Here's a fascinating follow-up to my post a while back on open source problem solving in science: Luis von Ahn's latest project, which uses fragments of old text as CAPTCHAs. Thus, when you type those irritating words to prove you are human, you are helping transcribe old books. Brilliant.

July 12, 2007

iPhone

In a rare moment of technological enthusiasm, I bought an iPhone. I am learning about fun things to do with it. (No, it seems to be a great device, although it's not yet clear to me that it is better than my Blackberry.)

July 09, 2007

Article describes U.Chicago TeraPort

An article today in the University of Chicago Chronicle talks about our TeraPort cluster, which, while a humble system, is of great importance to us in the Computation Institute. As well as the usual applications from the physical and biological sciences, we have less conventional users who can comment that, for example:

“Initial work on lexical choices in this collection has revealed striking differences in the word use and sense between male and female authors and characters, as well as American and non-American authors.” [Mark Olsen and his co-authors in a paper on data mining of black drama.]

May 31, 2007

University of Canterbury buys Blue Gene

My alma mater, the University of Canterbury in New Zealand, just announced that they will acquire a 4,096 CPU IBM Blue Gene (BG/L) supercomputer. This will give New Zealand a second entry in the Top 500 list, well ahead of Weta Digital. (Funnily, the local paper in Christchurch reported that they were acquiring a 130,000 CPU, 270 teraop/sec Blue Gene. That would have been something.)
I'm off to New Zealand the first week of July for the KAREN Forum in Auckland--basically an eScience conference, organized by the New Zealand advanced network organization.

May 25, 2007

And then there were five ...

Following my post about the four European middleware platforms, I am told about a fifth, XtreemOS. "The XtreemOS system will offer an alternative to the Globus toolkit, which is currently the most widespread middleware system." A noble goal!

XtreemOS will extend Linux with native (kernel-level) support for virtual organizations. I like the concept of building support for grid execution in at the lowest levels, so that every computer, whether desktop or palmtop, is ready to participate in collaborative activities. On the other hand, I am not at all sure what form that support should take--or even if we need more than is already provided in today's kernel. I would think that a working hypothesis should be that we don't need any.

Interestingly, Red Hat announced plans to integrate Condor scheduling capabilities into their Linux distribution. However, their goals are fairly narrow, focused on facilitating the use of remote computers for computation.

April 09, 2007

Zettabytes!

A recent IDC report claims that:

  • The world created 161 exabytes (1.6 x 10^20 bytes) of digital data in 2006
  • By 2010, annual data output will reach one zettabyte (1 x 10^21 bytes)
  • In 2006, there were one billion devices capable of capturing digital images

The 2003 Berkeley study estimated that 5 exabytes were produced in 2002, with a 30% annual growth rate. Thus the IDC estimates are considerably higher. The reason is that IDC takes a much broader definition of "created", including digital cameras, local copies of data, etc.
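To see how far apart the two studies are, here is a rough back-of-the-envelope sketch. It assumes that the Berkeley 30% rate compounds annually from 2002 and that one zettabyte is 1000 exabytes; the function names are made up for illustration.

```python
def implied_annual_growth(start, end, years):
    """Constant yearly growth rate that turns `start` into `end` over `years`."""
    return (end / start) ** (1.0 / years) - 1.0


def project(start, rate, years):
    """Compound `start` forward by `rate` per year for `years` years."""
    return start * (1.0 + rate) ** years


# Figures quoted in the post, in exabytes (assumption: 1 zettabyte = 1000 exabytes).
IDC_2006_EB = 161.0
IDC_2010_EB = 1000.0
BERKELEY_2002_EB = 5.0
BERKELEY_RATE = 0.30  # assumption: compounded once per year

idc_rate = implied_annual_growth(IDC_2006_EB, IDC_2010_EB, years=4)
berkeley_2006_eb = project(BERKELEY_2002_EB, BERKELEY_RATE, years=4)
gap = IDC_2006_EB / berkeley_2006_eb

print(f"IDC implied growth: {idc_rate:.0%} per year")
print(f"Berkeley estimate carried to 2006: {berkeley_2006_eb:.1f} EB")
print(f"IDC 2006 figure is about {gap:.0f}x larger")
```

Under those assumptions, the IDC numbers imply growth of nearly 60% a year, and the Berkeley estimate carried forward to 2006 comes to about 14 exabytes, roughly an eleventh of the IDC figure.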

February 28, 2007

MidWest Grid Workshop

The MidWest Grid Workshop will be held at the University of Illinois in Chicago on March 24 and 25. From the Web page:

The aim of the workshop is to give the students a basic foundation in distributed computing, and valuable hands-on training in computing techniques. The workshop introduces essential skills that will be needed by students in the natural and applied sciences, engineering, and computer science to conduct and support scientific analysis in the emerging grid computing environment.

Participants will work with some of the world's leading experts in grid computing, through a blend of lectures, discussions and hands-on computing exercises on large-scale grid hardware and software resources.

December 21, 2006

caBIG releases caGrid

I've written previously about the cancer Biomedical Informatics Grid, caBIG, a national-scale network linking research laboratories, cancer centers, and investigator projects to accelerate the development of effective patient therapies for cancer. They just released the (Globus-based) caGrid version 1.0, which implements the core Grid architecture of caBIG to support scientific use cases from the cancer research community. A nice way to end the year. 

December 10, 2006

The S stands for?

My tongue-in-cheek post a while back on Web Fundamentalism generated lots of interesting traffic and pointers. At some point I must internalize and summarize it all, but for now I just read (some of) it. The best thing I've seen so far is Peter Lacey's The S stands for Simple, a hilarious and very relevant Socratic dialog.

November 27, 2006

Gifting Technologies

I enjoyed reading a recent article by Matei Ripeanu and friends, "Gifting technologies: A BitTorrent case study." They look at a set of six BitTorrent communities with different properties and policies, and compare and contrast various metrics such as degree of freeloading and relative contribution of most frequent uploaders. Arguably some of the conclusions regarding how best to encourage "gifting" are obvious, but I don't think they all are, and there are interesting insights into the relative importance of different factors.

November 21, 2006

System-level Science

This month's issue of IEEE Computer includes four articles on system-level science: the integration of diverse sources of knowledge about the constituent parts of a complex system with the goal of obtaining an understanding of the system's properties as a whole. This being IEEE Computer, they focus in particular on information technology (IT) issues involved in achieving scientific goals:

[S]ystem-level science integrates not only different disciplines but also, typically, software systems, data, computing resources, and people. System-level science is usually a team pursuit. Data comes from different sources, different groups develop component models, team members provide specialized expertise, and the often substantial computing and data resources required for success are themselves diverse and distributed. Thus, system-level science itself requires the creation of yet another sort of system that may combine large numbers of both physical and human components.

November 20, 2006

Grid in Government Computer News

Today's issue of Government Computer News (not the most gripping title for a publication ...) has a long article on Grid. The subtitle is "although proven in academia and research, grid computing struggles to find a place in the enterprise," and the author discusses at some length both where grid has been successful and where it has yet to catch on. It's mostly a fair analysis, and the comments on the continued relative difficulty of deployment are right on (although things are improving, thanks to tools such as Introduce). I'd suggest, though, that one reason for the challenging nature of grid deployments is often sheer ambition. Projects like caBIG and TeraGrid, for example, are complex. But they are achieving things that have never been done before.

October 25, 2006

Perspectives on Open Source

The topic of open source arose frequently at the recent GlobusWORLD conference. I find the variety of perspectives on this topic fascinating. I have heard various people opine that:

  1. Open source is diabolical, because it discourages innovation and/or is risky from a legal perspective. Shai Agassi of SAP expressed such views in a much-reported 2005 speech.
  2. Open source is angelic, because it ensures that "speech" (or at least coding) is "free." Richard Stallman is a well-known proponent of this view.
  3. Open source is inevitable, for economic reasons, and, as such, should be embraced as part of the IT ecosystem.

The first two views are familiar; the third is newer, and I think far more interesting, as it permits (at least in principle) a quantitative discussion about when and where it makes sense for software to be open vs. closed.

Underlying this third view is an evolving perspective on where value lies in software. For a long time, value was seen in the basic software itself, viewed as intellectual property. Now, the basic software is increasingly seen as a commodity. Of course, companies still need to ensure that the software functions on a daily basis, and they typically don't want to maintain the necessary expertise in-house. Thus, as Gartner wrote recently:

"open source software is a catalyst that will restructure the industry, producing higher-quality software at lower cost ... it will revolutionize software markets by moving revenue streams to services and support and away from license fees."

Vendors large and small are taking major positions on these views, betting their future on proprietary software (e.g., SAP, Microsoft), open source software (e.g., RedHat, Novell), or both (e.g., IBM, Oracle). It's a fascinating evolution.

Does this mean that IT itself is a commodity? Not exactly. I was talking to Reagan Moore last week, and he expressed the view that value is increasingly in the (proprietary) policies that govern how (open) software is used. That's a perspective that resonates with my experience.

October 09, 2006

Writing Distributed Programs: Why Not Message Passing?

The question of how to write programs for distributed or "grid" environments has stimulated much debate. Some argue that this new environment demands new programming models and languages--and there is certainly merit in that view. However, we can also reuse well-understood models. For example, we can use the Message Passing Interface (MPI) standard to write message passing programs.

The MPI standard defines an API for sending and receiving messages, in both point-to-point and collective modes, and for such things as dynamic process creation. MPI is sometimes criticized as a low-level "assembly language," but it is more accurate to describe it as an abstract but precise notation for describing data exchanges among concurrently executing processes.
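To make the point-to-point versus collective distinction concrete, here is a toy sketch. It is not MPI and uses no MPI library: it stands in Python threads and queues for processes and messages, and the names `run_ranks` and `sum_everywhere` are hypothetical. The collective (a sum that every participant receives) is built from nothing but point-to-point sends and receives.

```python
import queue
import threading


def run_ranks(size, body):
    """Run body(rank, size, send, recv) on `size` threads; return each result."""
    # One inbox per (destination, source) pair, so recv can name its sender.
    inboxes = {(dst, src): queue.Queue()
               for dst in range(size) for src in range(size)}
    results = [None] * size

    def worker(rank):
        def send(dest, message):
            inboxes[(dest, rank)].put(message)

        def recv(source):
            # Timeout so a programming error fails loudly instead of hanging.
            return inboxes[(rank, source)].get(timeout=5)

        results[rank] = body(rank, size, send, recv)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def sum_everywhere(rank, size, send, recv):
    """Toy collective: gather values on rank 0, then send the total to all."""
    local_value = rank + 1  # each rank contributes a different number
    if rank == 0:
        total = local_value
        for source in range(1, size):
            total += recv(source)   # point-to-point receives
        for dest in range(1, size):
            send(dest, total)       # point-to-point sends
        return total
    send(0, local_value)
    return recv(0)
```

Calling `run_ranks(4, sum_everywhere)` gives every one of the four "ranks" the same total. A real MPI program would express the whole of `sum_everywhere` as a single collective call and leave the communication pattern to the library.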

To run message passing programs on grids, consider MPICH-G2 (see paper), a grid-enabled MPI implementation developed by Nick Karonis and his colleagues. MPICH-G2 allows a user to run MPI programs across multiple computers, at the same or different sites, using the same commands that would be used on a parallel computer. It extends the Argonne MPICH implementation of MPI to use Globus services for authentication, authorization, resource allocation, executable staging, and I/O, as well as for process creation, monitoring, and control. Various performance-critical operations, including startup and collective operations, are configured to exploit network topology information. The library also exploits MPI constructs for performance management; for example, the MPI communicator construct is used for application-level discovery of, and adaptation to, network topology. Thus, the user can variously ignore or exploit knowledge of critical aspects of the heterogeneous environment.

MPICH-G2 has been used to run scientifically important applications. One I like is a high-resolution study of blood flow in the human body: highly coupled 3-D simulations of blood flow in critical areas are placed on distinct clusters, and those simulations are coupled via a 1-D simulation of flow through the arterial system.

MPICH-G2 doesn't do everything: for example, it is not particularly fault tolerant. But if you want to run a program fast on a set of distributed computers (on a LAN, MAN, or WAN), and are prepared to accept failure of one component resulting in failure of the whole (as is often desirable, in fact), it's a powerful tool.

For more information, see: N.T. Karonis, B. Toonen, and I. Foster, "MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface," J. Parallel and Distributed Computing, vol. 63, no. 5, 2003, pp. 551–563. There are also a number of application papers available.

October 07, 2006

Nicholas Carr, Amazon Web Services, and Globus

The massive infrastructure investments being made by companies like Google, Amazon, eBay, and Microsoft are having interesting consequences. First, we get 500,000 computers at Google indexing the Web for us, for free. Then, via Amazon Web Services, we get on-demand access to storage, computing, and (most recently) message delivery, all via simple Web service interfaces--not for free, but at a relatively low cost.

Nicholas "IT Doesn't Matter" Carr reported recently on a speech and an interview by Jeff Bezos on Amazon's forays into Web services. Carr's posting, and subsequent comments, raise some interesting questions about the economics of IT as a utility. If Amazon's current offering takes advantage of the fact that its computers are often idle, what happens as demand increases? When is it better to outsource vs. insource? We don't understand such issues yet, but I can't help suspecting that that crazy "grid" idea is becoming very real.

Carr perhaps views Amazon Web Services as support for his view that IT has become a commodity. But grid utilities providing computing, storage, and other basic functions seem likely rather to spur explosive innovation in IT applications. What new applications will we see? (Those described for Amazon seem, so far, rather dull: e.g., backup and picture archiving.) Will those new applications demand new capabilities from utilities, spurring differentiation? And when will people want to deploy their own implementations of these services, rather than trusting to others to provide them?

Our Globus software provides grid utility services such as GRAM (on-demand service deployment), Workspaces (virtual machines), and GridFTP (storage). These services have interfaces richer than those of similar Amazon Web Services, and the additional capabilities have proved important when deploying those services at remote locations and when using them to implement higher-level services such as policy-driven data delivery (e.g., DRS) and distributed computing (e.g., VDS). However, I believe that some fairly simple refactoring can allow those same higher-level services to drive operations on utilities provided by the likes of Amazon. That will make it feasible for users to mix outsourced and insourced IT functions in interesting ways.

October 06, 2006

Grid Fighting Cancer

The National Institutes of Health Cancer Biomedical Informatics Grid (caBIG) is one of the most exciting Grid deployments out there. There's a nice NIH Web site with background information. Quoting that site:

The National Cancer Institute (NCI) has launched the caBIG™ (cancer Biomedical Informatics Grid™) initiative to speed research discoveries and improve patient outcomes by linking researchers, physicians, and patients throughout the cancer community. caBIG™ is a voluntary network of infrastructure, tools, and ideas that enables the collection, analysis, and sharing of data and knowledge along the entire research pathway from laboratory bench to patient bedside.

I like caBIG for two reasons. The first is my family history of cancer )-:. The second is that they are one of the most ambitious and successful users of Globus software that I know. There are more than 800 people working on 70 projects within caBIG, and every caBIG Web Service is built on Globus technology. The caBIG software distribution uses just about every Globus component. In addition, caBIG has developed some nice new functionality to Globus, including:

I'm as much a fan of the search for the Higgs boson as anyone, but there is something to be said for finding a cure for cancer!

October 05, 2006

Quantifying the Benefits of Cyberinfrastructure

We need to find a way of quantifying the benefits of "cyberinfrastructure"--the technology that underpins and enables eScience. We need this information if we are to justify spending on infrastructure (or not), decide what infrastructure to build, and understand how to improve the infrastructures that we have.

But quantifying benefits is hard.

An anecdote: In building the Globus-based Earth System Grid (ESG: see the picture for participating sites) we put a lot of effort into instrumentation and quantifying usage. Thus we know that our more than 3000 registered users have downloaded more than 100 Terabytes of climate simulation data. Yet this data does not provide any real insight into whether the people downloading that data found it useful--or did anything useful with it. We did survey users, and got useful information, but response rates were low.

Fortunately, one of the two data collections made accessible via ESG was the Intergovernmental Panel on Climate Change (IPCC) assessment simulation data, and the IPCC team was able to document that over 300 scientific papers had been produced [by early 2006] from data downloaded from ESG.

However, we can't always get such nice data. Thus, we may ask: What metrics are important? What data do we need? What is feasible to get? How do we get it? What can it tell us (and what not)?

I think we need to learn how to build infrastructures that can collect this sort of information automatically. We should involve social scientists in designing such systems and in assessing their effectiveness.

October 03, 2006

Who Invented Hypertext?

It's always fun to find that ideas we think are unique to our generation are in fact far older. For example, who invented hypertext?

Many might assert that it was Tim Berners-Lee, with his invention of the Web (1989). But while Sir Tim did (and continues to do) many wonderful things, the idea of hypertext greatly predates the Web.

Other common replies, at least among technologists, might be Ted Nelson, who in his book Literary Machines (1983) and his ambitious but ultimately unsuccessful Xanadu system pioneered many relevant ideas, and Doug Engelbart, who pioneered hypertext and many other things besides.

Historians of science are likely to cite Vannevar Bush's As We May Think (1945), which is notable as a description of a hypertext system that (essentially) predated computers, and influenced Nelson and Engelbart.

There are other precursors, but (getting to the punchline), I learned at a recent workshop of the work of the Belgian Paul Otlet, who from 1895 onwards described and built systems that (using cards, not computers) introduced ideas that (now quoting Wikipedia):

prefigured what ultimately became the World Wide Web. His vision of a great network of knowledge was centered on documents and included the notions of hyperlinks, search engines, remote access, and social networks. (Obviously these notions were described by different names.)

If he's in Wikipedia, he can't be that obscure (can he?), but this was all news to me.

If you want to learn more, there's a biography, written by W. Boyd Rayward, then at the University of Chicago. But Otlet's primary works remain untranslated.

October 02, 2006

History and Theory of Infrastructure

I'm just back from a workshop on "History and Theory of Infrastructure: Lessons for New Scientific Infrastructure" in Ann Arbor, Michigan, which brought together a fascinating group of social scientists and others to discuss "what practical lessons can the history, sociology, and experience of existing infrastructures offer to the imagination, implementation, and governance of cyberinfrastructure."

One delightful aspect of the meeting was meeting wonderful scholars that I had known previously only by reputation, such as Geoff Bowker, Leigh Star, Paul Duguid, and Christine Borgman, as well as some I already knew, such as Tom Finholt, Bob Kahn, Dan Atkins, and Bill Dutton, and others that I was glad to get to know.

There were many fascinating and wide-ranging discussions. My impressions:

  • Social scientists (or at least those at the University of Michigan's School of Information) organize great meetings. The organizers had clearly put a lot of thought into how to structure the meeting to ensure useful discussion, and they also had excellent social events!
  • The mode of discussion was quite different from what I expected. There were no formal presentations and little analysis, but many compelling anecdotes. At first, I found this strange, but then realized that "stories" are a compelling way of conveying insights. That got me thinking: what "stories" should we be telling people embarking on cyberinfrastructure projects, to help them avoid mistakes and achieve success?
  • Another thought that seemed interesting, at least to me: How about designing cyberinfrastructure to collect the information that social scientists require to evaluate its utility? Large systems like TeraGrid, Open Science Grid, Earth System Grid, caBIG, or GEON, and also smaller systems, could be viewed as experimental apparatus for social scientists. What instrumentation should we include in them to that end?

Overall, I didn't come away convinced that the history of existing infrastructures can help those building cyberinfrastructure: railroads and networks are very different things. But I became yet more convinced that social scientists have a lot to contribute to our understanding of how science and its tools will, and should, evolve in the 21st Century.

October 01, 2006

Rapture for Nerds

I have long been fascinated by apocalyptic and millennial thinking: belief systems in which the world is about to be changed in some fundamental way by a transformative event of an esoteric nature. Typically:

  • The transformation will usher in an era of prosperity, peace, and immortality.
  • Only a select few will get to participate.
  • The transformation will occur within a small number of years: certainly within the lifetime of those involved, and often on a specific date.

Throughout human history we find hundreds of examples of groups who have believed that they possessed information regarding such an imminent transformation. The recurrence of this idea surely tells us something profound about the human spirit.

I was reminded of this topic by "Radical Evolution," Joel Garreau's interesting book about potential futures. The book presents the views of those who predict a potential "singularity": a time at which, due to continued exponential growth in computer power, we obtain computers able to design yet more powerful computers, and thus enter into an era of essentially infinitely rapid change in technological capability. These developments also enable superhuman intelligence, medical advances, thus eternal life, etc., etc.--but only for those prepared to take advantage of these advances.

I've always found the similarities between the "singularity" and millennial ideas intriguing. Others have apparently thought the same, and furthermore coined the beautiful put-down "Rapture for Nerds." Now of course either the singularity or the rapture (or both) may turn out to be quite real, but the similarities between the two concepts are certainly cause for thought.

September 30, 2006

Mapping in Time and Place

I had an interesting conversation today with Michael Buckland about the importance of mapping historical cultural data to time and place. Most documents refer to place names, which may be ambiguous (e.g., country names come and go, town names change or are reused), and refer to time in similarly ambiguous ways (e.g., "last year", "during the summer", "when I was 10", "after the war"). If such references can be disambiguated, then it becomes possible to see connections that might not otherwise be visible.

Michael Buckland directs the Electronic Cultural Atlas Initiative (ECAI), an international project to develop and distribute digital data on historical and archaeological resources. To this end, they are working to "create digital maps that display a wide range of cultural material by using place and time as a common element."

Apparently current Geographical Information System (GIS) tools just don't deal with time in an adequate way. One exception is the University of Sydney's TimeMap system, which ECAI uses.

I've always loved maps, and we are seeing from recent innovations such as Google Maps just how powerful it can be to enable easy mapping of diverse data to geographical space. But I had never thought about the temporal dimension.

September 29, 2006

New Zealand Gets Wired

Having grown up in New Zealand, I am delighted that the country finally has a high-speed research and education network, the Kiwi Advanced Research and Education Network (KAREN). Officially launched on August 31, this network links all of the major research institutions via a 10 Gbit/sec backbone.

The creation of a decent research infrastructure for New Zealand has taken a while. It's always going to be a challenge linking a country in which just 4 million people are spread over a fairly large area. However, while New Zealand has long had a high penetration of Internet technologies, things have been made worse by a lack of investment in research over the past 20 years, and by policies that have encouraged competition rather than cooperation among research universities and laboratories. Fortunately, these policies seem to be changing.

I've been thinking about these things since 2004, when I visited New Zealand and gave a series of talks to people involved in planning research infrastructure. I quoted Woody Allen: "80% of success is showing up", and pointed out that while the world is shrinking rapidly, it is not doing so uniformly. I noted that in 2004, I could send 1 terabyte (1 trillion bytes) to Geneva from Chicago in 20 minutes, but it took me four hours to download 1 megabyte (1 million bytes) from Chicago to Wellington. This difference reflects what we might call the dirty underside of exponentials: if network speeds are doubling every nine months, then a mere 10-year lag in network deployment means you are 10,000x slower than the competition. And in a world where one's ability to compete depends on access to information and colleagues, that difference can be fatal. Thus it's exciting to see that New Zealand has caught up--at least for a while.
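The 10,000x figure follows directly from the doubling time. A minimal sketch of the arithmetic, assuming a steady nine-month doubling over the whole period (the function name is illustrative only):

```python
def lag_factor(lag_years, doubling_months=9.0):
    """How many times slower you are after falling `lag_years` behind,
    assuming speeds double every `doubling_months` months."""
    doublings = lag_years * 12.0 / doubling_months
    return 2.0 ** doublings


print(f"10-year lag: about {lag_factor(10):,.0f}x slower")
print(f" 5-year lag: about {lag_factor(5):,.0f}x slower")
```

Ten years is a little over 13 doublings, which works out to slightly more than 10,000x. Even a five-year lag costs about a factor of 100.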

I also spoke during that visit of the limiting effect of what I termed "PC Science," i.e., science scaled to fit on one's personal computer. Such limited approaches constrain the questions asked and the answers obtained. They can also (I fear) limit one's ability to enlist the best students, who are looking for things that are exciting and cutting edge. Fortunately, once you have high-speed networks, it becomes far more feasible to link users with clusters, supercomputers, databases, and collections of PCs to provide access to powerful computational capabilities. Thus I am also pleased to see my alma mater, the University of Canterbury, acquire a powerful supercomputer.

September 26, 2006

The Many Faces of IT as Service

While trying to define Grid may well be a hopeless task, it is certainly useful and feasible to talk about the different elements of the service-oriented ecosystem. That is what Steve Tuecke and I did in a recent article, "Describing the Elephant: The Different Faces of IT as Service."

The introduction to this article explains what it is about:

In a well-known story, a group of blind men are asked to describe an elephant. Each encounters a different part of the animal, and not surprisingly provides a different description.

We see a similar degree of confusion in the IT industry today, as terms like service-oriented architecture, Grid, utility computing, on-demand, adaptive enterprise, data center automation, and virtualization are bandied about. As when listening to the blind men, it can be hard to know what reality lies behind the words, whether and how the different pieces fit together, and what we should be doing about the animal(s) that are being described. (Of course, in the case of the blind men, we did not also have marketing departments in the mix!)

Our goal in this article is to shed some light on these matters and provide, in effect, a description of the elephant. More specifically, we describe what we see as a major technology trend that is driving many related efforts, namely the transformation from vertically integrated silos to horizontally integrated, service-oriented systems. We explain how various popular terms relate to this overarching trend, and describe the technology required to realize this transformation.

As does the summary:

We have argued that SOA, grid, on-demand, utility computing, software as service, and other related terms all represent different perspectives on the same overall goal—namely, the restructuring of enterprise IT as a horizontally integrated, service-oriented architecture. If successfully realized, that goal will see in-house, third-party, and outsourced applications all operating in a uniform environment, with on-demand provisioning of both in-house and outsourced hardware resources—and also, of course, high degrees of security, monitoring, auditing, and management.

This Holy Grail of open, standards-based, autonomically managed software and dynamically provisioned hardware has certainly not yet been achieved. That does not mean, however, that enterprises cannot start today to create horizontally integrated, service-oriented infrastructures. Solid Web services products allow for the creation of service-oriented applications. Mature commercial and open source virtualization and workload management products and open source grid infrastructure software provide what is needed to create horizontally integrated infrastructure to sit behind those applications. Integration remains more of an exercise for the customers (or their services vendors) than is desirable, but that situation should change as independent software vendors start to grid-enable their products. Meanwhile, progress on further standards is accelerating as experience is gained with deployments and pressure builds from end users for interoperable solutions.

September 23, 2006

What is the Grid, anyway?

I was recently asked to provide a definition of "Grid" for the layman. I wrote a piece a while back on "what is the grid." I still like that definition--although I've also decided that trying to define such things is a hopeless task. But here goes another attempt.

Having bought a new toaster, we simply plug it in: the electric power grid obviates the need to also buy and install a new generator. By analogy, information technologists refer to "the grid" when talking about on-demand computing.

Like its namesake, a grid is a mix of technology, infrastructure, and standards. The technology is software that allows resource providers (whether individuals or institutions) to federate computers, storage, data, networks, and other resources, and for resource consumers to harness those federated resources when needed. The infrastructure comprises the physical hardware and services that must be maintained and operated for this resource federation and access to occur. Finally, standards codify the messages that must be exchanged, and the policies that must be followed, to achieve those goals.

There is a subtle but important distinction between "a grid" and "the Grid." Any system that allows for resource federation and on-demand access is arguably a "grid", whether general-purpose or application-specific, small or large. The Grid, like "the Internet", denotes the global set of computers that speak the same protocols.  In that sense, "the Grid" is a work in progress, as relevant standards continue to be codified and adopted.

September 21, 2006

Data Analysis Challenges

An important trend with broad implications is the extent to which data analysis tasks are becoming computationally demanding. The problem is that data volumes are growing exponentially, driven by Moore's law; meanwhile, many interesting analyses depend on the intercomparison of data items, and thus have a cost that grows faster than linearly with the amount of data. Thus even exponentially improving processors can't keep up. The fact that storage costs are currently decreasing faster than computing costs makes things worse.
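A rough sketch of this argument in numbers, under purely illustrative assumptions: the data volume and the processor speed both double in each period, and the analysis compares every item against every other item. The function names and starting values below are hypothetical and are not taken from any real system.

```python
def pairwise_comparisons(n_items):
    """Number of distinct pairs in an all-against-all comparison of n_items."""
    return n_items * (n_items - 1) // 2


def analysis_time(doublings, initial_items=1000, initial_speed=1.0):
    """Time for an all-pairs analysis after a number of doubling periods.

    Assumption (illustrative only): data volume and processor speed
    both double in every period.
    """
    items = initial_items * 2 ** doublings
    speed = initial_speed * 2 ** doublings
    return pairwise_comparisons(items) / speed


if __name__ == "__main__":
    for k in range(6):
        # The work grows roughly fourfold per period, the speed only twofold,
        # so the elapsed time roughly doubles each period.
        print(k, analysis_time(k))
```

Even with processors that keep pace with the data, the time for the all-pairs analysis roughly doubles in each period, which is the sense in which exponentially improving hardware cannot keep up.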

We see the impact of these issues in the attached figure, from a nice article by Folker Meyer of the Argonne/Chicago Computation Institute in CTWatch, which contrasts the number of genetic sequences obtained with the number of annotations generated. The issues here are not solely computational, as many annotations are generated manually. Nevertheless, it is striking to see how fast we're falling behind.

As always, a solution to this problem will need to combine improvements in hardware, software, and algorithms:

  • Hardware: Because individual devices aren't getting faster particularly rapidly, we will see increasing parallelism in storage, computers, and networks. We hear about these trends a great deal at places like Google, but they are becoming widespread.
  • Software: As the number of devices and the amount of work to do both increase, software needs to get smarter. We need to orchestrate massively parallel computations across many devices and manage the flow of data into and out of (and among) those computations--and, wherever possible, avoid performing computations by caching and other techniques.
  • Algorithms: Neither hardware nor software improvements can overcome the basic exponentials. Thus we need better algorithms. Probabilistic algorithms that perform sampling to extract "good enough" knowledge will become important. So will the ability to evaluate how "good" a particular conclusion really is.
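As a minimal sketch of the sampling idea in the last bullet, the following estimates the fraction of items that satisfy some property by examining only a random sample, and reports an approximate 95% margin of error as one simple measure of how "good" the answer is. The function name, the predicate, and the use of a normal approximation are all assumptions made for illustration.

```python
import math
import random


def estimate_fraction(items, predicate, sample_size, seed=0):
    """Estimate the fraction of items satisfying predicate from a random sample.

    Returns (estimate, margin), where margin is an approximate 95%
    margin of error based on the normal approximation.
    """
    rng = random.Random(seed)
    sample = rng.sample(items, sample_size)  # items must be a sequence
    hits = sum(1 for item in sample if predicate(item))
    estimate = hits / sample_size
    std_error = math.sqrt(estimate * (1.0 - estimate) / sample_size)
    return estimate, 1.96 * std_error


if __name__ == "__main__":
    data = list(range(100000))
    estimate, margin = estimate_fraction(data, lambda x: x % 4 == 0, 2000)
    print(f"estimated fraction: {estimate:.3f} +/- {margin:.3f}")
```

Here only 2,000 of 100,000 items are examined, and the margin of error shrinks with the square root of the sample size rather than depending on the total amount of data.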

September 20, 2006

Earth System Grid

I'm at the kickoff meeting for the next phase of the Globus-based Earth System Grid (ESG), a U.S. Department of Energy project developing technology to manage and provide access to large quantities of climate simulation data. The two ESG  portals provide access to more than 100 terabytes of output from U.S. and international climate models. The 4000 registered users have so far downloaded more than 130 terabytes of data as they ask questions such as "why are hurricane intensities increasing." Just last year, these users produced more than 300 scientific papers based on ESG data.

In the next phase, we face big challenges as the quantity of data increases (new petaflop/s computers will generate 10-100 times more data), data becomes more distributed (it can't all be moved to a central location, as at present), the user population becomes larger and more diverse (including, e.g., policy analysts as well as climate scientists), and the sophistication of the data analyses to be performed increases.

One important trend will be increased focus on server-side analysis: as data volumes increase, users must be able to request that data be processed at the data location rather than downloaded to their local system. They need access to data analysis services as well as data download functions, so that they can ask "compare the power spectrum of sea surface temperature in the Nino-3 region from these 10 models" rather than "download ocean temperature data for those models for a 100-year simulation period." Needless to say, server-side analysis of petabytes of data is not easy. We'll be working in the coming months to add such capabilities to ESG.

If you want to learn more, here is a fairly recent article on ESG architecture and implementation. Globus technology is used for data access, authentication and authorization, distributed system monitoring, and other purposes.

I see ESG as a premier example of service-oriented science--and also a success story for Grid technology.

September 12, 2006

Globus turns 10: Time for Celebration and Reflection

The following is the text of an article that I wrote for GridToday on Globus' 10th birthday, which we celebrated yesterday in Washington DC.


Globus Turns 10: Time for Celebration and Reflection

The GlobusWORLD conference being held (jointly with GridWorld and the Open Grid Forum) this week in Washington, D.C., is a significant milestone for those involved in the development and use of the Globus open source Grid software. The reason is that it was 10 years ago (to be precise, on Aug. 21, 1996) that Carl Kesselman and I received our first funding for work on Globus, from DARPA. Gary Minden and Mike St. Johns were our enlightened program managers, followed by Gary Koob. I must also recognize the support of Bob Aiken, Tom Kitchens and, especially, Mary Anne Scott, then all at DoE.

Given this milestone, I will spend some time here recapping history and reflecting on where we have come and what we have learned.

A Little History

10 years is a long time: What on earth have we been doing over that period? Let's revisit some of the highlights.

The emergence of high-speed networks in the 1990s led to an awareness that the Internet could allow for more interesting applications than e-mail and file transfer. (Len Kleinrock had envisioned this possibility back in 1969, but it took a while to get there!) Efforts like the U.S. Gigabit testbed project, led by Bob Kahn, and the Supercomputing'95 I-WAY effort, led by Tom DeFanti and Rick Stevens, helped build awareness of these opportunities. This era also saw pioneering efforts such as the NSF Metacenter, led by Charlie Catlett and Larry Smarr, and Legion, led by Andrew Grimshaw. However, for the most part, every application was constructed from scratch.

We (in particular, myself, Carl and Steve Tuecke) studied this situation and saw a need for standards and software (middleware) to bridge the gap between applications and the complexities of a distributed resource environment. Thus, we started a research project aimed at defining this middleware. Believing strongly that we did not necessarily know the real problems, we started an iterative process of examining the requirements of collaborative communities, prototyping solutions to their problems and feeding back the resulting experiences into a next cycle of research and development. We called this project Globus because it built on earlier technology called "Nexus" and had global goals.

Back in 1996, our ambitions and the needs of our users were far greater than our resources -- a situation that persists today! -- and so it was challenging to develop software that was sufficiently stable and functional to allow for meaningful experiments. Fortunately, we found wonderful application partners -- people like Ed Seidel, Paul Messina and their colleagues, and later members of the high energy physics community -- who were prepared to work with often imperfect software and provide invaluable feedback.

Along the way, we achieved milestones that helped persuade ourselves and others that we had something useful. For example, 1998 saw Sharon Brunett, Karl Czajkowski and others achieve a record-setting military simulation involving 100,298 vehicles distributed over 13 supercomputers at nine sites. Gregor von Laszewski and others demonstrated real-time analysis of data from the Advanced Photon Source. At the SC'98 conference, we demonstrated the "Globus Ubiquitous Supercomputing Testbed Organization" (GUSTO) that spanned some 50 sites worldwide. NASA launched its Information Power Grid project, under the leadership of Bill Johnston.

By 2001, the year in which the TeraGrid was founded, we had software we felt was ready to operate in production environments, if only we could find friendly sites prepared to perform the needed integration, and application scientists ready to develop the necessary application software. In practice, we weren't as ready as we thought we were, but nevertheless we entered a stage -- of learning via experience about the mechanisms and policies required for operational use -- that to some extent continues today. We also received some nice recognition at this time: Globus Toolkit version 2 (GT2) played a key role in a Gordon Bell prize awarded at SC'01 to an astrophysics application that used Cactus, MPICH-G2 and Globus. The following year, R&D Magazine recognized GT2 with an R&D 100 award and named it the "most promising new technology" of the year.

In late 2001, IBM followed up its dramatic open source Linux strategy announcement with a similar announcement about the importance of Grid technologies. We were thrilled when IBM elected to work with us to develop the OGSI Web Services specification and the corresponding Globus implementation, which was released in 2003 as GT3. While this first Web services release provided only modest quality, it spurred much innovative work, such as the video distribution system developed by the Belfast eScience Center for the BBC (to give an idea of the scale of effort underway by this time, BeSC applications alone totaled 1.5 million lines of GT3 code, later adapted for GT4).

2005 saw the release of Globus Toolkit version 4 (GT4), which, thanks to the efforts of talented developers and the able leadership of Lisa Childers, exceeded all previous releases in terms of quality and rigor of both software and documentation. GT4 supports the construction of stateful and secure Web services in Java, C and Python; provides job submission, file transfer, credential management, registry and database access services; incorporates a powerful integrated security system; and provides many other features besides. 2005 and 2006 also saw significant new funding in support of the Globus science community, from the U.S. National Science Foundation's NSF Middleware Initiative (under Kevin Thompson), UK eScience program (for work on OGSA-DAI) and, most recently, from the U.S. Department of Energy's SciDAC program.

Where We Are Today

Someone once dismissed Grid as a "funding concept" -- a witty but irritating turn of phrase. I have not heard that expression lately: Grid is mainstream in both science and industry, and so many people are using Grid technology to solve real problems that it is hard to argue that it is not successful and useful. Indeed, we can make a strong case that Grid has had a significant impact on how people conceptualize and solve problems in many domains.

It is particularly pleasing to see the diversity of Globus application communities, which span, for example, astronomy (e.g., the LIGO gravitational wave observatory, the Caltech Montage service), bioinformatics (e.g., Natalia Maltsev's PUMA system), cancer biology (e.g., the National Institutes of Health's caBIG cancer bioinformatics Grid), data mining (e.g., work by Domenico Talia) and environmental science (e.g., C3grid in Germany and Earth System Grid in the United States). And that is just the first five letters of the alphabet.

I am also delighted with the geographical diversity of Globus deployments. We see substantial Globus deployments and applications on every continent except Antarctica, and just about every day I get e-mail from someone somewhere describing a new deployment of which I was not previously aware. Again, we can walk through the alphabet: Australia, Belgium, China (and Canada and Chile), Denmark, England, France, Germany, Hungary, Ireland, Japan, Korea, Luxembourg, Mexico, the Netherlands, ....

Another area in which we continue to see wonderful progress is in the range of "solutions" that leverage Globus software. Globus middleware does not address end-user requirements directly, but a wide range of Globus-based tools now exist for building portals (e.g., OGCE, GridPort, Jason Novotny and Michael Russell's GridSphere); executing workflows (e.g., Ewa Deelman and Mike Wilde's VDS, David Abramson's Nimrod, Miron Livny's Condor, BPEL); running parallel programs (e.g., Nick Karonis' MPICH-G2); delivering data (e.g., Ann Chervenak's DRS, Reagan Moore's SRB); operating instruments (e.g., Rick McMullen's Common Instrument Middleware Architecture project, GridCC in Europe); invoking remote services (e.g., Ninf in Japan); and so on. Lee Liming has done a nice job documenting these and other "solutions."

It is also pleasing to see the progress being made in industry. Steve Tuecke left Argonne in 2004 to form Univa Corp., which provides commercial support for Globus software and is building new products using Globus (disclaimer: I am also a Univa founder and advisor). They are discovering that the concerns of industry are increasingly similar to those of science, as the need to accelerate innovation processes leads to a need for dynamic resource sharing between organizational units.

I should also mention the progress made with standards. Globus contributors, notably Von Welch, played major roles in the Grid Security Infrastructure standard, which has been widely adopted. The same is true for GridFTP, under the leadership of Bill Allcock. The Job Submission Description Language (JSDL) and Basic Execution Service (BES) specifications, which seem likely to see wide adoption, build heavily on GRAM. Globus project members, notably Frank Siebenlist, have also contributed heavily to the increasingly important WS-Security, SAML2 and XACML specifications.

It is a nice coincidence, given our anniversary, that August saw the release of the WS-ResourceTransfer specification by HP, IBM, Intel and Microsoft -- perhaps signaling the end of a standards odyssey that began in 2001 when Steve Tuecke and others defined the Open Grid Services Infrastructure (OGSI). The goal was to codify Web services mechanisms for representing and accessing state, a requirement that appeared in many different contexts. Like Ulysses, we did not know we were embarking on an Odyssey when we began. However, the release of WS-ResourceTransfer -- remarkably similar to OGSI! -- suggests that we may soon reach this journey's end.

Also worthy of celebration is the tremendous growth in the size of the Globus developer community. In the beginning, there were just three of us, plus a few partners such as Craig Lee at the Aerospace Corp. The team grew over time, as talented researchers and developers joined us at Argonne, the University of Chicago and USC Information Sciences Institute, and then other organizations partnered with us, notably the National Center for Supercomputing Applications (Jim Basney, Von Welch and others), the University of Edinburgh (Malcolm Atkinson, Neil Chue Hong, Mark Parsons and others) and PDC in Sweden (Olle Mulmo and others). Most recently, the new dev.globus development process (modeled after that of Apache Jakarta) has partitioned Globus into dozens of independent projects, each with its own developers, and opened the way for new projects to join. The response has been enthusiastic: under the leadership of Jennifer Schopf, our new incubator process already has 11 incubator projects up and running.

Reflections

We have learned a tremendous amount in the past 10 years. It is hard to know where to start in terms of summarizing lessons learned, but here are a few thoughts.

We were clearly correct in identifying large-scale collaboration as an important problem, and in choosing science as a good place to start identifying requirements and experimenting with solutions. We have seen the need to federate data and computing, orchestrate the allocation of resources to different purposes and manage the policies that govern these activities become increasingly important, first across science and now in industry too. Indeed, these questions are arguably now central to the critical question of how innovation occurs within and across organizations.

Along the way, we have learned (and I am sure must continue to relearn) the need to evolve the software and to reinvent ourselves as both user requirements and the external technology environment evolve. For example, we adopted public key security technology early: a successful step, although the configuration tools needed for convenient use have taken time to emerge. We adopted LDAP as a directory service technology: less successful, and later abandoned. In 2002, we started a major shift to Web services technology: also a positive development overall, although we were arguably premature, given the maturity of Web services technologies at the time. In the future, we will need to respond to the emergence of commercial Web services, like Amazon's S3 and EC2 services, and to other developments that we have yet to recognize.

Our decision to pursue an open source approach and a non-viral license was also clearly correct. It was not necessarily the obvious choice back in 1996, and required a lot of hard work to define the necessary licenses and get the required approvals. (I realized just how much work when a lawyer asked Steve Tuecke, who handled much of the early work on licenses, if he had considered law school!) However, this choice has allowed us to scale the development team and user community in ways that would not have been possible with a proprietary solution. Our recent move to a pure Apache license is, I hope, the final culmination of this approach.

We have struggled with numerous issues over the years relating to the fact that any large-scale collaboration (and thus a grid) is a system and, as such, involves a great diversity of software, hardware, institutions and, above all, different people: users, tool developers, application developers, operations staff, security staff and others. The result is considerable complexity in terms of requirements and also significant challenges in how requirements and capabilities are communicated to different groups.

One inevitable consequence of this complexity is that Grid and Globus are not easily characterized, and thus we have struggled to overcome various misconceptions over the years. One is that Grid is somehow an alternative to high-end computing -- rather than an essential adjunct to high-end computing, enabling remote access and the distribution of the resulting data products. Another is that a Grid is about "free computing." A third is that Globus is a turnkey solution to Grid problems. We have been careful to emphasize that Globus is middleware, not application software, but we still hear complaints that "I installed Globus, but it didn't solve my problem."

I'd also say that we didn't internalize sufficiently at the beginning the extent to which Grid was a policy and operations problem. Fortunately, we've seen some wonderful people get involved with these issues, with the result that we have become increasingly good at creating and operating grids that work. Projects like EGEE, Open Science Grid and TeraGrid have taught us a lot.

In a different space, I remain concerned by the amount of redundancy and lack of interoperability that we see across the Grid community. Given the natural human enthusiasm for novelty (often encouraged by funding agencies and commercial pressures), this diversity is not a surprise. However, I expect that convergence will occur, as people come to understand the high cost of redundant effort, and the tremendous advantages of mature, robust, open source software.

Overall, though, the current situation and future prospects are incredibly encouraging and positive. The requirements that we set out to address with Globus 10 years ago have proved to be quasi-universal. It is no longer eccentric scientists and niche communities who use Grid technology, but mainstream science communities and (increasingly) commercial users. We have a set of technologies that, while certainly not a complete solution, address key requirements. We also see convergence on standards and increasingly broad adoption of those standards in both open source and proprietary software. Finally, and most important, we have a vibrant, sometimes contentious but always enthusiastic, international community of developers and users who are committed to moving the technology forward. We should all look forward to the 20th anniversary of Globus -- by which time, if the Internet is any guide, Grid technology will be ubiquitous.

In writing this document, I have tried to acknowledge some of the many contributors to Globus software, deployments and applications. I, of course, have omitted many more names than I have included. I hope that those omitted will forgive me, and that other readers will feel inspired to learn more about individual projects and those that made them happen.

Happy 10th birthday Globus!

September 11, 2006

dev.globus explained ...

An interview with Jennifer Schopf explains the "dev.globus" community development process that we established this year for Globus software.

September 09, 2006

HPC and Competitiveness

I participated on Thursday in a panel at the HPC Users Conference, run by the U.S. "Council on Competitiveness." I spoke on how the U.S. national laboratories can partner with companies in a mutually beneficial way. The panel reinforced for me some important points that I think need to be more broadly appreciated:

  • High-performance computing (HPC) is increasingly central to competitiveness, not just in traditional areas like aerospace and automotive, but also in new areas like corporate data mining and consumer product design. (Amusing success story: Procter & Gamble used a supercomputer to study the airflow over its Pringles potato chips to help stop them from fluttering off the company's assembly lines.)
  • Successful lab-industry partnerships can be about far more than access to supercomputers--they can involve codevelopment of advanced software systems. For example, Terry Talley spoke about how Acxiom (they have your credit card data, if you live in the U.S.) had worked for four years with the PVFS team at Argonne.

The two industrial participants in the panel were interesting. Terry Talley talked about how Acxiom is using Grid computing internally. The CTO from DreamWorks talked about the amount of computing involved in modern animated features: 10,000,000 CPU hours for Shrek1, 15,000,000 CPU hours for Shrek2, and so on. He also talked about how they are using DOE supercomputers in an exploration of interactive (instead of overnight) rendering. So even the most advanced users can imagine using computers in far more powerful ways.

September 08, 2006

Attribute-Based Authorization on TeraGrid

Charlie Catlett writes about plans to deploy attribute-based authorization on TeraGrid. It is neat to see people working to make national-scale authentication and authorization work.

September 06, 2006

SOS in Japan

I participated in an interesting conference in Japan last week, "The Fusion Between Policy Science and Information and Communication Technology," that brought together social scientists and computer scientists. I particularly enjoyed a talk by John Zysman and the panel discussion on whether and how advanced computing can help policy (more on those topics later, perhaps).

I was asked to speak on "Scientific Impact of Grid Computing." I enclose my talk below.

Scientific Impact of Grid Computing

The subject of my remarks today is the impact of grid computing on science. I will first provide some context, reviewing how the nature of science is changing as a result of (among other things) technological developments. I will then explain the relevance of Grid technologies to these developments and review experiences to date with the application of those technologies, and finally I will talk about how the advent of service oriented approaches promises (in my view) to transform many aspects of scientific research in the future.

First, context. We are talking today about grid and science for a reason, and that reason is the sustained exponential change in technology that has over the past 50 years been producing ever more data, enabling ever more computing, and connecting us all ever more closely.

The consequence of these developments is not only quantitative but also qualitative changes in how we tackle some of the most challenging and urgent scientific problems of our age, from climate change to disease. Increasingly, research involves the analysis of large quantities of data, large-scale numerical simulation, and intensive and interdisciplinary collaboration. The technologies that we used previously to store, transmit, process, and communicate data—workstations, DAT tapes, Fedex, even scientific journals, some would argue—are no longer as effective as they were.

We also see the emergence of new research methodologies and organizational structures, as evidenced by this image of the collaboration that is building the Large Hadron Collider at CERN. In this project, which is not so different in broad strokes from those that sequenced the human genome or that managed the response to SARS, we have different overlapping groups of varying sizes, some sharing data, some competing, all ultimately contributing to the solution of the problem at hand.

These developments have many profound implications for research methodologies, education, resource allocations, and so forth. In particular, they demand information technology infrastructures, and Grid is part of this emerging new technology landscape.

I should also note, just as an aside—but an important aside—that as science becomes more information intensive, so the importance of computer science increases. Astronomer George Djorgovski goes so far as to claim that, “applied computer science is now playing the role that mathematics did from the 17th through the 20th centuries: providing an orderly, formal framework and exploratory apparatus for other sciences.” As a computer scientist, that message appeals to me!

In this context, then, Grid has come to play a valuable role as a unifying concept and technology for applications that require the federation of resources (computers, storage, data, people, etc.). Why the name “grid”? Having bought a new rice cooker, we simply plug it in: the power grid obviates the need to also buy and install a new electrical generator. By analogy, information technologists refer to “the grid” when talking about on-demand computing.

Like its namesake, a grid is a mix of technology, infrastructure, and standards. The technology is software that allows resource providers to federate computers, storage, data, networks, and other resources, and for resource consumers to harness those federated resources when needed. We can categorize this software as “system-level” (software that implements common management interfaces to underlying resources, such as the open source Globus software that I have been involved in developing) and “user-level” (such as the Ninf software from Japan). Together, these two kinds of software bridge the gap from applications to resources.

The infrastructure comprises the physical resources and services that must be maintained and operated for this resource federation and access to occur. Examples of services include registries and certificate authorities.

Finally, the standards codify the messages that must be exchanged, and the policies that must be followed, to achieve those goals.

Together, the technology, infrastructure, and standards allow us to bridge the otherwise substantial gap between applications and the physical computers, storage systems, networks, and other devices that those applications need to operate.

Let us move on now to how Grid technology is being used in science. Today, this use is primarily directed at enabling, as I have indicated, on-demand access to computing, storage, and other devices. For example, the U.S. Network for Earthquake Engineering Simulation (NEES) implements service interfaces that allow for remote access to, and monitoring and control of, experimental apparatus for earthquake engineering as well as simulation codes and data archives. NEES has been used to conduct distributed hybrid experiments, in which components of a large structure are tested via a mix of numerical simulation and physical simulation at different sites. This is a technique pioneered in Japan, by the way. NEES is transforming the nature of earthquake engineering research in the U.S.

The Earth System Grid provides access to large climate model datasets such as those produced for the Intergovernmental Panel on Climate Change assessment. The substantial impact of this service on the climate research community is indicated by the large number of users, the amount of data downloaded, and the number of resulting research articles.

The TeraGrid is the premier U.S. “cyberinfrastructure,” to use a term popular in the U.S. TeraGrid links supercomputers and storage systems at eight sites with an extremely fast network, and deploys standard Grid software across these resources so that scientists can obtain large amounts of computation and storage when required to support their science.

By thus standardizing on interfaces and policies, TeraGrid seeks to transform its diverse sites and computers into interchangeable providers of computing power. An application (for example, a medical data analysis application) can then acquire needed computing, storage, and network capacity to achieve its scientific objectives.

Increasingly, TeraGrid is being viewed as a system that does not simply provide computing resources for individual scientists, but also hosts services for communities. This emerging new role is significant and I believe will result in a considerably greater impact on the scientific community.

For example, PUMA is an information system that provides access to data computed by integrating genomic and proteomic data. To its several thousand users, it is simply a Web site. However, behind the scenes, PUMA is making extensive use of TeraGrid and other Grid infrastructures to perform its data integration. Indeed, PUMA code routinely runs on 1000 processors when integrating new data.

As these examples show, Grid as a technology for on-demand access to computing is already widely deployed, and is having a significant impact on science in numerous fields. Nevertheless, I believe that these successes are only a first step towards a far greater impact on science. This leads me to the third part of this talk, in which I discuss what I see as the next major thrust for grid computing and for science as a whole.

In traditional approaches to research, communication among researchers occurs primarily via publication in peer-reviewed journals. Information technology may play a role as a tool during the research process, but does not change the nature of this communication process.

What I call service-oriented science adds a new modality of communication, namely the creation of computational services—that is, network-accessible programs that implement a convenient interface and that provide access to data and/or computational capabilities.

Such services allow for new research methodologies, as follows. Someone publishes a service: for example, PUMA, which, as I described earlier, provides access to derived data products—or, perhaps, to an enhanced PUMA that allows its clients to supply their own genomic data to be integrated with that maintained by PUMA.

Another researcher discovers that service and uses it in their research. In a first instance, they may simply query PUMA from their Web browser. However, as they get more ambitious, they may also compose calls to PUMA with calls to other services (for example, a service for computing metabolic pathways) in what we call a workflow. In this way, they can scale up dramatically the number of questions that they can ask and get answered. This automation of data analysis tasks is an important consequence of service-oriented science.

Even more interesting is what can happen next. The researcher may decide that this workflow that they have developed captures a broadly useful analysis technique, and decide to publish that workflow as a new service that may itself be discovered and called by others. Thus we may achieve a virtuous circle of innovation.
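The publish, discover, compose, and republish cycle just described can be sketched in a few lines. This is only a toy illustration: the in-memory registry, the service names, and the two stand-in functions are hypothetical, and they do not represent the interfaces of PUMA or of any real Grid service.

```python
# Toy in-memory registry standing in for a service discovery mechanism.
registry = {}


def publish(name, service):
    """Make a service (here, just a callable) discoverable under a name."""
    registry[name] = service


def compose(*names):
    """Build a workflow that calls the named services in sequence."""
    def workflow(query):
        result = query
        for name in names:
            result = registry[name](result)
        return result
    return workflow


# Two hypothetical services published by different researchers.
def lookup_genome(organism):
    return f"genome({organism})"


def compute_pathways(genome):
    return f"pathways({genome})"


publish("genome_lookup", lookup_genome)
publish("pathway_analysis", compute_pathways)

# A researcher composes the two services into a workflow, then publishes
# that workflow as a new service that others can discover and call.
publish("organism_pathways", compose("genome_lookup", "pathway_analysis"))

if __name__ == "__main__":
    print(registry["organism_pathways"]("E. coli"))
```

The point of the sketch is that a published workflow is indistinguishable from any other service, so it can itself be composed into further workflows.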

The astronomers have been pioneers in the adoption of service-oriented science techniques. If you are not familiar with what they are doing, I encourage you to study it: it is very impressive.

So-called virtual observatories are providing on-line access to digital sky surveys at different wavelengths, thus allowing astronomers to ask sophisticated questions from the comfort of their desks: for example, what objects are visible in the infrared but not the optical? (The answer to this query can identify candidate brown dwarfs, a class of star identified only recently.