The authors of a recent OGF document, "Using Clouds to Provide Grids Higher Levels of Abstractions and Explicit Usage Modes", make several assertions to which I take exception:
1) "There is a level of agreement that computational Grids have not been able to deliver on the promise of better applications and usage scenarios."
It is fascinating to watch the Gartner hype cycle in action, if sad to see people stuck in the trough of disillusionment. But the fact is, fortunately, that there are substantial grid projects and applications that are having considerable success. Ones that come immediately to mind are the Earth System Grid, the cancer Biomedical Informatics Grid, and the LIGO Scientific Collaboration; and as the LHC was switched on just yesterday, we should also recall the remarkable successes of the LHC Computing Grid and its partner projects such as Open Science Grid. At a different level, Globus people will be happy to talk about the millions of files moved via GridFTP every day, and Miron Livny will be happy to talk at length about how many millions of CPU hours are delivered every day via Condor.
2) To address this purported lack of success, "there is a need to expose less detail and provide functionality in a simplified way. If there is a lesson to be learned from Grids it is that the abstractions that Grids expose – to the end-user, to the deployers and to application developers – are inappropriate and they need to be higher level."
No evidence is provided for this assertion that complex interfaces are the reason for the difficulties people have with grids. I argue that the issues are more complex.
First, the interfaces themselves are not, in my view, a significant issue. We can argue whether we prefer REST or Web Services, or say Nimbus (a grid virtualization interface) or EC2 (a cloud virtualization interface), but the differences among these alternatives are not great.
On the other hand, the economic systems that apply in the two cases are extremely different:
- Amazon services are designed to support the masses; they have no political constraints on whom they can provide service to, and their charging model provides strong returns to scale; thus, Amazon can focus on, and succeed in providing, modest-scale, reliable, on-demand service to many.
- TeraGrid (to use a US example) is designed to support a small number of extreme computing users, with a negative return to scale (the more users, the more work for fixed budget); thus, they are not motivated to provide virtualization solutions or to operate highly reliable remote access interfaces.
The implications of these different foci for users are tremendous. On EC2, I give my credit card and start a VM--a few seconds. On TeraGrid, I request an allocation (which may not be granted!), get an account, submit a request to run a job (they won't allow me to start a VM), wait in the queue--a many-week process. Furthermore, I sometimes find that the remote access interfaces fail because keeping them running is not high priority.
This alternative perspective is, I think, more revealing about the sources of the differences and the ways we might address them. If we want on-demand, high-quality compute and storage services, then we need either to create an economic system in which academic providers are motivated to provide such services, or decide to outsource to industry.
The importance of higher-level interfaces is a separate issue. Yes, tools like Hadoop and Swift for data analysis, Introduce for service authoring, and Taverna for service composition are important and necessary. Yes, we should be hoping to leverage and influence work done in the far larger corporate market to our advantage. (A focus of the upcoming CCA workshop.)
3) "Grids as currently designed and implemented are difficult to interoperate." The authors make a big deal of this point, but it is not clear to what purpose.
It is true that interoperation is not automatic. [If only everyone used Globus software, then all would be well :) --although of course the policy issues would remain!]. But I am not sure that this is a significant problem for users, or hard to achieve when it is needed. E.g., the caBIG team recently demonstrated a gateway to TeraGrid. The LHC Computing Grid integrates resources worldwide. Etc. Most users never ask about interoperability, in my experience.
> The authors of this document make several assertions with
> which I take exception:
Dear Ian,
first of all, I am simply delighted that you found the time to
comment on our paper! Your remarks are very much appreciated!
Disclaimer: this answer is my personal take on your comments,
and may very well disagree with Geoffrey's and Shantenu's
opinion (in fact I know that Shantenu partly disagrees).
> 1) "There is a level of agreement that computational Grids
> have not been able to deliver on the promise of better
> applications and usage scenarios."
>
> It is fascinating to watch the Gartner hype curve in action,
> if sad to see people stuck in the trough of despondency. But
> the fact is, fortunately, that there are substantial Grid
> projects and applications that are having substantial success.
> Ones that come immediately to mind are the Earth System Grid,
> cancer Biomedical Informatics Grid, and the LIGO Scientific
> Collaboratory, but as it is today that the LHC was switched
> on, we should also recall the remarkable successes of the LHC
> Computing Grid. At a different level, Globus people will be
> happy to talk about the millions of files moved via GridFTP
> every day, and Miron Livny will be happy to talk at length
> about how many millions of CPU hours are delivered every day
> via Condor.
Yes, there are wonderful examples of successful Grids (a side
note: most of the successful ones are narrow Grids)! We are
not arguing that Grids failed (remember that we are active in
OGF, that we code Grid APIs, that we support Grid application
developers, that we try to improve the LONI and TeraGrid
experience, etc.).
It cannot have escaped your attention, though, that a large
number of user groups which had hoped to flexibly and
ubiquitously harvest compute power by 'simply plugging into the
Grid' are no longer even discussing that topic. We lost a very
large potential user community, and continue to do so. Positive
but anecdotal examples do not change that.
One can confirm this simply by walking over to some random
human sciences department (or medical, or agricultural, or art,
or design, or, in general, not-natural-science department) at a
university of your choice, and asking a random staff member (not
faculty) how she used 'the Grid' last week. Alas, it is very
unlikely that this question will yield an answer confirming the
triumph of Grids...
> 2) To address this purported lack of success, "there is a need
> to expose less detail and provide functionality in a simplified
> way. If there is a lesson to be learned from Grids it is that
> the abstractions that Grids expose – to the end-user, to the
> deployers and to application developers – are inappropriate
> and they need to be higher level."
>
> No evidence is provided for this assertion that complex
> interfaces are the reason for the difficulties people have
> with Grids. I argue that the issues are more complex.
You may be right: I might just be stuck in the 'trough of
despondency', and may simply not be up to date with the state of
affairs. And true, we don't provide much evidence -- we should
address this; thanks for these comments in particular!
It is, however, not the topic of the paper to evaluate the
performance of Grids -- that is merely part of the motivation
for our approach, and an observation we make. Thus, the comments
below are not really related to the published paper, but are
rather a direct answer to your blog post.
And, yes, issues are almost always more complex...
> First, the interfaces themselves are not, in my view, a
> significant issue. We can argue whether we prefer REST or Web
> Services, or say Nimbus (a Grid virtualization interface) or
> EC2 (a cloud virtualization interface), but the differences
> among these alternatives are not great.
It is not about the technology (REST vs. WS vs. Nimbus, etc.),
it is about the level of detail being exposed. For example,
Globus (especially chosen for you :-) :
- go to the Globus 4.0.2 WS API documentation, Java version.
- pick 4 sections out of 52(!) (e.g., those ending in
'_client_java').
That yields a total of 30 classes, with a total of ~170 methods
(not counting c'tors, inherited methods, etc.). Assuming that
the pick is representative, for 52 sections one would see >2,000
(!) calls.
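(The arithmetic: ~170 methods across 4 sections is roughly 42
methods per section; times 52 sections, that extrapolates to
some 2,200 calls.)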
I know _I_ would have trouble remembering after one day that
org.globus.exec.client.GlobusRun.kill()
takes a string as argument, and
org.globus.exec.client.GlobusRun.terminateJob()
takes a GramJob instance. Or to remember the name of this
function:
GramJob.populateStagingDescriptionEndpoints()
Do I need to call that method? When? Why? Does that call work
against Globus-2.x? Globus-4.0? Globus-4.2? (Gram versions
changed w/o being backward compatible).
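To make that concrete, here is roughly what those two kill paths
look like next to each other. This is a sketch only: the two
signatures are the ones quoted above, but the package placement
of GramJob, the static-ness of the calls, and the handle format
are all my assumptions:

  // Sketch, not verified against any Globus release; only the
  // two signatures quoted above are taken as given.
  import org.globus.exec.client.GlobusRun;
  import org.globus.exec.client.GramJob;  // package placement assumed

  public class WhichKill {
      public static void main(String[] args) throws Exception {
          String handle = args[0];      // some opaque job handle string
          GlobusRun.kill(handle);       // this one takes a String...

          GramJob job = new GramJob();  // no-arg c'tor? an assumption
          GlobusRun.terminateJob(job);  // ...this one takes a GramJob
      }
  }

Two calls, two different ways of identifying the same job.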
And, just for fun, from another package:
setManagedJobPortTypePortWSDDServiceName (java.lang.String)
:-))
Let's compare that to Amazon's cloud API: it has 34 API calls
(there are no inherited calls). About half of the calls are for
setting descriptions, setting securities, etc., so roughly half
remain for actually managing instances.
> On the other hand, the economic systems that apply in the two
> cases are extremely different:
>
> * Amazon services are designed to support the masses, they
> have no political constraints on who they can provide service
> to, and their charging model provides strong return to scale;
> thus, Amazon can focus on, and succeed in providing,
> modest-scale, reliable, on-demand service to many.
>
> * TeraGrid (to use a US example) is designed to support a
> small number of extreme computing users, with a negative
> return to scale (the more users, the more work for fixed
> budget); thus, they are not motivated to provide
> virtualization solutions or to operate highly reliable remote
> access interfaces.
I may well be misunderstanding the mission of TeraGrid (and who
am I to argue with one of its steering committee members ;-),
but the TeraWiki and other official TeraGrid sites state
prominently:
"TeraGrid's Vision: Deep, Wide, Open"
I see similar mission statements for other large Grids (I
remember that, for example, the continuation of EGEE funding was
once coupled with the promise to expand the datagrid user group
well beyond the scope of high energy physics). And, in general,
I assume that Grids are not only there for the (relatively)
small group of high performance applications.
But let's assume that a specific Grid indeed serves that
specific community of high performance users: then, yes, it
probably has no incentive to lower the entrance barrier for
them. These users need fine control and rich semantics to get
their peak performance. That does not mean, however, that Grids
are successful and easy to use for everybody else, or in fact
for anybody else...
Again, as said above: adding a simpler, higher-level interface,
like Nimbus, on top of the TeraGrid -- narrow in the sense that
it supports only a single usage mode (or a small number of
them), but generic enough to be useful to a large application
class -- effectively turns TeraGrid into a Cloud.
That way, TG would not lose a single user (the rich 'native'
interfaces are still present), but would have a huge potential
to be attractive to the _average_ scientist. I see this as a
win-win situation, with no negative impact on any side, and with
minimal cost impact.
> The implications of these different foci for users are
> tremendous. On EC2, I give my credit card and start a VM--a
> few seconds. On TeraGrid, I request an allocation (which may
> not be granted!), get an account, submit a request to run a
> job (they won't allow me to start a VM), wait in the queue--a
> many week process. Furthermore, I sometimes find that the
> remote access interfaces fail because keeping them running is
> not high priority.
Is that the vision of Grid computing as a pervasive, ubiquitous
resource for everybody that we all had, back in the days of your
and Carl's first Grid book?
> This alternative perspective is I think more revealing about
> the sources of the differences and the ways we might address
> them. If we want on-demand, high-quality, compute and storage
> services, then we need either to create an economic system in
> which academic providers are motivated to provide such
> services, or decide to outsource to industry.
You touch on an interesting point here: many people I have met
argue that the real defining feature of a Cloud is not
technical, but rather its business model, which then dictates
all other Cloud attributes.
I find that perspective intriguing for its simplicity, but I
can't make up my mind: it is so unsatisfying to conclude that
academia can never achieve a similar ease of use for compute
resources, solely because we do not (and, IMHO, should not)
follow that business model.
> The importance of higher-level interfaces is a separate issue.
> Yes, tools like Hadoop and Swift for data analysis, Introduce
> for service authoring, Taverna for service composition are
> important and necessary. Yes, we should be hoping to leverage
> and influence work done in the far larger corporate market to
> our advantage. (A focus of the upcoming CCA workshop:
> www.cca08.org.)
Great workshop topic btw, we submitted, too ;-)
> 3) "Grids as currently designed and implemented are difficult to
> interoperate." The authors make a big deal of this point, but
> it is not clear to what purpose.
>
> It is true that interoperation is not automatic. [If only
> everyone used Globus software, then all would be well :)
:-D
Or, as Tanenbaum once put it: 'The nice thing about standards
is that you have so many to choose from.'
> -- although of course the policy issues would remain]. But I
> am not sure that this is a significant problem for users, or
> hard to achieve when it is needed. E.g., the caBIG team
> recently demonstrated a gateway to TeraGrid. The LHC
> Computing Grid integrates resources worldwide. Etc. Most users
> never ask about interoperability, in my experience.
Interop has two dimensions, in our opinion: system
interoperability (a Grid can utilize resources of another Grid),
and application interoperability (an application written for
Grid A can also run on Grid B, w/o major changes). We mostly
focus on the second one, and should make that clearer in the
paper. Thanks for pointing that out.
So, to conclude: I strongly object to your claim that Grids are
successful. Sure, it always depends on your metric for success,
but given both the original Grid vision and the mission
statements of prominent Grid infrastructures, I simply cannot
agree with you. The other points are well taken.
Thank you very much for your thoughts - they help to put our
argumentation into a wider perspective.
Best wishes,
Andre.
(PS: phew, that got longer than expected/intended, sorry for that...)
Posted by: Andre Merzky | September 19, 2008 at 03:32 AM
> immediately to mind are the Earth System Grid, cancer Biomedical
> Informatics Grid, and the LIGO Scientific Collaboratory, but as it
Yes, there are wonderful examples of successful Grids.
We have two immediate points to make:
i. It is critical to note that all the successful Grid examples
provided are, by our definition, *narrow* Grids -- which, we
state several times, are arguably the only Grids fit for
purpose. Thanks for helping make our point!
ii. Notably absent from the list was any mention of the
TeraGrid.
But a few positive counter-examples do not change the overall
state of despair, dysfunction and disrepute. We are not arguing
that Grids as a concept have failed. We are saying that at some
point along the evolution, Grids became difficult to use as
distributed systems. We agree the reasons are complex, and we
are the first to warn against over-simplified explanations.
> No evidence is provided for this assertion that complex interfaces
> are the reason for the difficulties people have with grids. I argue
> that the issues are more complex.
True, we don't provide much evidence, other than our documented
experiences talking to the community about what they feel are some of
the challenges.
> First, the interfaces themselves are not, in my view, a significant
> issue.
It is not about the technology (REST vs. WS vs. Nimbus, etc.),
it is about the level of detail being exposed.
Note that we are not arguing that Globus is not fit for its
purpose - it probably is. We argue that if you take a
Globus-based Grid and implement something like Amazon EC2's API
on top of it, along with the SLAs, usage policies, and business
model, you turn that Grid into a Cloud, as the _exposed_
semantics would be limited, and it would focus on a much smaller
set of usage modes (i.e., on specific application classes),
which would then be very easy (i.e., trivial) to use. Rinse and
repeat for other application classes, i.e., for clouds with
other usage modes (Amazon's Storage, Queuing, DB clouds, etc.).
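To be concrete about how small such an exposed surface could be,
here is an entirely hypothetical sketch in Java (every name is
invented for illustration, deliberately echoing EC2's
run/describe/terminate style):

  // Hypothetical narrow compute-cloud facade over a Globus-based
  // Grid; all names are invented, this is not an actual API.
  public interface NarrowComputeCloud {
      // Boot count instances of a stored image; returns opaque ids.
      String[] runInstances(String imageId, int count);

      // Poll an instance's state: "pending", "running", ...
      String describeInstance(String instanceId);

      // Shut one instance down.
      void terminateInstance(String instanceId);
  }

An implementation would translate these three calls into GRAM
submissions, queue queries and job cancellations underneath; the
point is that only these three calls are _exposed_ to the user.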
> * TeraGrid (to use a US example) is designed to support
> a small number of extreme computing users, with a negative
> return to scale (the more users, the more work for fixed
> budget);
Also, it is important to think in terms of consistent/sustained
use of general-purpose Grids (or the lack thereof) as
distributed systems, versus occasional heroic efforts. If the
TeraGrid is used as a distributed system only from one heroic
effort to another, and not in between, then that is a serious
issue for all concerned. We would love to hear of science
projects (even extreme users) that have used the TeraGrid as a
distributed system, as opposed to lumps of big iron, and
produced scientific results (that might not have been possible
otherwise) *without a legion of support staff/resource providers
backing their efforts*. Once again, there might be a couple of
positive counter-examples, but at best there will be a few. That
is why it is important to separate the heroic from the routine.
> 3) "Grids as currently designed and implemented are difficult to
> interoperate." The authors make a big deal of this point, but it is
> not clear to what purpose.
Interop has two dimensions in our opinion: system
interoperability (a Grid can utilize resources of another Grid),
and application interoperability (an application written for
Grid A can also run on Grid B, w/o major changes). We are mostly
concerned with the latter ("As an application developer/user,
what do I care if it is a Grid, a Cloud, or a Grid-of-Clouds,
Clouds-of-Grids...."), and we mention "application-level
interoperability", though we could make that more explicit in
the paper.
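A hypothetical sketch in Java of what application-level
interoperability means to us (all names below are invented for
illustration):

  // Backend-neutral abstraction: an application written against
  // these two interfaces runs unchanged on Grid A, Grid B, or a
  // Cloud, provided someone supplies the backend-specific
  // implementation behind them.
  interface Job {
      void waitFor() throws InterruptedException;
      String getState();   // e.g. "running", "done", "failed"
  }
  interface JobService {
      // The url selects the backend, e.g. "gram://gridA/..."
      // or "ec2://us-east/..." (both made-up schemes).
      Job submit(String url, String executable, String... args);
  }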
> Most users never ask about interoperability, in my experience.
There is an element of cause vs. effect: it is currently so
difficult at the application level that hardly anyone dares ask
for it.
Posted by: S jha | September 20, 2008 at 07:28 PM
In Shantenu's thoughtful response, he uses the number of classes and methods as a measure of the relative complexity of GRAM and EC2. I disagree with this measure, for two reasons:
A) When he contrasts GRAM Java APIs and EC2 APIs, he is comparing apples and oranges.
1) GRAM and EC2 don't do the same thing. If you want to start a virtual machine (a far simpler task than starting, monitoring, and controlling a job), then you can use Nimbus. See http://workspace.globus.org/clouds/cloudquickstart.html for their interface. (It also supports EC2, if you like that!)
So the Globus and EC2 interfaces for doing the same thing are of comparable complexity.
2) Even when comparing apples and oranges, one should look at the right interfaces. GRAM has extremely simple interfaces for people that want to do simple things. E.g., see:
http://vdt.cs.wisc.edu/releases/1.3.7/submitting_wsgram_jobs.html
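From memory, the simple case is essentially a one-liner (exact flags depend on the release; see the page above for the authoritative version):

  globusrun-ws -submit -c /bin/date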
It also has more complex interfaces for people that want to do other things, like delegate credentials, optimize job flow, etc. Shantenu is right that these interfaces exist. He is wrong in his implication that people need to know about them. Most GRAM users never look at them. Some do, and for them they are important.
B) I agree that TeraGrid is not a good source of on-demand computing cycles, for two reasons:
1) It supports only job submission, not VM creation, and that creates big problems for many users. We've been asking for some years now that they support VMs, but they don't want to do so, for reasons both good and bad.
2) Its scheduling and operations policies are not appropriate. Batch scheduling policies are not well suited for people that want computing on demand. The operations policies of TeraGrid sites don't tend to provide the availability that users need for outward-facing services like GRAM and GridFTP. The same comments apply as for (1).
These are implementation/operation issues, not interface issues--as I argued in my original post.
Posted by: Ian Foster | September 21, 2008 at 10:48 AM