

August 05, 2009

Comments

Dan Katz

Once you start looking at using queue prediction, you can also start examining other ways of doing work quickly, such as splitting an MPI job across multiple systems where the small sub-jobs will start faster than a single large job on one system.

See: DOI: 10.1098/rsta.2009.0054 or http://rsta.royalsocietypublishing.org/content/367/1897/2545.full
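
To make the co-allocation idea concrete, here is a minimal, hypothetical sketch (the site names, predicted waits, and scaling model are all invented, and cross-site MPI overhead is ignored): given predicted queue waits per system, it compares running a 32-node MPI job on one system against splitting it 16+16 across two systems, where the job can start only once both halves are running.

    predicted_wait = {                 # seconds, e.g. from a queue-wait predictor
        "siteA": {32: 1200, 16: 250},
        "siteB": {32: 1500, 16: 300},
    }

    def runtime(total_nodes, serial_time=6400.0):
        """Toy strong-scaling model: perfect speedup in total node count."""
        return serial_time / total_nodes

    def turnaround_single(system, nodes=32):
        """Wait in one queue, then run the whole job on one system."""
        return predicted_wait[system][nodes] + runtime(nodes)

    def turnaround_split(sys_a, sys_b, nodes=32):
        """Co-allocate nodes/2 on each system; the MPI job starts only once
        the slower half has started (cross-site overhead ignored)."""
        half = nodes // 2
        wait = max(predicted_wait[sys_a][half], predicted_wait[sys_b][half])
        return wait + runtime(nodes)

    print("siteA alone:", turnaround_single("siteA"))          # 1200 + 200 = 1400 s
    print("siteB alone:", turnaround_single("siteB"))          # 1500 + 200 = 1700 s
    print("split A+B  :", turnaround_split("siteA", "siteB"))  #  300 + 200 =  500 s

Because the two 16-node requests clear their queues much faster than one 32-node request, the split job finishes first even though the execution time is unchanged.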

Jan-Philip Gehrcke

Hello Ian,

Interesting comparison, and a few things came to mind while reading it:

(0) If you had shown error bars in the graph, it wouldn't look this unambiguous. Sorry, I'm a physicist and we never trust any graph without error bars. Especially this one, because I think the errors in your estimates are very large ;-). I can at least say that for EC2, see (1).

(1) While starting VMs on EC2 is often as fast as you assumed (around 5 minutes), I have also often experienced times of around 10 minutes, even for small images. This would change the graph a bit.

(2) I don't know how you have to pay for a supercomputer, but EC2 instances are billed per hour, so it's simply uneconomical to run EC2 instances for less than an hour. Hence, to make a reasonable comparison, one has to set up a benchmark that occupies the EC2 instances for at least one hour. This would change the whole examination dramatically, since then "the time that we must wait before execution starts" becomes less significant compared to the execution time. Then the supercomputer takes the lead again, because of its factor-of-4 advantage in speed (see the rough sketch after this list).

(3) Most people are not lucky enough to have access to a supercomputer. But they have access to EC2, instantly! I think that is the real advantage of EC2 :-)
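
A rough back-of-the-envelope sketch of point (2), with invented numbers (a 20-minute supercomputer queue wait, a 10-minute EC2 boot, a factor-of-4 speed advantage for the supercomputer, and EC2 billing rounded up to whole instance-hours), shows how the supercomputer regains the lead as execution time grows:

    import math

    def turnaround(exec_time_ec2_s, queue_wait_s=1200.0,
                   ec2_boot_s=600.0, speedup_sc=4.0):
        """Total time to results on each system, plus billed EC2 hours."""
        sc_total = queue_wait_s + exec_time_ec2_s / speedup_sc
        ec2_total = ec2_boot_s + exec_time_ec2_s
        billed_h = math.ceil(ec2_total / 3600.0)  # EC2 bills whole instance-hours
        return sc_total, ec2_total, billed_h

    for exec_s in (100.0, 3600.0, 36000.0):       # short, 1 h, and 10 h EC2 runtimes
        sc, ec2, hours = turnaround(exec_s)
        print(f"EC2 exec {exec_s:7.0f} s | SC total {sc:8.0f} s | "
              f"EC2 total {ec2:8.0f} s | billed {hours} h")

With these made-up numbers, EC2 wins on turnaround for the short job, but once the job itself runs for an hour or more the wait times become noise and the factor-of-4 execution speed dominates.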

Thank you for inspiring this discussion!

Sincerely,

Jan-Philip Gehrcke

Ian Foster

Jan-Philip:

Your comment about error bars hurts--I am always bugging my students on that point :-)

I poked around a bit to see if I could generate error bars easily, but couldn't find a way. As this was meant to be a semi-humorous commentary, I didn't look too hard!
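
(In hindsight, if the raw turnaround-time samples were to hand, something like matplotlib's built-in error-bar support would probably have sufficed; the values below are invented, just to show the shape of the call.)

    import numpy as np
    import matplotlib.pyplot as plt

    # Invented samples of total turnaround time (wait + execution), in seconds.
    samples = {
        "Supercomputer": np.array([425.0, 610.0, 1500.0, 380.0, 900.0]),
        "EC2":           np.array([325.0, 340.0,  610.0, 300.0, 650.0]),
    }

    labels = list(samples)
    means = [samples[k].mean() for k in labels]
    errs = [samples[k].std(ddof=1) for k in labels]   # sample standard deviation

    plt.bar(labels, means, yerr=errs, capsize=8)
    plt.ylabel("Total time to completion (s)")
    plt.savefig("turnaround_with_errorbars.png")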

Thanks for your excellent comments.

Regards -- Ian.

Alfonso Olias

Dear Ian,
We recently did a study showing that Amazon EC2 is suitable for science, because there are many other variables you have to consider, such as electricity, storage, and sys-admin costs.

I agree that in terms of speed Amazon EC2 cannot compete with a supercomputer once the process is running on both systems, or even with a dedicated in-house cluster, as the virtual machines always have an overhead.

I would like to invite you to read this blog post about our experiment, and the slides.

http://aws.typepad.com/aws/2009/06/scaling-to-the-stars.html

http://www.theserverlabs.com/blog/2009/06/22/the-server-labs-cloud-computing-expo-09-update/

Mukund

It's really important to understand what types of applications are a good fit for public clouds as they exist today.

Cloud platforms cannot be workload-type agnostic, given the current state of compute/storage/interconnect design. While cloud providers don't necessarily offer SLAs specific to any particular workload category, it is safe to say that most vendors target the typical web application architecture (and then extend their capabilities to other distributed application frameworks such as Hadoop). If you can run BI or HPC apps, fine, as long as they meet your needs, budget, and SLAs. Recent hardware trends (e.g., multi-core CPUs, 10GE, flexibility from application-level frameworks), more than Moore's law, probably play the dominant role in accommodating a larger set of workloads beyond typical web apps in public clouds. Of course, consolidation of interconnect technologies (e.g., converged network/IO adapters) and improvements in server/storage/network design and data center organization might make for interesting, workload-agnostic clouds, but that's in the future.

Kent Langley

Please raise your hand if you have access to an actual supercomputer.

Okay...

Please raise your hand if you have access to EC2.

Okay...

If you want to run a test, then whether it takes 25 sec or 100 sec or 1000 sec, at least you can run your job on EC2 right now for a tiny cost relative to acquiring a supercomputer or supercomputer time. I think this is the most important thing.

More experiments will be possible than ever before, for more people. It might not be quite as efficient on a per-job basis, but it certainly is effective as a whole, and it democratizes access to significant computing power for real work.

DD

If many users of a supercomputer routinely allocated 32 nodes for 20 seconds, the queue system would be set up so that it would not take long to get such jobs started.
The reason it takes a long time to get a job started on a supercomputer is either A) too many users, or (most likely) B) most users run longer jobs, so the queue system has been configured not to prioritize smaller jobs.

Ian Foster

Yes, I allude to this factor when I comment that "This result reflects really just the scheduling policies (and loads) that the two systems are subject to." But this observation does not make the effect any less real.

Matthew Arrott

Ian,

This is a great recap and comparison of execution environments for the class of HPC applications that has been dominant to date: namely, batch job submissions that are self-contained and have a finite existence (execution time).

Going forward, more and more apps are going to require collaboration with other apps to achieve their desired outcomes. More and more of these collaborations will operate in real time, using messaging rather than scheduled workflows coordinated with file transfers. To date we have referred to this class of collaborating applications as a Service.

Your posting is interestingly silent on the natural execution environment for a Service, which:
1) has a life cycle of its own, outside any specific HPC client app;
2) needs to be prepared to start executing within milli-, if not micro-, seconds; and
3) needs to scale (up and down) to meet the demand placed on it by its client applications.

It is my view that we are now seeing multiple examples of environmental models that are run on a continual, recurring basis to produce revised fore-, now-, and hindcasts representing the current state of an environment for use in decision-support applications. These are HPC applications operating as Services.

The HPC model of batch computing is silent on how to support the Service application, and the cloud style of resource allocation (deployment model) is currently the only game in town. I suspect it is almost self-evident that the current notion of the HPC-style job queue is not going to work for Services. Going forward, the supercomputing centers are going to have to allocate some fraction of their resources to Services in order to support the next-generation HPC applications that leverage them. (Yes, OSG supports this notion.)

Thank you very much for this posting. It has definitely helped us better scope/understand the concerns of the HPC community and its lukewarm reception of the cloud deployment model.

Matthew

zinc

Ed Walker really did a great job writing such a great article. I would like to go with EC2.

Alin

Hi Ian,

So you tell us that the probability that your job starts and executes in ~400 s on a supercomputer is 34%.

What is the probability for EC2 and the 5 minutes? They have a queue policy too. I can bet with you that, in the small print, they write that there is no guarantee on waiting times.

There is another, ethical issue with the commercial clouds. They are there to maximize profits for shareholders, not to offer "free" resources based on the merit of the research (some researchers complain about this, but I suspect they are the same ones who do not like the peer-review system).

If you ask in the research community, there are two main complaints about national supercomputing centers.

1. Queueing times: affected mainly by the load on the SC, the finite nature of the resources, and ad hoc queue policies. I see no reason why a commercial cloud system would not face the same problems, sooner or later.
2. Queue length limits: an ad hoc policy in an SC intended to minimize queueing times. I do not see a cloud system offering you queues of undefined length either.

So at the end of the day, from a purely computational point of view, as a scientist I face the same problems in both systems, with the advantage that in an SC centre I get charged much less for my computation.
You may say that the price per unit of time is different on the two systems, and cheaper on the cloud.

For a cloud, the price will always be set according to the "market" and the profit target of the company.
In a national centre you avoid all this overhead, and in principle, the longer you use the SC, the cheaper it becomes.

One last point: a national centre offers you access to computational scientists, HPC experts, or whatever they are called. In a cloud, what is the price for that?

Alin

Ian Foster

Dear Alin:

Thanks for your comments. In reply:

-- I wasn't arguing that EC2 is cheaper than a supercomputer center, just pointing out that turnaround time is as important as execution time.

-- There is a startup time associated with EC2, but it seems fairly small and predictable. The reason that it is predictable is that they overprovision.

-- I agree that the differences in wait time for supercomputers vs. EC2 are due to scheduling policy. True, but that does not change the fact that the times are different.

-- The notion of an "ethical issue" is (in my view) not tenable. The government or other research sponsors pay for all computing resources used by researchers, whether by giving them funds that they spend on local computers or on Amazon resources, or by buying national supercomputers that are allocated by peer review. I don't see that one method is more ethical than another.

-- We definitely need experts to help people use advanced methods, regardless of where people compute.

Thanks again!

Ian.
