
August 05, 2009

Comments on "What's faster--a supercomputer or EC2?"

Dan Katz

Once you start looking at using queue prediction, you can also start examining other ways of doing work quickly, such as splitting an MPI job across multiple systems where the small sub-jobs will start faster than a single large job on one system.

See: DOI: 10.1098/rsta.2009.0054 or http://rsta.royalsocietypublishing.org/content/367/1897/2545.full
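
A minimal sketch of the trade-off described above, assuming that a job split across systems can only begin once every sub-job is running. The wait times and the helper name `split_start_wait` are hypothetical illustrations, not values or code from the cited paper.

```python
def split_start_wait(sub_job_waits):
    """Effective wait of a split job: it starts only when the slowest
    sub-job has started, so take the maximum predicted wait."""
    return max(sub_job_waits)

# Hypothetical predicted queue waits, in seconds.
single_large_wait = 3600        # one large job on a single system
split_waits = [300, 450]        # two smaller sub-jobs on two systems

split_wait = split_start_wait(split_waits)
better = "split" if split_wait < single_large_wait else "single"
print(f"single: {single_large_wait}s, split: {split_wait}s -> prefer {better}")
```

With real queue predictions in place of the made-up numbers, the same comparison would show when splitting pays off. It ignores the cost of cross-system communication once the job is running.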

Jan-Philip Gehrcke

Hello Ian,

Interesting comparison. A few things came to mind while reading it:

(0) If you had shown error bars in the graph, it wouldn't look this unambiguous. Sorry, I'm a physicist and we never trust any graph without error bars. Especially this one, because I think the errors in your estimates are very big ;-). At least I can tell for EC2, see (1).

(1) While starting VMs on EC2 is often as fast as you assumed (around 5 minutes), I have also often experienced times around 10 minutes, even for small images. This would change the graph a bit.

(2) I don't know how you pay for a supercomputer, but EC2 instances are billed per hour. It's simply uneconomical to run EC2 instances for less than an hour. Hence, to make a reasonable comparison, one has to set up a benchmark that occupies the EC2 instances for at least one hour. This would change the whole examination dramatically, since then "the time that we must wait before execution starts becomes" less significant compared to the execution time. Then the supercomputer will take the lead again, because of its factor of 4 in speed.

(3) Most people are not lucky enough to have access to a supercomputer. But they have access to EC2, instantly! I think that is the real advantage of EC2 :-)
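
The arithmetic behind point (2) can be sketched as follows. This is an illustration under stated assumptions. The 5-minute EC2 startup and the factor of 4 come from the comment above, while the 30-minute supercomputer queue wait is a hypothetical placeholder.

```python
import math

def time_to_result(wait_s, run_s):
    """Total time from submission to result: waiting plus execution."""
    return wait_s + run_s

def billed_hours(run_s):
    """Per-hour billing as described in point (2): partial hours round up."""
    return math.ceil(run_s / 3600)

EC2_STARTUP_S = 5 * 60    # about 5 minutes, from point (1)
SPEEDUP = 4               # supercomputer assumed 4x faster, from point (2)
QUEUE_WAIT_S = 30 * 60    # hypothetical supercomputer queue wait

for ec2_run_s in (100, 3600):
    ec2_total = time_to_result(EC2_STARTUP_S, ec2_run_s)
    sc_total = time_to_result(QUEUE_WAIT_S, ec2_run_s / SPEEDUP)
    winner = "EC2" if ec2_total < sc_total else "supercomputer"
    print(f"EC2 run {ec2_run_s}s: EC2 total {ec2_total}s, "
          f"supercomputer total {sc_total:.0f}s, "
          f"billed hours {billed_hours(ec2_run_s)} -> {winner}")
```

Under these assumed numbers the short job finishes first on EC2 and the one-hour job finishes first on the supercomputer, while both are billed as one full EC2 hour.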

Thank you for inspiring this discussion!

Sincerely,

Jan-Philip Gehrcke

Ian Foster

Jan-Philip:

Your comment about error bars hurts--I am always bugging my students on that point :-)

I poked around a bit to see if I could generate error bars easily, but couldn't find a way. As this was meant to be a semi-humorous commentary, I didn't look too hard!
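
For what it's worth, one simple way to get an error bar is to repeat the measurement and report the mean with its standard error. The sketch below assumes that approach. The sample values are made up for illustration and are not measurements from the post.

```python
import statistics

def mean_with_error(samples):
    """Return (mean, standard error of the mean) for repeated measurements."""
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5
    return mean, sem

# Hypothetical EC2 startup times in seconds (not real measurements).
ec2_startup_s = [290, 310, 600, 305, 580, 300]

mean, err = mean_with_error(ec2_startup_s)
print(f"EC2 startup: {mean:.0f} +/- {err:.0f} s")
```

The resulting pair could then be handed to whatever plotting tool drew the original graph as the point and its error bar.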

Thanks for your excellent comments.

Regards -- Ian.

Alfonso Olias

Dear Ian
We recently did a study and we proved that Amazon EC2 is suitable for science, because there are many other variables you have to consider, such as electricity, storage and sys-admin costs.

I agree that in terms of speed Amazon EC2 cannot compete with a supercomputer once the process is running on both systems, or even with a dedicated in-house cluster, as virtual machines always have an overhead.

I would like you to read this blog post about our experiment, and the slides.

http://aws.typepad.com/aws/2009/06/scaling-to-the-stars.html

http://www.theserverlabs.com/blog/2009/06/22/the-server-labs-cloud-computing-expo-09-update/

Mukund

It's really important to understand what types of applications are a good fit for public clouds, as they exist today.

Cloud platforms cannot be workload-agnostic given the current state of compute/storage/interconnect design. While cloud providers don't necessarily offer SLAs specific to any particular workload category, it is safe to say that most vendors address typical web application architectures (and then extend their capabilities to other distributed application frameworks such as Hadoop). If you can run BI or HPC apps, fine, as long as the platform meets your needs, budget and SLAs. Recent hardware trends (e.g. multi-core CPUs, 10GE, flexibility due to application-level frameworks), more than Moore's law, probably play a dominant role in accommodating a larger set of workloads beyond typical web apps in public clouds. Of course, consolidation of different interconnect technologies (e.g. converged network/IO adapters) and improvements in server/storage/network design and data center organization might make for interesting, workload-agnostic clouds, but that's in the future.

Kent Langley

Please raise your hand if you have access to an actual supercomputer.

Okay...

Please raise your hand if you have access to EC2.

Okay...

If you want to run a test, then whether it takes 25 sec., 100 sec. or 1000 sec., at least you can run your job on EC2 right now for a tiny cost relative to acquiring a supercomputer or supercomputer time. I think this is the most important thing.

More experiments will be possible for more people than ever before. It might not be quite as efficient on a per-job basis, but it certainly is effective as a whole and democratizes access to significant computing power for real work.

DD

If many users of a supercomputer typically allocated 32 nodes for 20 seconds, the queue system would be set up so that such jobs would not take a long time to start.
The reason it takes a long time to get started on a supercomputer would be either A) too many users or (most likely) B) most users run longer jobs, so the queue system has been set up not to prioritize smaller jobs.

Ian Foster

Yes, I allude to this factor when I comment that "This result reflects really just the scheduling policies (and loads) that the two systems are subject to." But this observation does not make the effect any less real.

Matthew Arrott

Ian,

This is a great recap and comparison of execution environments for the class of HPC applications that have been dominant to date: namely, batch job submissions that are self-contained and have a finite existence (execution time).

Going forward, more and more apps are going to require collaboration with other apps to achieve their desired outcome. More and more of these collaborations will operate in real time using messaging rather than scheduled workflows coordinated with file transfers. To date we have referred to this class of collaborating applications as a Service.

Your posting is interestingly mute on the natural execution environment for a Service, which:
1) has a life cycle of its own, outside any specific HPC client app;
2) needs to be prepared to start executing within milli-, if not micro-, seconds; and
3) needs to scale (up and down) to meet the demand placed on it by its client applications.

It is my view that we are now seeing multiple examples of environmental models that are run on a continual or recurring basis to produce revised forecasts, nowcasts and hindcasts representing the current state of an environment for use in decision support applications. These are HPC applications operating as Services.

The HPC model of batch computing is mute on how to support Service applications, and Cloud-style resource allocation (the deployment model) is currently the only game in town. I suspect it is almost self-evident that the current notion of the HPC-style job queue is not going to work for Services. Going forward, supercomputing centers are going to have to allocate some fraction of their resources to Services in order to support the next generation of HPC applications that leverage them. (Yes, OSG supports this notion.)

Thank you very much for this posting. It has definitely helped us better scope/understand the concerns of the HPC community and its lukewarm reception of the cloud deployment model.

Matthew

zinc

Ed Walker has really done a great job writing such a great article. I would like to go with the super EC2.
