Ed Walker wrote a nice article last year in which he used the well-known NAS parallel benchmarks to compare the performance of a commercial infrastructure-as-a-service offering (Amazon EC2) with that of a high-end supercomputer (the National Center for Supercomputing Applications' "Abe" system). Not surprisingly, the supercomputer was faster. Indeed, it was a lot faster, due primarily to its superior interprocessor interconnect. (The NAS benchmarks, like many scientific applications, perform a lot of communication.)

However, before we conclude that EC2 is no good for science, I'd like to suggest that we consider the following question: what if I don't care how fast my programs run, but simply want to run them as soon as possible? In that case, the relevant metric is not execution time but elapsed time from submission to completion of execution. (In other words, the time we must wait before execution starts becomes significant.)
For example, let's say we want to run the LU benchmark, which, based on the numbers in Ed's paper, takes ~25 secs on 32 supercomputer processors and ~100 secs on 32 EC2 nodes. Now let's add in queue and startup time:
- On EC2, I am told that it may take ~5 minutes to start 32 nodes (depending on image size), so with high probability we will finish the LU benchmark within 100 + 300 = 400 secs.
- On the supercomputer, we can use Rich Wolski's QBETS queue time estimation service to get a bound on the queue time. When I tried this in June, QBETS told me that if I wanted 32 nodes for 20 seconds, the probability of me getting those nodes within 400 secs was only 34%: not good odds.
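The comparison above can be sketched numerically. Here is a minimal, hypothetical model in Python: it treats the EC2 startup time as fixed and, purely for illustration, fits a simple exponential curve to the single QBETS data point quoted above (34% chance of getting 32 nodes within 400 secs). The function names and the exponential form are my assumptions, not anything QBETS actually computes.

```python
import math

# Figures taken from the discussion above (all in seconds)
EC2_STARTUP = 300   # ~5 minutes to start 32 EC2 instances (assumed fixed)
EC2_RUNTIME = 100   # LU benchmark on 32 EC2 nodes
HPC_RUNTIME = 25    # LU benchmark on 32 supercomputer processors

def ec2_done_by(t):
    """Probability the EC2 run has finished by time t (startup treated as fixed)."""
    return 1.0 if t >= EC2_STARTUP + EC2_RUNTIME else 0.0

def queue_cdf(t):
    """Illustrative stand-in for the QBETS queue-wait estimate: an
    exponential distribution calibrated so P(wait <= 400 s) = 0.34."""
    rate = -math.log(1 - 0.34) / 400.0
    return 1 - math.exp(-rate * t)

def hpc_done_by(t):
    """Probability the supercomputer run has finished by time t:
    the job must clear the queue by t minus its 25 s runtime."""
    return queue_cdf(max(t - HPC_RUNTIME, 0.0))

print(ec2_done_by(400))             # 1.0 -- EC2 finishes by 400 s
print(round(hpc_done_by(400), 2))   # roughly one in three under this toy model
```

Under these assumptions EC2 wins the race to 400 secs with near certainty, while the supercomputer's chances sit around one in three, which is the intuition behind the bet below.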
So, based on the QBETS predictions, if I had to put money on the system on which my application would finish first, I would have to go for EC2.
Here is a more detailed plot showing cumulative probability of completion (the Y-axis) as estimated by QBETS as a function of time since submission (the X-axis). We see that the likelihood of my application completing on EC2 is zero until around 400 seconds, when it rapidly rises to one. For the supercomputer, the probability rises more slowly, peaking at around 0.97. (I would think that the fact that the supercomputer estimate does not reach 1 relates to the lack of data available to QBETS for long-duration predictions.)
Note that in creating this graph, I do not account for application-dependent startup time on the supercomputer or for any variability in the startup time of the EC2 instances. (Looking at Cloudstatus, the latter factor seems to be relatively minor.)
This result really just reflects the scheduling policies (and loads) that the two systems are subject to. Supercomputers are typically scheduled to maximize utilization; infrastructure-as-a-service providers presumably optimize for response time.
Nevertheless, these data do provide another useful perspective on the relative capabilities of today's commercial infrastructure-as-a-service providers and supercomputer centers.