I mentioned the upcoming Workshop on Many-Task Computing on Grids and Supercomputers. (Submission deadline: August 15.) What, you may wonder, do we mean by "many-task computing"?
I (and my co-organizers Ioan Raicu and Yong Zhao) use the term to denote high-performance computations comprising multiple distinct activities, coupled via (for example) file system operations or message passing. Tasks may be small or large, uniprocessor or multiprocessor, compute-intensive or data-intensive. The set of tasks may be static or dynamic, homogeneous or heterogeneous, loosely or tightly coupled. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large.
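To make this concrete, here is a minimal sketch in Python. The task bodies and file layout are placeholders of my own, not a real application: many small tasks run concurrently, and a distinct downstream task is coupled to them through the file system rather than through MPI.

```python
# An illustrative many-task computation: distinct activities coupled
# via file system operations. All task bodies are placeholders.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

WORK = Path("work")

def simulate(i):
    """One of many independent compute tasks; writes its result to a file."""
    out = WORK / f"sim_{i}.txt"
    out.write_text(str(i * i))  # stand-in for a real computation
    return out

def analyze(paths):
    """A distinct downstream task, coupled to the others via their files."""
    return sum(int(p.read_text()) for p in paths)

if __name__ == "__main__":
    WORK.mkdir(exist_ok=True)
    with ProcessPoolExecutor() as pool:
        outputs = list(pool.map(simulate, range(8)))  # many small tasks
    print(analyze(outputs))  # 140: the sum of squares 0..7
```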
Were we right to coin a new term, many-task computing, to denote such applications? There are certainly alternatives that we could have used instead. For example:
- Multiple Program Multiple Data (MPMD): A variant of Flynn's original taxonomy, used to denote computations in which several different programs each operate on different data at the same time. (In contrast to SPMD, in which multiple instances of the same program each execute on different processors, operating on different data.) Not a bad term, really, although in our case the set of tasks can vary dynamically. Maybe we should say dynamic MPMD? (See the sketch after this list.)
- Heterogeneous applications: Applications that involve multiple, different parts. Not a bad term, but rather unspecific. Perhaps a synonym for MPMD?
- High throughput computing (HTC): A term coined by Miron Livny for workloads in which the key metric is not floating point operations per second (as in high performance computing: HPC) but floating point operations per month or per year. We didn't use that term because the applications we work with are often just as concerned with performance as the most demanding HPC applications--they want to run in minutes or hours; they just don't happen to be SPMD programs.
- Workflow: Surely one of the most abused terms in computing, workflow was first used to denote sequences of tasks in business processes, but it is sometimes also used to denote any computation in which control passes from one "task" to another. I find its use to describe many-task (or MPMD, or heterogeneous, or ...) computations an unwarranted perversion of the English language.
- Capacity computing: A term used to denote a computing resource designed to support many small tasks -- in contrast to a capability computing resource, on which a single large computation can run efficiently. I see the same problem here as with HTC: many-task computations, while heterogeneous, can be extremely large and can place great demands on a computing system.
- Embarrassingly (or happily) parallel: A delightful term used to denote parallel computations in which each individual (often identical) task can execute without any significant communication with other tasks or with a file system. Certainly some "many-task applications" will be simple and happily parallel. But others will be bothersomely complex and communication intensive, interacting frequently with other tasks and/or a file system.
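To illustrate the dynamic MPMD idea from the first item above, here is a small sketch. The executables ./prepare and ./simulate are hypothetical placeholders, and the scheduling is deliberately naive: several different programs run at once (the MPMD part), and the task set grows as earlier tasks complete (the dynamic part).

```python
# An illustrative "dynamic MPMD" driver. The executables named here
# ("./prepare", "./simulate") are hypothetical placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

def run(cmd):
    """Run one task as an external program; return its exit code."""
    return subprocess.run(cmd).returncode

with ThreadPoolExecutor(max_workers=4) as pool:
    # MPMD: distinct programs, each operating on its own data.
    first_wave = [pool.submit(run, ["./prepare", f"in_{i}.dat"]) for i in range(4)]
    for task in as_completed(first_wave):
        if task.result() == 0:
            # Dynamic: extend the task set in response to earlier results.
            pool.submit(run, ["./simulate", "stage2.dat"])
    # On exit, the pool waits for all submitted tasks, including new ones.
```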
Are we making a useful distinction in using the term "many-task computing" rather than one of those above, or just engaging in unnecessary neologism? Tell me what you think!
Perhaps we could simply have said: applications that are communication-intensive but are not naturally expressed in MPI. In that sense (and this is really the primary goal of the workshop) we are simply drawing attention to the many computations that are heterogeneous but not "happily parallel." Such computations can arise for a variety of reasons, such as:
- Individual tasks are themselves parallel programs.
- Many tasks operate on the same input data, and we can use the fast network to broadcast that data to all nodes, or to distribute references to data subsets if that is more efficient (or if the data cannot be replicated).
- There is considerable communication between tasks.
- There is a need for substantial distributed data reduction operations prior to output (as in the sketch below).
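As a sketch of that last point (again with placeholder task bodies; a real application would move far more data): partial results are merged pairwise in parallel, so no single node has to absorb all of the data at once.

```python
# Many tasks produce partial results (here, small histograms) that are
# reduced pairwise in parallel before output. Task bodies are placeholders.
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def partial_histogram(i):
    """Stand-in for one task's data-intensive output."""
    return Counter({i % 3: 1})

def merge(a, b):
    """Combine two partial results."""
    return a + b

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        parts = list(pool.map(partial_histogram, range(16)))
        # Tree reduction: merge pairs in parallel until one result remains.
        while len(parts) > 1:
            odd = parts.pop() if len(parts) % 2 else None
            parts = list(pool.map(merge, parts[0::2], parts[1::2]))
            if odd is not None:
                parts.append(odd)
    print(parts[0])  # totals: 0 -> 6, 1 -> 5, 2 -> 5
```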
In any case, we hope to see submissions from people working on high throughput computing, data-intensive scalable computing, and any other sort of high-performance computing that isn't conventional SPMD.
"Pleasantly Parallel" is a popular term that I've heard in place of "Embarrassingly Parallel"
Posted by: Alliterative Alligator | July 09, 2008 at 04:45 PM
We agree that the term "workflow" has been misused by many in distributed systems. However, the terminology was coined to represent tasks (or bags of tasks) with dependencies (a DAG being just one such model, without cycles or loops). The computation and data characteristics depend on the type of application modeled as a workflow. Even MTC applications could be well represented as workflows, to properly manage and represent the data flows between tasks.
The description "unwarranted perversion of the English language" is a bit harsh to those who coined the word in the first place (e.g., the ISI group at USC), no offense.
Posted by: Suraj Pandey | November 11, 2009 at 09:29 PM
Hi Suraj:
Thanks for your comment.
The term "workflow" long predates its recent use to refer to directed acyclic graphs. E.g., see http://en.wikipedia.org/wiki/Workflow.
A question to ponder: if a DAG is a workflow, what program (sequential or parallel) is not a workflow?
Regards -- Ian.
Posted by: Ian Foster | November 11, 2009 at 09:36 PM