The Univa guys announced that they have contributed their Data Distribution Manager software to dev.globus as a new open source incubator project.
The following extract from an email posted by Steve Tuecke to the DDM/dev.globus mail list summarizes some key features:
The code that Univa just contributed to the dev.globus DDM incubator we are calling "beta". It has been through substantial (several months) testing and performance scaling, and is reasonably stable. But it is not yet what we would consider ready for production use. This version is really intended for experimentation and early evaluation, and to be a basis to start involving a larger community in its evolution and development. Using the Globus Toolkit's definitions of "beta" and "alpha", this would be an alpha version, because the public interfaces will be changing some prior to the 1.0 release. However, it's at more of a beta level wrt testing.
Univa is planning substantial additions and changes in the next version. In the coming weeks we'll be posting detailed design documents for the current version (so you can see what the current version is doing under the covers without having to read all the source), as well as the work-in-progress design docs for the next version. We'll also be shifting our engineering team over to doing our work in the dev.globus svn. In other words, the DDM code is now available, and we are in the process of opening up the rest of our design docs and engineering processes around DDM.
The biggest changes in the next version will be (1) a move from the
current single master / multiple worker model, to a more resilient
and scalable peer-to-peer/multi-master model, and (2) an extensible
data set description mechanism that can integrate with external
metadata catalogs. There will be copious details on these and all of
the other planned features coming soon.
The following are some of its key features:
- Multi-site replication with updates: DDM does not assume a write-once/read-many model. While it can certainly support that, it is designed to also handle updates to replicated files. And it does so in a multi-site setting, where it is tracking what file contents (not just file names, but also fingerprints of file contents) have been replicated where, with replica selection for choosing the "best" source for a particular request.
- Synchronize only what has changed: It is not just data transfer, but also tracking of file contents and re-synchronization. The sync can be done either on full files (e.g. when a new file is created), or partial files (using rsync under the covers). In some sense, you can think of DDM as a reliable, multi-site rsync service.
- Fault tolerance: Configurable policies for backoff and retry, including failover to alternate sources (when there are multiple replicas of a file). Of course, the DDM service itself manages its state in a backend database, so that it is also resilient to server failures.
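To make the fingerprint and failover ideas above a little more concrete, here is a minimal Python sketch. It is not DDM code and does not use any DDM API: the function names, the choice of SHA-256 as the fingerprint, the retry count, and the exponential backoff schedule are all assumptions made for illustration only.

```python
import hashlib
import time
from typing import Callable, Iterable, Optional


def fingerprint(data: bytes) -> str:
    """Content fingerprint. SHA-256 is an assumption, not DDM's documented choice."""
    return hashlib.sha256(data).hexdigest()


def needs_sync(local: Optional[bytes], wanted_fingerprint: str) -> bool:
    """A file needs syncing when it is missing or its contents differ."""
    return local is None or fingerprint(local) != wanted_fingerprint


def fetch_with_failover(
    sources: Iterable[str],
    fetch: Callable[[str], bytes],
    wanted_fingerprint: str,
    retries_per_source: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Try each replica in order, backing off exponentially between retries.

    `fetch` is a hypothetical transport callback that returns the file bytes
    or raises OSError on a transfer failure.
    """
    last_error: Optional[Exception] = None
    for source in sources:
        for attempt in range(retries_per_source):
            try:
                data = fetch(source)
            except OSError as err:
                last_error = err
                # Back off before retrying the same source, but not after
                # the final attempt, where we fail over instead.
                if attempt < retries_per_source - 1:
                    sleep(base_delay * 2 ** attempt)
                continue
            if fingerprint(data) == wanted_fingerprint:
                return data
            # Wrong contents: retrying will not help, so fail over at once.
            last_error = OSError(f"stale replica at {source}")
            break
    raise OSError("all replicas failed") from last_error
```

A caller would first check `needs_sync` against the wanted fingerprint and only then call `fetch_with_failover` with the candidate replicas ordered by whatever "best source" policy applies. The sketch always moves whole files; the partial-file case that the email attributes to rsync is not modelled here.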
The next version adds:
- Extensible data set descriptions: The next version of DDM will have an extensible data set description model, which can plug into external metadata catalogs via a Java API. This will allow for rich and extensible ways of specifying what you want transferred. For example, I could say "transfer all 'blue' things to site X", where DDM will resolve 'blue' into specific files via an external metadata catalog, and then do all of its multi-site synchronization from that.
- Service availability: The peer-to-peer/multi-master model means that requests can be submitted and monitored in a distributed environment, even when WAN links are down. This gives clients much more reliable access to the DDM service, so they can have higher assurance of their ability to submit and monitor requests.
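The "transfer all 'blue' things to site X" example can be sketched as a small catalog interface. The email says the real plug-in point will be a Java API that has not been published yet, so everything below (the `MetadataCatalog` protocol, the `TagCatalog` class, `plan_transfer`, and the file paths) is hypothetical and written in Python purely for illustration.

```python
from typing import Dict, List, Protocol, Set


class MetadataCatalog(Protocol):
    """Hypothetical plug-in interface: turn a description into file paths."""

    def resolve(self, query: str) -> Set[str]: ...


class TagCatalog:
    """Toy in-memory catalog mapping each file path to a set of labels."""

    def __init__(self, tags: Dict[str, Set[str]]) -> None:
        self._tags = tags

    def resolve(self, query: str) -> Set[str]:
        # A file matches when the query string is one of its labels.
        return {path for path, labels in self._tags.items() if query in labels}


def plan_transfer(
    catalog: MetadataCatalog, query: str, already_at_site: Set[str]
) -> List[str]:
    """Resolve the description, then list the files the target site lacks."""
    return sorted(catalog.resolve(query) - already_at_site)


catalog = TagCatalog(
    {
        "/data/a.dat": {"blue", "raw"},
        "/data/b.dat": {"red"},
        "/data/c.dat": {"blue"},
    }
)
to_send = plan_transfer(catalog, "blue", already_at_site={"/data/a.dat"})
```

Here presence at the target site is tracked by file name only, which is a simplification: a fuller version would compare content fingerprints as well, since the email describes DDM as tracking file contents and not just names.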
So what it does for the data landscape is provide much richer capabilities for distributed file management than are available in any of the current Globus tools.
