Mike Wilde
Argonne National Laboratory
The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration

The NSF-sponsored Grid Physics Network project (GriPhyN) couples four large-scale experiments in physics and astrophysics (ATLAS, CMS, LIGO, and SDSS) with a team of computer scientists to develop approaches for harnessing the Grid for data-intensive scientific collaborations. GriPhyN’s goal is to create a “Virtual Data Grid” that enables scientists to use large-scale Grid resources with greater ease, creating a powerful environment for data analysis and management that facilitates discovery, collaboration, composition, and validation.
Virtual data describes, in an abstract, machine-independent manner, the graph of data transformations that are applied to an input set of data objects to create a specific output set of objects. Such descriptions serve both as a log of what has been done and as a specification of future work to perform. Virtualizing data helps us better understand what the data represent, both their meaning and their validity. Applying a virtual data approach to scientific data analysis enables us to create an environment in which the derived data objects in a scientific data collection are automatically annotated with a precise description of how each object was produced, recording the input datasets, transformations, and output datasets for each result. We express these relationships in a virtual data language, VDL, whose definitions are stored in virtual data catalogs. These catalogs can be queried to find data-production “recipes” based on datasets, argument values, or other metadata. Using these recipes as patterns, new data analysis avenues and approaches can be explored, documented, validated, cataloged, and shared.
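To make the idea of a queryable recipe concrete, the following is a minimal Python sketch of a provenance catalog. It is not VDL and not the GriPhyN toolkit: the names Derivation, VirtualDataCatalog, producer_of, recipe, and find_by_argument, along with the dataset and transformation names, are hypothetical and chosen only for illustration. It assumes each dataset is produced by at most one recorded derivation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Derivation:
    """One recorded application of a transformation (hypothetical schema)."""
    transformation: str  # name of the program or procedure applied
    inputs: tuple        # names of the input datasets
    outputs: tuple       # names of the output datasets
    args: tuple = ()     # (name, value) argument pairs


@dataclass
class VirtualDataCatalog:
    """Toy in-memory catalog; an illustration, not the GriPhyN implementation."""
    derivations: list = field(default_factory=list)

    def record(self, derivation):
        """Log a derivation; the same record doubles as a future work spec."""
        self.derivations.append(derivation)

    def producer_of(self, dataset):
        """Return the derivation that outputs `dataset`, or None for raw input."""
        for d in self.derivations:
            if dataset in d.outputs:
                return d
        return None

    def recipe(self, dataset):
        """Return the derivations needed to produce `dataset`, inputs first."""
        ordered, seen = [], set()

        def visit(name):
            d = self.producer_of(name)
            if d is None or d in seen:
                return
            seen.add(d)  # mark before recursing so a cycle cannot loop forever
            for parent in d.inputs:
                visit(parent)
            ordered.append(d)

        visit(dataset)
        return ordered

    def find_by_argument(self, name, value):
        """Query derivations by an argument value."""
        return [d for d in self.derivations if (name, value) in d.args]


# Hypothetical two-step analysis: calibrate raw data, then histogram it.
catalog = VirtualDataCatalog()
catalog.record(Derivation("calibrate", ("raw.dat",), ("calibrated.dat",),
                          (("gain", 1.5),)))
catalog.record(Derivation("histogram", ("calibrated.dat",), ("hist.dat",),
                          (("bins", 100),)))

steps = [d.transformation for d in catalog.recipe("hist.dat")]
print(steps)
```

Walking the graph backward from a requested dataset yields the ordered list of transformations that would regenerate it, which is the sense in which one description serves as both a log and a specification.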
This talk will describe the underlying concepts of virtual data, our current implementations of a virtual data toolkit, and our experiences in applying these tools and methods to several Grid-based challenge problems in the GriPhyN experiments and related data-intensive efforts.