Thursday, December 28, 2006

Supporting Software Development in Virtual Enterprises

Overview of the DHT approach

Our approach to project coordination and sharing of project artifacts is implemented in a framework that employs two complementary forms of information integration:

* logical integration provides a view of the shared information space, based on a virtual central artifact repository, to facilitate project coordination;
* physical integration provides transparent access to objects that appear in the virtual repository but are actually stored and managed in autonomous, distributed, heterogeneous repositories.

The structure of the virtual repository is described with a semantic hypertext data model. [NS91] Hypertext is an information management concept that organizes data into content objects called nodes (containing text, graphics, binary data, or possibly audio and video), which are connected by links that establish relationships between nodes or sub-parts of nodes. The resulting directed graph, called a hypertext corpus, forms a semantic network-like structure that can capture rich data organization concepts while providing intuitive user interaction via navigational browsing.

The DHT hypertext data model augments the traditional node and link model with aggregate constructs called contexts that represent sub-graphs of the hypertext, and dynamic links that allow the relationships among nodes to evolve automatically as artifacts are created and modified. The DHT data model defines the structure of objects in the global hypertext, and the operations (including updates) that may be performed on them.

DHT achieves physical integration with a client-broker-server architecture that provides transparent access to heterogeneous repositories through intermediary information brokers we call transformers. Clients are software tools (or engineering environments) that developers use to access objects concurrently in server repositories.

Over the past five years that the DHT prototype has evolved, about a dozen different types of software development tools and heterogeneous software repositories have been integrated to run within DHT.

2.1 Architecture

The DHT architecture is based on a client-broker-server model. [ACDC96] Clients implement all application functionality that is not directly involved in a server's storage management. Thus, a client is typically an individual tool, but may be a complete software development environment.

Software artifacts are exported from their native repository through server transformers. A transformer is a kind of mediator that exports local objects (artifacts and relationships) as DHT nodes and links, and translates DHT messages into local repository operations (Figure 2). Transformers run at the repository site, typically on the same host as the repository; thus, from the repository viewpoint, the transformer appears to be just another local tool or application.



A request-response style communication protocol implements the operations specified in the DHT data model, [NS91, NS94] and includes provisions for locating transformers and for authenticating and encrypting messages. The protocol also provides a form of time-stamp-based concurrency control [KS86, NS94] to track and prevent 'lost updates'.
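
To make the transformer and protocol roles concrete, the following is a minimal sketch of one plausible request shape and a transformer's dispatch loop, including the time-stamp check that guards against lost updates. All type and function names here (DhtRequest, dht_receive, repo_*) are assumptions for illustration, not the actual DHT or repository interfaces.

```c
/* Plausible (assumed) shape of a DHT protocol request, and the
 * skeleton of a transformer dispatch loop. */
#include <stddef.h>
#include <time.h>

typedef enum { DHT_CREATE, DHT_READ, DHT_UPDATE, DHT_DELETE } DhtOp;

typedef struct {
    DhtOp  op;
    char   oid[64];     /* target object identifier */
    time_t stamp;       /* time stamp observed at the client's last read */
    size_t len;         /* length of the body that follows */
    char  *body;        /* node contents, attributes, ... */
} DhtRequest;

/* Assumed helpers: protocol I/O, and native repository operations. */
DhtRequest dht_receive(void);
void       dht_reply(int status, const char *body);
time_t     repo_stamp(const char *oid);
char      *repo_read(const char *oid);
int        repo_write(const char *oid, const char *body, size_t len);

void transformer_loop(void)
{
    for (;;) {
        DhtRequest req = dht_receive();
        switch (req.op) {
        case DHT_READ:
            dht_reply(0, repo_read(req.oid));
            break;
        case DHT_UPDATE:
            /* Reject the write if the object changed since the client
             * last read it: 'lost update' prevention. */
            if (repo_stamp(req.oid) != req.stamp)
                dht_reply(-1, "stale time stamp");
            else
                dht_reply(repo_write(req.oid, req.body, req.len), NULL);
            break;
        default:
            /* create and delete are handled analogously */
            break;
        }
    }
}
```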

Our experience has been that transformers for new repositories can be developed with modest effort (i.e. hours to days), based on reusable server templates that are augmented with code to interact with specific repositories.

2.2 Data model

The DHT data model consists of three types of primitive objects: nodes, which represent content objects such as program modules or project documents; links, which model relationships among nodes; and contexts, which enumerate sets of links to allow specification of object compositions as sub-graphs. Nodes, links and contexts are all first class objects, each having a type, attributes and a unique object identifier (OID). In addition, links have anchors, which specify regions or sub-components within node contents to which the endpoints of a link are attached.

Contexts enumerate, but do not actually contain, links. Thus, a given link can be a member of several contexts, making it possible to compose different views of the same objects by imposing different structures or configurations as described by links among those objects. Contexts are also first class objects, and so may serve as the endpoints of links.
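
To make the three primitives concrete, here is a minimal C sketch; the field names, types and OID representation are assumptions for illustration, not the published DHT definitions.

```c
/* Minimal sketch of the DHT primitives; all names assumed. */
#include <stddef.h>

typedef char Oid[64];            /* unique object identifier, as a string */

typedef struct {
    Oid         oid;
    const char *type;            /* nodes, links and contexts all carry types... */
    const char *attributes;      /* ...and attributes, being first class objects */
    void       *contents;        /* text, graphics, binary data, audio, video */
} Node;

typedef struct {
    Oid         oid;             /* links are first class objects too */
    const char *type;            /* e.g. "include" */
    Oid         from, to;        /* endpoints: nodes or contexts */
    const char *from_anchor;     /* region or sub-component within the endpoint */
    const char *to_anchor;
} Link;

typedef struct {
    Oid         oid;             /* contexts may themselves be link endpoints */
    const char *type;
    Oid        *links;           /* enumerated (not contained) member links */
    size_t      nlinks;
} Context;
```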

A fixed set of operations can be applied to DHT objects: create, delete, read and update. The owners or administrators of a given repository can elect to provide any subset of these operations (e.g. segmented by user group, network location, or type of client), as appropriate for the level of access they intend to offer. In addition, every operation is performed by a single repository on its own objects; cooperation among repositories is not required. The DHT model therefore preserves a high degree of local repository autonomy.
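
The fixed operation set could be pictured as the following interface; the function names and signatures are illustrative assumptions only, and a repository may export any subset of them.

```c
/* Illustrative prototypes for the fixed DHT operation set; every call
 * is served by a single repository on its own objects. (Names assumed.) */
#include <stddef.h>

int DhtCreate(const char *type, const void *contents, size_t len,
              char oid_out[64]);                     /* returns the new OID */
int DhtRead(const char *oid, void *buf, size_t buflen);
int DhtUpdate(const char *oid, const void *contents, size_t len);
int DhtDelete(const char *oid);
```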

3 Tool integration

Whether artifacts are stored in a real or virtual repository, software developers create, manipulate and configure shared artifacts using software tools and environments. Many of these objects will exist before the virtual enterprise is formed, and thus before integration by DHT. It is impractical to expect developers and organizations to abandon their favorite tools for new tools that can access a DHT corpus. Therefore, DHT includes a strategy for migrating existing and new tools to the DHT environment, and a configurable cache mechanism that enables alternative approaches to accessing, and to controlling concurrent updates to, collaborative information spaces.

The migration strategy specifies five levels of integration:

* Level 0. At the foundation level, DHT provides a process integration capability [cf. MS92] that enables the configuration (via incremental modeling) and binding of individual developers to development roles, process tasks, and product components to appropriate (client-side) tools. During process prototyping, the choice of tool(s) may be unspecified (no tool) or specified by a class name or similar placeholder (tool stub or bitmap), which enables process walkthrough or simulation. [S96, SM97] To support process enactment, executable tools must be bound to corresponding task steps in order to be invoked on the specified product component.

* Level 1. At this level tools are not integrated at all. They exist unmodified, or as 'helper applications', and require auxiliary tools (e.g. Web browsers) to interact with DHT on their behalf. Auxiliary tools simply perform node retrieval and update, and link resolution, to and from a tool's standard input/output, or files in the local file system. A Web form-based interface to an existing relational database management system would be an example.

* Level 2. Integration at this level treats DHT nodes as file-like objects. Tools use file system calls like open(), read(), write(), etc., to access a node's contents, passing a string representation of the node's OID rather than a file pathname. Level 2 integration can be accomplished without recompiling or modifying source code; simply relinking the tool with a DHT compatibility library (described below) is all that is required, as the sketch after this list illustrates. Note, however, that Level 2 tools do not have knowledge of DHT links.

* Level 3. At this level a tool is aware of links as relationships among objects, and can follow them. This awareness does not appear at the user interface. An example of a level 3 tool is a document compiler that resolves links of type 'include' to incorporate text from other nodes into a source node.

* Level 4. Finally, at this level a tool integrates hypertext browsing and linking into its user interface. This may require extensive modification to the tool's source code. Fortunately, many tools and environments incorporate extension languages or escapes to external programs that can be used to implement linking without recompilation; for example, this technique was used to implement the DHT editor using GNU Emacs Lisp. Process modeling and enactment are supported at Level 0, as described later, and can be used together with any other level of tool integration.
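
As an illustration of Level 2 (the sketch promised above), a relinked tool's I/O code stays exactly as written; only the string passed where a pathname would go changes. The "dht:" OID prefix below is an illustrative assumption, not the actual encoding.

```c
/* Tool-side view of Level 2 integration: the tool's standard I/O code
 * is unchanged; a string-encoded OID may appear wherever a pathname
 * would. */
#include <fcntl.h>
#include <unistd.h>

void dump_contents(const char *name_or_oid)  /* "notes.txt" or "dht:5f3a..." */
{
    char buf[4096];
    ssize_t n;
    int fd = open(name_or_oid, O_RDONLY);    /* intercepted by the emulation library */
    if (fd < 0)
        return;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        write(STDOUT_FILENO, buf, (size_t)n);
    close(fd);
}
```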

Levels 0 and 1 provide 'facade-level' integration of tools at the user interface. Levels 2 to 4 provide increasing scope for data integration. Control integration of the kind represented by a software bus or similar message/event broadcast mechanism is not provided, however. As Reiss [R96, p. 405] observes, control integration forces tools to share a common reference framework, typically a file name and line number; since that framework is file-based, the Level 2 file system emulation scheme could be used to support compatibility with such a control integration mechanism. The following sub-sections expand on the role of DHT's file system emulation scheme and object caching framework.

A vast legacy of software development tools and environments use the file system as their repository. These applications read and write objects as files through the file system interface, typically by calling the standard I/O library routines supplied for the tool's implementation language. Our goal of providing a reasonable-cost implementation strategy precludes requiring that all of these tools be modified to use the DHT tool/application interface in place of the file system library.

To solve this problem, the DHT architecture exploits the file-like nature of DHT atomic nodes to provide a file system emulation library. This library intercepts file system calls and converts them to DHT access operations when strings encoding DHT object identifiers are passed as the pathname argument. The Unix version of this library, for example, provides entry points mirroring the standard I/O calls (open(), read(), write(), and so on).

To enable a tool for DHT access, one simply re-links it with the emulation library. Thus, the tool will continue to function as before when invoked with real file names, yet will retrieve contents from the DHT object cache (described below) when DHT object identifiers are used.
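
A minimal sketch of how such an interception might look on Unix follows. Here dht_is_oid() and dht_cache_fetch() are invented names standing in for the library's internals, and dlsym-based interposition is shown where the real library relies on static relinking.

```c
/* Sketch of the emulation library's open() entry point (Unix);
 * helper names are assumed, not the actual DHT library API. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>

int         dht_is_oid(const char *path);      /* recognize an OID string */
const char *dht_cache_fetch(const char *oid);  /* cache node, return local path */

int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...);
    mode_t mode = 0;

    if (!real_open)    /* resolve the underlying libc open() once */
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    if (flags & O_CREAT) {               /* open() takes a mode when creating */
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);  /* mode_t is promoted to int */
        va_end(ap);
    }

    if (dht_is_oid(path))                /* OID passed where a pathname goes */
        path = dht_cache_fetch(path);    /* redirect to the cached copy */

    return real_open(path, flags, mode);
}
```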

3.1 Object caching

Many of the software development artifacts that DHT manages change slowly, while others see frequent access over a short period of time. To facilitate collaborative information sharing, to reduce access latency, and to lighten transformer loads, we have found it desirable to cache frequently used objects, especially those from repositories accessed over the Internet.

A cache layer is built into the basic request interface to provide transparent node and link caching. The cache is maintained in the local file system; node contents are cached in separate files to support the file system emulation library discussed above, while links and node attributes are cached in a hash table. Clients call a set of object access functions to retrieve objects through the cache layer.

Each DHT object has a 'time-to-live' attribute that specifies the length of time an object in the cache is expected to be valid. The cache layer uses this attribute, set by the transformer when the object is retrieved, to invalidate objects in the cache upon access. An administrative function periodically sweeps through the entire cache to remove objects whose time-to-live has expired.

The time-to-live attribute is not a guarantee of validity, however. Certain shared objects may be updated frequently by multiple clients. To allow such clients to verify that requested objects have not been modified by another client, the cache layer can be configured with different cache policies to support specific tool/application needs:

* Never use the cached copy; always retrieve an object from the repository.
* Use the cached copy if its time-to-live has not expired.
* Use the cached copy if it has not been modified; verify this by retrieving the object's time stamp from the repository.
* Always use the cached copy if present, regardless of its time-to-live or time stamp.
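
The four policies might be captured as follows; the enum, entry layout, and checking function are a sketch with assumed names, not the actual cache interface.

```c
/* Sketch of the four cache policies (names and fields assumed). */
#include <time.h>

typedef enum {
    CACHE_NEVER,    /* always retrieve from the repository */
    CACHE_TTL,      /* use cached copy while its time-to-live holds */
    CACHE_VERIFY,   /* use cached copy only if the repository stamp matches */
    CACHE_ALWAYS    /* use cached copy unconditionally */
} CachePolicy;

typedef struct {
    char   oid[64];
    time_t fetched_at;  /* when the cached copy was retrieved */
    time_t ttl;         /* time-to-live, set by the transformer */
    time_t stamp;       /* repository time stamp at retrieval */
} CacheEntry;

time_t repo_stamp(const char *oid);  /* assumed: fetch the current stamp only */

int cache_usable(const CacheEntry *e, CachePolicy p)
{
    switch (p) {
    case CACHE_NEVER:  return 0;
    case CACHE_TTL:    return time(NULL) < e->fetched_at + e->ttl;
    case CACHE_VERIFY: return repo_stamp(e->oid) == e->stamp;
    case CACHE_ALWAYS: return 1;
    }
    return 0;
}
```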

The cache interface layer does not automatically write updates through to the repository. Instead, a separate function DhtSync() causes the cache to send an update request to synchronize the cached copy with that in the repository. This enables DHT-integrated tools to tailor cache access to different policies for concurrent object access and update, which is especially important when dealing with legacy software development repositories that impose different user workspace models. When using DHT, therefore, we need not endorse a particular workspace model as 'best in all circumstances', and can avoid or mitigate some of the costs of transitioning to a different workspace model.



As indicated in Figure 3, by delaying synchronization and specifying the non-validating cache policy, the cache can be used as a 'local workspace'. Objects, once placed into the cache, are read and updated locally, and thus are not affected by updates from other developers. A 'sweep' application periodically synchronizes the cached copies, possibly invoking tools that will merge objects that have changed in the interim.

Alternatively, updates can be written through immediately, by calling DhtSync() after each update operation. [cf. BHP92] This, coupled with the verifying cache policy, can be used to implement a 'shared workspace' policy for development (Figure 3), in which each developer sees updates from other developers upon object access.
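
The two disciplines differ only in when DhtSync() is called. In this sketch, dht_update() and the DhtObject type are assumed names; DhtSync() is the synchronization call described above.

```c
/* Shared vs. local workspace, differing only in when DhtSync() runs. */
typedef struct { char oid[64]; } DhtObject;

void dht_update(DhtObject *obj, const char *contents); /* cached copy only */
int  DhtSync(const char *oid);                         /* push update to repository */

/* 'Shared workspace': write through after every update (verifying policy). */
void edit_shared(DhtObject *obj, const char *contents)
{
    dht_update(obj, contents);
    DhtSync(obj->oid);          /* other developers see the change on next access */
}

/* 'Local workspace': defer synchronization to the periodic 'sweep'. */
void edit_local(DhtObject *obj, const char *contents)
{
    dht_update(obj, contents);  /* unaffected by, and invisible to, other updates */
    /* DhtSync(obj->oid) happens later, in the sweep application */
}
```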

To simulate an 'RCS-style' of version-controlled development, in which developers obtain a transaction, or exclusive write access, to an object through locking, a lock attribute must be added to objects. The lock is bound to the user-ID of the developer who seeks to control the object. The DHT concurrency mechanism ensures that only one developer can set the lock, which is cleared when the object is 'released'. However, applications must cooperate by not modifying objects unless they have successfully set the lock attribute; there is no way to enforce the lock by denying updates if someone insists on updating an object. This policy can be coupled with either the validating or the non-validating cache policy, depending on the preferences of the developer.
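
An advisory lock along these lines might look as follows. The attribute accessors are assumed names, and the conflict detection rides on the time-stamp mechanism described in Section 2.1.

```c
/* Advisory 'RCS-style' lock via a lock attribute (names assumed).
 * If two developers race to set the attribute, the time-stamp-based
 * concurrency control lets only one DhtSync() succeed. */
int         DhtSync(const char *oid);
const char *dht_get_attr(const char *oid, const char *name);
void        dht_set_attr(const char *oid, const char *name, const char *val);

int dht_try_lock(const char *oid, const char *user_id)
{
    if (dht_get_attr(oid, "lock") != NULL)
        return 0;                        /* someone else holds the lock */
    dht_set_attr(oid, "lock", user_id);  /* cached update */
    return DhtSync(oid) == 0;            /* fails if another stamp won the race */
}

/* Cooperating tools call dht_try_lock() before updating, and clear the
 * attribute to 'release'; nothing prevents an uncooperative update. */
```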

Taken together, the multi-level scheme for integrating new and legacy tools, and the support for different policies for configuring object-sharing workspaces, provide DHT with the ability to configure and coordinate collaborative workspaces within a project. These workspaces can then be accessed and updated concurrently using tools familiar to distributed developers, even though the individual tools and object repositories may either lack such support or implement different policies for sharing access and synchronizing updates. Nonetheless, the challenge remains of how best to support cache consistency in light of the need to maintain repository autonomy.