This information is for a release that is no longer supported by the Globus Toolkit. The currently supported versions of the Globus Toolkit are 4.2 (recommended) and 4.0.
WS GRAM: Developer's Guide
Overview
GRAM slides [ html ] [ pdf ]
API
Architecture
>Fault Tolerance Architecture
RSL Schema
MJS Fault Types
Samples
Scheduler tutorial
Troubleshooting
Fault Tolerance Architecture
WS GRAM fault tolerance is managed by a coordinated set of serviceContainer handlers and background sweeper tasks in both the MHE and UHE. Below is a diagram with descriptions of how it is accomplished.
For more information on GRID service persistence in the OGSA core framework, refer to the Java Programming Guide.
UHERestartHandler in the MHE
The UHERestartHandler is containerHandler in the Master Host Environment (MHE) that is installed when the MasterManagedJobFactoryService is installed. It is configured to run once when the MHE is started. The UHERestartHandler attempts to restart any User Host Environments (UHEs) that were previously started by the MHE. For each UHE entry in the jobManPortMapping, it uses the startLocalHostingEnv java method in Setuid.java to start the UHE. This is the same method used by the MMJFS to start a UHE.
UHESweeperTask in the MHE
UHESweeperTask is a background task running in the MHE. It is configured to run every two hours. It is responsible for restarting "crashed" UHEs and updating the jobManPortMapping file when a UHE is shutting itself down due to inactivity.
To determine if a UHE has crashed, for each entry in the jobManPortMapping, the UHESweeperTask pings the UHE's admin service to determine if it is still alive. If the ping fails, the UHESweeperTask restarts the UHE on the same port. If the restart fails, no further attempts are made until the sweeper task is run again.
To identify UHEs that are shutting down, the UHESweeperTask does a findServiceData call for UHEActivity SDE in the UHE's Admin Service. If the SDE is "Shutdown" then the jobManPortMapping file is purged of the entry for that UHE.
SimpleTimestampHandler in the UHE
This logs a timestamp everytime a message comes into the UHE. This handler gets installed in the UHE at creation time.
UHEActivityTask in the UHE
This is a UHE background sweeper task. It monitors the activity of UHE and sets the UHEActivity SDE in AdminService accordingly. The UHEActivityTask SDE can have two values, "Active" or "Shutdown". Active is set when the UHE has had a recent message or an active service instance. Shutdown is set when the UHE has no active service instances and no messages have been received in the past 2 hours. The UHE is actually shutdown the *next* time the UHEActivityTask is run.
MHE services
The persistent service in the MHE are restarted after the UHERestartHandler is run. Typically, this includes one or more MMJFSs and a RIPS.
UHE services
The persistent services in the UHE are MJFSs, MJS instances, FileStreamFactoryServices, and FileStreamService instances. Once a MJS instance has been recovered, it resumes monitoring of the User's job.
Testing WS GRAM Fault Tolerance
