Table of Contents
- 1. Introduction
- 2. Before you begin
- 3. Architecture and design overview
- 4. Public interface
- 5. Usage scenarios
- 6. Tutorials
- 7. Debugging
- 8. Troubleshooting
- 9. Related Documentation
- 10. Internal Components
This guide is intended to help a developer create compatible WS GRAM clients and alternate service implementations.
The key concepts of the GRAM component have not changed. Its purpose is still to provide the mechanisms to execute remote applications on behalf of a user. Given an RSL (Resource Specification Language) job description, GRAM submits the job to a scheduling system such as PBS or Condor, or to a simple fork-based process starter, and monitors it until completion.
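In GT 4.0 the RSL job description is expressed as an XML document. The following is a minimal sketch, with element names taken from the GT 4.0 job description schema as commonly used; the paths are placeholders:

```xml
<job>
    <executable>/bin/echo</executable>
    <argument>Hello</argument>
    <stdout>${GLOBUS_USER_HOME}/stdout</stdout>
    <stderr>${GLOBUS_USER_HOME}/stderr</stderr>
</job>
```

A description like this is what the client submits to the job factory service for execution by the configured scheduler.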
New features since GT 3.2
- Support for MPICH-G2 jobs:
  - multi-job submission capabilities
  - the ability to coordinate processes within a job
  - the ability to coordinate subjobs within a multi-job
- Publishing of the job's exit code
- The ability to select the account under which the remote job will run. If a user's grid credential is mapped to multiple local accounts, the user can specify in the RSL which account the job should run under.
- An optional, client-specified hold on a job state, released with the new "release" operation.
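The two RSL additions above can be sketched as a job description fragment. The `localUserId` and `holdState` element names here are assumptions about the GT 4.0 job description schema, not excerpts from this guide:

```xml
<job>
    <executable>/bin/hostname</executable>
    <!-- run under a specific local account (assumed element: localUserId) -->
    <localUserId>projectacct</localUserId>
    <!-- hold the job at the named state until a "release" operation (assumed element: holdState) -->
    <holdState>StageIn</holdState>
</job>
```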
Other Supported Features
- Remote job execution and management
- Uniform and flexible interface to batch scheduling systems
- File staging before and after job execution
- File / directory clean up after job execution (after file stage out)
- managed-job-globusrun has been replaced by globusrun-ws.
- Service-managed streaming of the job's stdout and stderr
- File staging using the GASS protocol
- File caching of staged files (e.g., via the GASS cache)
Tested platforms for WS GRAM:
- Fedora Core 1 i686
- Fedora Core 3 i686
- Fedora Core 3 Xeon
- RedHat 7.3 i686
- RedHat 9 x86
- Debian Sarge x86
- Debian 3.1 i686
Tested containers for WS GRAM:
- Java WS Core container
- Tomcat 4.1.31
Protocol changes since GT version 3.2:
- The protocol has been changed to be WSRF compliant. There is no backward compatibility between this version and any previous versions.
GRAM depends on the following GT components:
- Java WS Core
- Transport-Level Security
- Delegation Service
- MDS - internal libraries
GRAM depends on the following third-party software. Each dependency applies only when the corresponding batch scheduler is configured, which is what makes job submission to that scheduling system possible:
Scheduler adapters are included in the GT 4.0.x releases for these schedulers:
Other scheduler adapters available for GT 4.0.x releases:
The GRAM services in GT 4.0 are WSRF compliant. One of the key concepts in the WSRF specifications is the decoupling of a service from the public "state" of that service via the implied resource pattern. Following this concept, the data of GT 4.0 GRAM jobs is published as part of WSRF resources, while a single service starts jobs and queries and monitors their state. This differs from the OGSI model of GT3, where each job was represented as a separate service. There is still a job factory service that can be called to create job instances (represented as WSRF resources). Each scheduling system that GRAM interfaces with is represented as a separate factory resource. By calling the factory service while associating the call with the appropriate factory resource, a client can create a job resource that maps to a job in the chosen scheduling system.
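Concretely, a client addresses a factory by combining the factory service's address with a resource key naming the local scheduling system. A hypothetical endpoint reference might look like the following; the service path and the key element's name and namespace are assumptions for illustration, not excerpts from this guide:

```xml
<wsa:EndpointReference
    xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
  <wsa:Address>
    https://host.example.org:8443/wsrf/services/ManagedJobFactoryService
  </wsa:Address>
  <wsa:ReferenceProperties>
    <!-- hypothetical factory resource key selecting the PBS factory resource -->
    <ResourceID xmlns="http://www.globus.org/namespaces/gram">PBS</ResourceID>
  </wsa:ReferenceProperties>
</wsa:EndpointReference>
```

Submitting the same job description against an EPR whose key names "Fork" instead of "PBS" would create the job under the fork-based starter rather than the batch scheduler.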
The Managed Executable Job Service (MEJS) relies on a state machine to handle state transitions. There are two sets of states: external and internal. The external states are those that the user gets in notifications and can be queried as a resource property. The internal states are those that are strictly used by the state machine to step through all the necessary internal tasks that need to be performed for a particular job.
The Managed Multi-Job Service (MMJS) does not rely on a state machine; instead, it determines its external state from the notifications it receives from its sub-jobs. The external states for the MMJS are identical to those used by the MEJS.
Here is a diagram illustrating the internal state transitions of the Managed Executable Job Service and how the external states are triggered within this progression: Managed Executable Job Service Internal State Transition Diagram.
The semantics and syntax of the APIs and WSDL for the component, along with descriptions of domain-specific structured interface data, can be found in GT 4.0 Component Guide to Public Interfaces: WS GRAM.
The following is a general scenario for submitting a job using the Java stubs and APIs. Please consult the Java WS Core API, Delegation API, Reliable File Transfer API, and WS-GRAM API documentation for details on package names for classes referenced in the code excerpts.
The following is a general scenario for submitting a job using the C stubs and APIs. Please consult the C WS Core API, WS-GRAM API documentation for details on package names for classes referenced in the code excerpts.
The following tutorials are available for WS GRAM developers:
For starters, consult the Debugging section of the Java WS Core Developer's Guide for details about what files to edit and other general log4j configuration information.
To turn on debug logging for a specific part of GRAM, add a log4j category entry for the corresponding package or class to the container-log4j.properties file. Entries can be added for the Managed Executable Job Service (MEJS), the delegated proxy management code, the Managed Multi-Job Service (MMJS), the Managed Job Factory Service (MJFS), or for all GRAM code at once.
Follow the pattern to turn on logging for other specific packages or classes.
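For example, assuming the GRAM services live under the org.globus.exec package prefix used in the performance-logging entries shown later in this section (the sub-package names below the top-level one are assumptions), entries of the following form turn on debug logging:

```properties
# all GRAM code
log4j.category.org.globus.exec=DEBUG
# MEJS only (sub-package name assumed from the performance entries)
log4j.category.org.globus.exec.service.exec=DEBUG
# MJFS only (sub-package name assumed)
log4j.category.org.globus.exec.service.factory=DEBUG
```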
Both the service and Java client API code contain special debugging statements which output certain timing data to help in determining performance bottlenecks.
The service code uses the PerformanceLog class to output the timings information. To turn on service timings logging without triggering full debug logging for the service code, add the following lines to the container-log4j.properties file:
log4j.category.org.globus.exec.service.factory.ManagedJobFactoryService.performance=DEBUG
log4j.category.org.globus.exec.service.exec.ManagedExecutableJobResource.performance=DEBUG
log4j.category.org.globus.exec.service.exec.StateMachine.performance=DEBUG
The Java client API has not been converted to use the PerformanceLog class, so its debug statements are emitted at the INFO level to avoid requiring full debug logging. To turn on client timings logging without triggering full debug logging for the client code, add a log4j category entry for the client classes at the INFO level to the container-log4j.properties file.
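As a sketch, assuming the client classes live in an org.globus.exec.client package (an assumption; consult the WS-GRAM API documentation for the actual class names), the entry would take this form:

```properties
# hypothetical package name; substitute the actual client class or package
log4j.category.org.globus.exec.client=INFO
```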
Two parsing scripts for summarizing the service and client timings data are available in the source distribution but are not distributed in any GPT package. They are located in ws-gram/service/java/test/throughput/; one of them is parse-client-timings.pl. Each script simply takes the path of the log file that contains the timing data. The scripts work fine with log files that have other logging statements mixed in with the timing data.
It may be necessary to debug the scheduler scripts if jobs aren't being submitted correctly and either no fault or a less-than-helpful fault is generated. Ideally this should not be necessary, so if you find that you must resort to it, please file a bug report or let us know on the discuss e-mail list.
By turning on debug logging for the MEJS (see above), you should be able to search for "Perl Job Description" in the logging output to find the perl form of the job description that is sent to the scheduler scripts.
Also by turning on debug logging for the MEJS, you should be able to search for "Executing command" in the logging output to find the specific commands that are executed when the scheduler scripts are invoked from the service code. If you saved the perl job description from the previous paragraph, then you can use this to manually run these commands.
There is a Perl job description attribute named logfile, not currently supported in the XML job description, that can be used to print debugging information about the execution of the Perl scripts. The value of this attribute is the path of a file that will be created. You can add this attribute to the Perl job description file that you saved from the service debug logging before manually running the script.
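For illustration, a logfile attribute added to a saved Perl job description might look like the following; the hash layout and paths are hypothetical placeholders, so match whatever form the service actually logged:

```perl
# hypothetical fragment of a saved Perl job description;
# the logfile attribute names the debug file the scripts will create
$description = {
    executable => '/bin/sleep',
    arguments  => [ '60' ],
    logfile    => '/tmp/gram-script-debug.log',
};
```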
Beyond the above advice, you may want to edit the perl scripts themselves to print more detailed information. For more information on the location and composition of the scheduler scripts, please consult the WS-GRAM Scheduler Interface Tutorial.
When I submit a streaming or staging job, I get the following error: ERROR service.TransferWork Terminal transfer error: [Caused by: Authentication failed [Caused by: Operation unauthorized (Mechanism level: Authorization failed. Expected "/CN=host/localhost.localdomain" target but received "/O=Grid/OU=GlobusTest/OU=simpleCA-my.machine.com/CN=host/my.machine.com")
Check $GLOBUS_LOCATION/etc/gram-service/globus_gram_fs_map_config.xml to see if it uses 127.0.0.1 instead of the public hostname (in the example above, my.machine.com). Change these uses of the loopback hostname or IP to the public hostname as necessary.
Fork jobs work fine, but submitting PBS jobs with globusrun-ws hangs at "Current job state: Unsubmitted"
Make sure the log_path in $GLOBUS_LOCATION/etc/globus-pbs.conf points to locally accessible scheduler logs that are readable by the user running the container. The Scheduler Event Generator (SEG) will not work without local scheduler logs to monitor. This can also apply to other resource managers, but it is most commonly seen with PBS.
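For example, with a default OpenPBS/Torque installation the entry might look like the following; the path is an assumption, so use the directory where your pbs_server actually writes its logs:

```
log_path=/var/spool/pbs/server_logs
```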
If the SEG configuration looks sane, try running the SEG tests. They are located in $GLOBUS_LOCATION/test/globus_scheduler_event_generator_*_test/. If Fork jobs work, you only need to run the PBS test. Run each test by going to the associated directory and running ./TESTS.pl. If any tests fail, report this to the firstname.lastname@example.org mailing list.
If the SEG tests succeed, the next step is to figure out the ID assigned by PBS to the queued job. Enable GRAM debug logging by uncommenting the appropriate line in the $GLOBUS_LOCATION/container-log4j.properties configuration file. Restart the container, run a PBS job, and search the container log for a line that contains "Received local job ID" to obtain the local job ID.
Once you have the local job ID, you can find out whether the PBS status is being logged by checking the latest PBS logs pointed to by the log_path value in $GLOBUS_LOCATION/etc/globus-pbs.conf.
If the status is not being logged, check the documentation for your flavor of PBS to see if any further configuration is needed to enable job status logging. For example, PBS Pro requires a sufficient -e <bitmask> option added to the pbs_server command line to enable enough logging to satisfy the SEG.
If the correct status is being logged, try running the SEG manually to see if it is reading the log file properly. The general form of the SEG command line is as follows:
$GLOBUS_LOCATION/libexec/globus-scheduler-event-generator -s pbs -t <timestamp>
The timestamp is in seconds since the epoch and dictates how far back in the log history the SEG should scan for job status events. The command should hang after dumping some status data to stdout.
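For example, to scan the last hour of log history, the timestamp can be computed from the current time. This is a sketch; the SEG invocation itself is left commented out because it requires an installed container:

```shell
# seconds since the epoch, one hour ago
TS=$(( $(date +%s) - 3600 ))
echo "scanning scheduler logs since: $TS"
# $GLOBUS_LOCATION/libexec/globus-scheduler-event-generator -s pbs -t "$TS"
```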
If no data appears, change the timestamp to an earlier time.
If nothing ever appears, report this to the email@example.com mailing list.
If running the SEG manually succeeds, try running another job and make sure the job process actually finishes and PBS has logged the correct status before giving up and cancelling globusrun-ws. If things are still not working, report your problem and exactly what you have tried to remedy the situation to the firstname.lastname@example.org mailing list.
The job manager detected an invalid script response
When restarting the container, I get the following error: Error getting delegation resource
Most likely this is simply a case of the delegated credential expiring. Either refresh it for the affected job or destroy the job resource. For more information, see delegation command-line clients.
The user's home directory has not been determined correctly
This occurs when the administrator changed the location of the users' home directories and did not restart the GT4 container afterwards. Beginning with version 4.0.3, WS GRAM determines a user's home directory only once in the lifetime of a container (when the user submits the first job); subsequently submitted jobs use the cached home directory during job execution. Restarting the container picks up the new location.