GT 5.0.1 Release Notes: GRAM5


1. Component Overview

The Grid Resource Allocation and Management (GRAM5) component is used to locate, submit, monitor, and cancel jobs on Grid computing resources. GRAM5 is not a job scheduler, but rather a set of services and clients for communicating with a range of different batch/cluster job schedulers using a common protocol. GRAM5 is meant to address a range of jobs where reliable operation, stateful monitoring, credential management, and file staging are important.

2. Feature summary

New Features since 5.0.0

  • Improved support for Condor-G client
  • Fixed memory leaks and improved scalability when restarting many jobs

Other Standard Supported Features

  • Remote job execution and management
  • Uniform and flexible interface to local resource managers
  • File staging before and after job execution
  • File and directory clean up after job termination
  • Service auditing for each submitted job

Removed Features

  • The GRAM5 client tools have dropped support for the DUROC API for task coallocation
  • The GRAM5 service no longer streams output and error during job execution; instead, this data is sent after the job terminates
  • The GRAM5 service no longer provides intra-job communication via the DUCT API
  • GRAM5 does not rely on XML schemas and WSDL service definitions

3. Summary of Changes in GRAM5

  • New API function globus_gram_client_register_get_jobmanager_version()
  • Many bug and memory leak fixes
  • Improved reliability with Condor-G clients

4. Bug Fixes

  • GRAM-93: SGE LRM script doesn't handle environment vars with whitespace
  • GRAM-94: SGE LRM doesn't check for executable existence
  • GRAM-95: SGE LRM doesn't check for executable permissions
  • GRAM-96: SGE LRM mishandles invalid environment definition
  • GRAM-103: Ease two phase end commit timeout
  • GRAM-106: Fix test failures with SGE LRM adapter
  • GRAM-108: Fork perl zombies
  • GRAM-112: Test case "globus-gram-client-stdio-size-test" fails but reports success
  • GRAM-113: GRAM5 job manager sends extra state callbacks on restart
  • GRAM-114: globus-gram-protocol-io-test crash with bad creds
  • GRAM-115: globus_gram_protocol_unpack_status_reply crashes if given a null parameter to parse
  • GRAM-116: job_info struct leaked in gram client
  • GRAM-117: Add non blocking form of globus_gram_client_get_job_manager_version()
  • GRAM-120: RSL too large log message has bad format
  • GRAM-128: When restarting a crashed job manager, memory use can be high
  • GRAM-134: globus-gatekeeper does not honor GLOBUS_HOSTNAME
  • GRAM-137: Bad name for syslog in gatekeeper
  • GRAM-141: Jobmanager version command does not delegate
  • GRAM-142: Handling of extensions hashtable for jobmanager version command inconsistent
  • GRAM-143: GRAM5 doesn't send STAGE_OUT callback
  • GRAM-144: GRAM5 logging doesn't escape \r
  • GRAM-145: GRAM5 Job Manager fails to save SEG timestamps in job state files
  • Bug 6686: Condor SGE doesn't work well with GRAM2
  • Bug 6977: Bad name for syslog in gatekeeper

5. Known Problems

The following problems and limitations are known to exist for GRAM5 at the time of the 5.0.1 release:

5.1. Limitations

  • None at this time.

5.2. Outstanding bugs

  • GRAM-105: Held Condor jobs should be reported as SUSPENDED
  • GRAM-136: Error message not precise when disk quota is exceeded
  • GRAM-138: GRAM5 job manager uses a lot of memory when SEG is pointed to incorrect log path
  • GRAM-139: SEG may deadlock with threads
  • GRAM-146: GRAM5 Usage stats ignores GLOBUS_USAGE_TARGETS environment
  • Bug 5621: gram2 credential refresh problems in 4.0.5
  • Bug 1934: Gatekeeper's syslog output cannot be controlled
  • Bug 2739: Gatekeeper AuthZ/Gridmap Callout result logging
  • Bug 2741: catching SIGSEGV if dynamic loading of authorization modules fails
  • Bug 4199: Patch pre-WS GRAM to use individual condor logs for jobs
  • Bug 3795: jobmanager perl modules issues
  • Bug 4235: globus-job-manager doesn't exit if the job fails.
  • Bug 4730: MPI Jobs using Globus LSF in HP XC Cluster....
  • Bug 4747: Need evaluation of patch to JobManager.pm
  • Bug 4779: gram GT2 log files: timestamps are not ISO 8601 compatible
  • Bug 5143: DONE state never reported for Condor jobs when using Condor-G grid monitor
  • Bug 5429: stdin is lost when jobtype=multiple with jobmanager-lsf
  • Bug 5554: GRAM2 4.0.5 setup-globus-job-manager-fork.pl silent failure
  • Bug 5556: Audit directory setup instructions are insecure
  • Bug 5775: gram status of old jobs incorrect on some lsf systems
  • Bug 6184: pbs.pm jobmanager fails jobs on qstat failure
  • Bug 6337: Cannot configure globus to use different certificate path than default
  • Bug 6703: PBS scheduler adapter assumes that Globus is installed in the same location on the headnode of a cluster and on the work nodes.
  • Bug 6768: Held Condor jobs should be reported as SUSPENDED by GRAM
  • Bug 6815: Support standard install locations for globus-gram-protocol
  • Bug 6819: Missing metatdata in globus-scheduler-event-generator
  • Bug 6820: Support standard install locations for globus-gatekeeper
  • Bug 6821: Support standard install locations for globus-gatekeeper-setup
  • Bug 6822: Support standard install locations for globus-gram-job-manager-scripts
  • Bug 6823: Support standard install locations for globus-gram-job-manager
  • Bug 6824: Support standard install locations for globus-gram-job-manager-setup
  • Bug 6825: Remove hardcoded paths in globus-gram-job-manager-setup-fork
  • Bug 6826: Remove hardcoded paths in globus-gram-job-manager-setup-condor
  • Bug 6840: The PBS job manager doesn't handle large environments well
  • Bug 6855: Undefined variable in Makefiles
  • Bug 6862: PBS job manager fails if job history is enabled
  • Bug 6927: A Loadleveler LRM for GRAM5 should be very welcome
  • Bug 720: allow gram client to detect the version of a gram server
  • Bug 851: Add "cleanup" RSL attribute for cleaning up a job submission
  • Bug 5536: Missing dependency in package globus_gram_job_manager_auditing
  • Bug 5537: Missing dependency in package globus_gram_job_manager_auditing
  • Bug 3373: globus removes the temporary job directory before pbs writes back into it
  • Bug 5200: GRAM (pre-webservices) from OSG 0.6.0 (VDT 1.6.1) has bad syslog format
  • Bug 5207: GRAM SoftEnv extension bug
  • Bug 5250: Does not support mpi jobtype of RSL script
  • Bug 5272: Invalid parsing of RSL file

6. Technology dependencies

GRAM depends on the following GT components:

  • Globus Common
  • GSI C
  • GridFTP server

7. Tested platforms

Tested platforms for GRAM5:

  • Linux

    • CentOS 5.3 x86_64
    • Debian 4.0 x86_64

  • Mac OS X

    • Mac OS X 10.5.8

8. Backward compatibility summary

Protocol changes in GRAM since the GT 5.0.0 series:

  • The GRAM5 service uses a superset of the GRAM2 protocol for communication between the client and service. The extensions supported in GRAM5 are implemented in such a way that they are ignored by GRAM2 services or clients. These extensions provide improved error messages and version detection.
  • GRAM5 does not support task coallocation using DUROC and its related protocols. Jobs submitted using DUROC directives will fail.
  • GRAM5 does not support file streaming. The standard output and standard error streams are sent after the job completes instead of during execution.
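
The "superset" compatibility described in the first item above can be illustrated with a minimal sketch. It is not the GRAM implementation: it assumes a simplified message body made of "name: value" lines, and the attribute names (including the "toolkit-version" extension) and the parse_message helper are hypothetical, chosen only to show how a peer that skips attributes it does not recognize keeps working when a newer peer adds extensions.

```python
# Illustrative sketch only: attribute names and message layout are assumptions,
# not taken from the GRAM protocol specification.

def parse_message(body, known):
    """Split 'name: value' lines into recognized attributes and ignored extensions."""
    attrs = {}
    ignored = {}
    for line in body.split("\r\n"):
        if not line:
            continue  # skip the empty trailing segment
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'name: value' line; skip it
        name = name.strip()
        value = value.strip()
        if name in known:
            attrs[name] = value
        else:
            # An older peer does not fail on an unknown attribute; it skips it.
            ignored[name] = value
    return attrs, ignored


# Hypothetical attribute sets for an older and a newer peer.
OLD_KNOWN = {"protocol-version", "status", "failure-code"}
NEW_KNOWN = OLD_KNOWN | {"toolkit-version"}

# A reply from a newer service that carries one extension attribute.
reply = "protocol-version: 2\r\nstatus: 4\r\nfailure-code: 0\r\ntoolkit-version: 5.0.1\r\n"

old_attrs, old_ignored = parse_message(reply, OLD_KNOWN)
new_attrs, new_ignored = parse_message(reply, NEW_KNOWN)
```

In this sketch the older peer still recovers the three attributes it understands and sets the extension aside, while the newer peer can read the extension for version detection.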

9. Associated Standards

None

10. For More Information

See GRAM5 for more information about this component.

Glossary

J

job scheduler

See the term scheduler.